Inexact Derivative-Free Optimization for Bilevel Learning

Variational regularization techniques are dominant in the field of mathematical imaging. A drawback of these techniques is that they are dependent on a number of parameters which have to be set by the user. A by-now common strategy to resolve this issue is to learn these parameters from data. While mathematically appealing, this strategy leads to a nested optimization problem (known as bilevel optimization) which is computationally very difficult to handle. It is common when solving the upper-level problem to assume access to exact solutions of the lower-level problem, which is practically infeasible. In this work we propose to solve these problems using inexact derivative-free optimization algorithms which never require exact lower-level problem solutions, but instead assume access to approximate solutions with controllable accuracy, which is achievable in practice. We prove global convergence and a worst-case complexity bound for our approach. We test our proposed framework on ROF denoising and learning MRI sampling patterns. Dynamically adjusting the lower-level accuracy yields learned parameters with similar reconstruction quality as high-accuracy evaluations but with dramatic reductions in computational work (up to 100 times faster in some cases).

1. Introduction.Variational regularization techniques are dominant in the field of mathematical imaging.For example, when solving a linear inverse problem Ax = y, variational regularization can be posed as the solution to min x D(Ax, y) + αR(x) .(1.1)Here the data fidelity D is usually chosen related to the assumed noise model of the data y and the regularizer R models our a-priori knowledge of the unknown solution.Many options have been proposed in the literature, see for instance [1][2][3][4][5] and references therein.An important parameter for any variational regularization technique is the regularization parameter α.While some theoretical results and heuristic choices have been proposed in there literature, see e.g.[2,6] and references therein or the L-curve criterion [7], the appropriate choice of the regularization parameter in a practical setting remains an open problem.Similarly, other parameters in (1.1) have to be chosen by the user, such as smoothing of the total variation [3], the hyperparameter for total generalized variation [8] or the sampling pattern in magnetic resonance imaging (MRI), see e.g.[9][10][11].
Instead of using heuristics for choosing all of these parameters, here we are interested in finding these from data.A by-now common strategy to learn parameters of a variational regularization model from data is bilevel learning, see e.g.[11][12][13][14][15][16][17] and references in [4] .Given labelled data (x i , y i ) i=1,...,n we find parameters θ ∈ Θ ⊂ R m by solving the upper-level The lower-level objective Φ i,θ could be of the form Φ i,θ (x) = D(Ax, y i ) + θR(x) as in (1.1) but we will not restrict ourselves to this special case.In general Φ i,θ will depend on the data y i .
While mathematically appealing, this nested optimization problem is computationally very difficult to handle since even the evaluation of the upper-level problem (1.2) requires the exact solution of the lower-level problems (1.3).This requirement is practically infeasible and common algorithms in the literature compute the lower-level solution only to some accuracy, thereby losing any theoretical performance guarantees, see e.g.[11][12][13]16].One reason for needing exact solutions is to compute the gradient of the upper-level objective using the implicit function theorem [11], which we address by using upper-level solvers which do not require gradient computations.
In this work we propose to solve these problems using inexact derivative-free optimization (DFO) algorithms which never require exact solutions to the lower-level problem while still yielding convergence guarantees.Moreover, by dynamically adjusting the accuracy we gain a significant computational speed-up compared to using a fixed accuracy for all lower-level solves.The proposed framework is tested on two problems: learning regularization parameters for ROF-denoising and learning the sampling pattern in MRI.
Aim: Use inexact computations of xi (θ) within a derivative-free upper-level solver, which makes (1.2) computationally tractable, while retaining convergence guarantees.
1.1.Derivative-free optimization.Derivative-free optimization methods-that is, optimization methods that do not require access to the derivatives of the objective (and/or constraints)-have grown in popularity in recent years, and are particularly suited to settings where the objective is computationally-expensive to evaluate and/or noisy; we refer the reader to [18,19] for background on DFO and examples of applications, and to [20] for a comprehensive survey of recent work.The use of DFO for algorithm tuning has previously been considered in a general framework [21], and in the specific case of hyperparameter tuning for neural networks in [22].
Here, we are interested in the particular setting of learning for variational methods (1.2), which has also been considered in [16] where a new DFO algorithm based on discrete gradients has been proposed.In [16] it was assumed that the lower-level problem can be solved exactly such that the bilevel problem can be reduced to a single nonconvex optimization problem.In the present work we lift this stringent assumption.
In this paper we focus on DFO methods which are adapted to nonlinear least-squares problems as analyzed in [23,24].These methods are so-called 'model-based', in that they construct a model approximating the objective at each iteration, locally minimize the model to select the next iterate, and update the model with new objective information.Our work also results include examples with up to 64 parameters.
We demonstrate that the dynamic accuracy DFO achieves comparable or better objective values than the fixed accuracy variants and final reconstructions of comparable quality.However our approach is able to achieve this with a dramatically reduced computational load, in some cases up to 100 times less work than the fixed accuracy variants.
New framework for learning MRI sampling.We introduce a new framework to learn the sampling pattern in MRI based on bilevel learning.Our idea is inspired by the image inpainting model of [32].Compared to other algorithms to learn the sampling pattern in MRI based on first-order methods [11], the proposed approach seems to be much more robust to initialisation and choice of solver for the lower-level problem.As with the denoising examples, our dynamic accuracy DFO achieves the same upper-level objective values and final reconstructions as fixed accuracy variants but with substantial reductions in computational work.
Regularization parameter choice rule with machine learning.Our numerical results suggest that the bilevel framework can learn regularization parameter choice rule which yields a convergent regularization method in the sense of [1,5], indicating for the first time that machine learning can be used to learn mathematically sound regularization methods.
1.3.Structure.In section 2 we describe problems where the lower-level model (1.1) applies and describe how to efficiently attain a given accuracy level using standard first-order methods.Then in section 3 we introduce the dynamic accuracy DFO algorithm and present our global convergence and worst-case complexity bounds.Finally, our numerical experiments are described in section 4.
1.4.Notation.Throughout, we let • we denote the Euclidean norm of a vector and the operator 2-norm of a matrix.We also define the weighted (semi)norm x 2 S := x T Sx for a symmetric and positive (semi)definite matrix S. The gradient of a scalar-valued function f : R n → R is denoted by ∇f : R n → R n , and the derivative of a vector-valued function f : R n → R m is denoted by ∂f : R n → R n×m , (∂f ) i,j = ∂ i f j where ∂ i f j denotes the partial derivative of f j with respect to the ith coordinate.If f is a function of two variables x and y, then ∂ x f denotes the derivative with respect to x.
1.5.Software.Our implementation of the DFO algorithm and all numerical testing code will be made public upon acceptance.
2. Lower-Level Problem.In order to have sufficient control over the accuracy of the solution to (1.3) we will assume that Φ i,θ are L i -smooth and µ i -convex, see definitions below.

Definition 2.1 (Smoothness). A function
Moreover, when the lower-level problem is strictly convex and smooth, with Φ i (x, θ) := Φ i,θ (x) we can equivalently describe the minimizer by Smoothness properties of xi follow from the implicit function theorem and its generalizations if Φ i is smooth and regular enough.Assumption 2.3.We assume that for all i = 1, . . ., n the following statements hold.1. Convexity: For all θ ∈ Θ the functions Φ i,θ are µ i -convex.

Smoothness in x:
and are continuous.
Proof.Ad 1) Finite and convex functions are continuous [33,Corollary 2.36].It is easy to show that µ-convex functions are coercive.Then the existence and uniqueness follows from classical theorems, e.g.[34,Theorem 6.31

Examples.
A relevant case of the model introduced above is the parameter tuning for linear inverse problems, which can be solved via the variational regularization model where TV(x) := m j=1 ∇x(j) denotes the discretized total variation, e.g.∇x(j) is the finite forward difference discretization of the spatial gradient of x at pixel j.
Using x ≈ x 2 + ν 2 , we can approximate problem (2.2) by a smooth and strongly convex problem of the form xi (θ) := arg min with the smoothed total variation given by TV ν(θ) (x) := m j=1 ∇x(j) 2 + ν(θ) 2 .Here we already introduced the notation that various parts of the problem may depend on a vector of parameters θ which usually needs to be selected manually.We will learn these parameters using the bilevel framework.For simplicity denote A θ := A(θ), S θ := S(θ), α θ := α(θ), ν θ := ν(θ) and ξ θ := ξ(θ).Note that Φ i,θ in (2.3) is L i -smooth and µ i -strongly convex with where λ min (A * θ S θ A θ ) denotes the smallest eigenvalue of A * θ S θ A θ .2.1.1.Total Variation-based Denoising.A particular problem we consider is a smoothed version of the ROF model [37], i.e.A θ = I, S θ = I.Then (2.3) simplifies to which is L i -smooth and µ i -strongly convex with constants as in (2.4) with A * θ S θ A θ = I = 1 and λ min (A * θ S θ A θ ) = λ min (I) = 1.In our numerical examples we will consider two cases.First, we will just learn the regularization parameter α given manually set ν and ξ.Second, we will learn all three parameters α, ν and ξ.
2.1.2.Undersampled MRI Reconstruction.Another problem we consider is the reconstruction from undersampled MRI data, see e.g.[38], which can be phrased as (2.3) with A θ = F where F is the discrete Fourier transform and which is L i -smooth and µ i -strongly convex with constants as in (2.4) with A * θ S θ A θ ≤ 1 and λ min (A * θ S θ A θ ) ≥ 0. The sampling coefficients s j indicate the relevance of a sampling location.The data term (2.6) can be rewritten as Most commonly the values s are binary and manually chosen.Here we aim to use bilevel learning to find a sparse s such that the images x i can be reconstructed well from sparse samples of y i .This approach was first proposed in [11].

Example Training Data.
Throughout this paper, we will consider training data of artificially-generated 1D images.Each ground truth image x i is randomly-generated piecewiseconstant function.For a desired image size N , we select values That is, each x i is zero except for a single randomly-generated subinterval of length 2R i centered around C i where it takes the value 1.
We then construct our y i by taking the signal to be reconstructed and adding Gaussian noise.Specifically, for the image denoising problem we take where σ > 0 and ω i ∈ R N is randomly-drawn vector of i.i.d.standard Gaussians.For the MRI sampling problem, we take where σ > 0 and ω i ∈ C N is a randomly-drawn vector with real and imaginary parts both standard Gaussians.
In Figure 1 we plot an example collection of 5 pairs (x i , y i ) for the image denoising problem with N = 256, and in Figure 2 we plot the solution to (2.5) for the first of these (x i , y i ) pairs for a variety of choices for the parameters α θ , θ , η θ .The lower-level problem (1.3) can be solved with gradient descent (GD) which converges linearly for L-smooth and µ-convex problems.One can show (e.g.[3]) that GD (2.11) with τ = 1/L, converges linearly to the unique solution x * of (1.3).More precisely, for all k ∈ N we have [39,Theorem 10.29] (2.12)Moreover, if one has a good estimate of the strong convexity constant µ, then it is better to choose τ = 2/(L + µ), which gives an improved linear rate [40, Theorem 2.1.15] (2.13) 2.3.2.FISTA.Similarly, we can use FISTA [41] to approximately solve the lower-level problem.FISTA applied to a smooth objective with convex constraints is a modification of [42] and can be formulated as the iteration where q := τ µ, and we choose τ = 1/L and t 0 = 0 [3, Algorithm 5].We then achieve linear convergence with [3, Theorem 4.10] and so, since Φ(x k ) − Φ(x * ) ≥ (µ/2) x k − x * 2 from µ-convexity, we get 2.3.3.Ensuring accuracy requirements.We will need to be able to solve the lower-level problem to sufficient accuracy that we can guarantee x k − x * 2 ≤ , for a suitable accuracy > 0. We can guarantee this accuracy by ensuring we terminate with k sufficiently large, given an estimate x 0 − x * 2 , using the a-priori bounds (2.12) or (2.16).A simple alternative is to use the a-posteriori bound x − x * ≤ ∇Φ(x) /µ for all x, and terminate once To compare these two options, we consider two test problems: i) a 10-dimensional version of Nesterov's quadratic [40,Section 2.1.4]and ii) 1D image denoising.Nesterov's quadratic is defined as for x ∈ R 10 , with µ = 1 and Q = 100, which is µ-strongly convex and L-smooth for µ ≈ 3 and L ≈ 98; we apply no constraints, X = R 10 .
We also consider a 1D denoising problem as in (2.5) with randomly-generated data y ∈ R N (with N = 100 pixels) as per Section 2.2, α = 0.3, ν = ξ = 10 −3 , and x * estimated by running 10 4 iterations of FISTA.Here, the problem is µ-convex and L-smooth with µ ≈ 1 and L ≈ 1, 201.We estimate the true solution x * by running FISTA for 10,000 iterations (which gives an upper bound estimate x k − x * 2 ≤ 3e−26 from (2.17)).In Figure 3, we compare the true error x k − x * 2 against the a-priori linear convergence bounds (2.12) or (2.16) with the true value of x 0 − x * 2 , and the a-posteriori gradient bound (2.17).In both cases, the gradient-based bound (2.17) provides a much tighter estimate of the error, particularly for high accuracy requirements.Thus, in our numerical results, we terminate the lower-level solver as soon (2.17) is achieved for our desired tolerance.The gradient-based bound has the additional advantage of not requiring an a priori estimate of x 0 − x * .For comparison, in our results below we will also consider terminating GD/FISTA after a fixed number of iterations.

Dynamic Accuracy DFO Algorithm.
3.1.DFO Background.Since evaluating xi (θ) in the upper-level problem (1.2) is only possible with some error (it is computed by running an iterative process), it is not straightforward or cheap to evaluate ∂ xi (θ).Hence for solving (1.2) we turn to DFO techniques, and specifically consider those which exploit the nonlinear least-squares problem structure.In this section we outline a model-based DFO method for nonlinear least-squares problems [24], a trust-region method based on the classical (derivative-based) Gauss-Newton method [43,Chapter 10].However, these approaches are based on having access to exact function evaluations, and so we augment this with a standard approach for dynamic accuracy trustregion methods [26,Chapter 10.6]; this was previously considered for general model-based DFO methods in [25].
Here, we write the upper-level problem (1.2) in the general form min where r i (θ) := xi (θ) − x i and r(θ) := [r 1 (θ), . . ., r n (θ)] T .Without loss of generality, we do not include a regularization term J (θ); we can incorporate this term by defining r n+1 (θ) := J (θ) and r(θ) := [r 1 (θ), . . ., r n+1 (θ)] T , for instance.The upper-level objective (3.1) assumes access to exact evaluations of the lower-level objective r i (θ), which is not achievable in practice.We therefore assume we only have access to inaccurate evaluations x i (θ) ≈ xi (θ), which gives us r i (θ Our overall algorithmic framework is based on trust-region methods, where at each iteration k we construct a model m k for the objective which we hope is accurate in a neighborhood of our current iterate θ k .Simultaneously we maintain a trust-region radius ∆ k > 0, which tracks the size of the neighborhood of θ k where we expect m k to be accurate.Our next iterate is determined by minimizing the model m k within a ball of size ∆ k around θ k . Usually m k is taken to be a quadratic function (e.g. a second-order Taylor series for f about θ k ).However here we use the least-squares problem structure (3.1) and construct a linear model where r(θ k ) is our approximate evaluation of r(θ k ) and J k ∈ R n×d is a matrix approximating ∂r(θ k ) T .We construct J k by interpolation: we maintain an interpolation set z 0 , . . ., z d ∈ R d (where z 0 := θ k at each iteration k) and choose J k so that These conditions ensure that the second approximation in (3.2) is exact for each z t in the interpolation set.We can therefore find J k by solving the d × d linear system (with n righthand sides): where gives a natural quadratic model for the full objective f : where We compute a tentative step s k as a(n approximate) minimizer of the trust-region subproblem min There are a variety of efficient algorithms for computing s k [26, Chapter 7].Finally, we evaluate f (θ k + s k ) and decide whether to accept or reject the step (i.e.set θ k+1 = θ k + s k or θ k+1 = θ k ) depending on the ratio Although we would like to accept/reject using ρ k , in reality we only observe the approximation and so we use this instead.
This gives us the key components of a standard trust-region algorithm.We have two extra considerations in our context: the accuracy of our derivative-free model (3.5) and the lack of exact evaluations of the objective.
Firstly, we require a procedure to verify if our model (3.5) is sufficiently accurate inside the trust-region, and if not, modify the model to ensure its accuracy.We discuss this in Section 3.2.The notion of 'sufficiently accurate' we use here is that m k is as good an approximation to f as a first-order Taylor series (up to constant factors), which we call 'fully linear'.
Secondly, we handle the inaccuracy in objective evaluations by ensuring f (θ k ) and f (θ k + s k ) are evaluated to a sufficiently high accuracy when we compute ρ k (3.8).Specifically, suppose we know that where η 1 > 0 is an algorithm parameter.We achieve this by running the lower-level solver for a sufficiently large number of iterations.
The full upper-level algorithm is given in Algorithm 3.1; it is similar to the approach in [25], the DFO method [18, Algorithm 10.1]-adapted for the least-squares problem structure-and the (derivative-based) dynamic accuracy trust-region method [26, Algorithm 10.6.1].
Our main convergence result is the below.
We summarize Theorem 3.2 as follows, noting that the iteration and evaluation counts match the standard results for model-based DFO (e.g.[24,44]).

11:
Set θ k+1 and ∆ k+1 as: If θ k+1 = θ k +s k , then build m k+1 by adding θ k+1 to the interpolation set (removing an existing point).Otherwise, set m k+1 = m k if m k is fully linear in B(θ k , ∆ k ), or form m k+1 by making m k fully linear in B(θ k+1 , ∆ k+1 ).13: end for 3.2.Guaranteeing Model Accuracy.As described above, we need a process to ensure that m k (3.5) is a fully linear model for f inside the trust region B(θ k , ∆ k ).For this, we need to consider the geometry of the interpolation set.Definition 3.4.The Lagrange polynomials of the interpolation set {z 0 , z 1 , . . ., z d } are the linear polynomials t , t = 0, . . ., d such that t (z s ) = δ s,t for all s, t = 0, . . ., d.
The Lagrange polynomials of {z 0 , . . ., z d } exist and are unique whenever the matrix in (3.4) is invertible.The required notion of 'good geometry' is given by the below definition (where small Λ indicates better geometry).Definition 3.5 (Λ-poisedness).For Λ > 0, the interpolation set {z 0 , . . ., The below result confirms that, provided our interpolation set has sufficiently good geometry, and our evaluations r(θ k ) and r(y t ) are sufficiently accurate, our interpolation models are fully linear.
Assumption 3.6.The extended level set is bounded, and r(θ) is continuously differentiable and ∂r(θ) is Lipschitz continuous with constant L J in B.
In particular, Assumption 3.6 implies that r(θ) and ∂r(θ) are uniformly bounded in the same region-that is, r(θ) ≤ r max and ∂r(θ) ≤ J max for all θ ∈ B-and Lemma 3.7.Suppose Assumption 3.6 holds and ) and for each evaluation t = 0, . . ., d and each i = 1, . . ., n we have for some c > 0, then the corresponding models M k (3.2) and m k (3.5) are fully linear models for r(θ) and f (θ) respectively.
We conclude by noting that for any Λ > 1 there are algorithms available to determine if a set is Λ-poised, and if it is not, change some interpolation points to make it so; details may be found in [18,Chapter 6], for instance.

3.3.
Lower-Level Objective Evaluations.We now consider the accuracy requirements that Algorithm 3.1 imposes on our lower-level objective evaluations.In particular, we require the ability to satisfy (3.11), which imposes requirements on the error in the calculated f , rather than the lower-level evaluations r.The connection between errors in r and f is given by the below result.Lemma 3.8.Suppose we compute x i (θ) satisfying x i (θ) − xi (θ) ≤ δ x for all i = 1, . . ., n.Then we have Proof.Letting (θ) := r(θ) − r(θ), we have and the first part of (3.18) follows since (θ) ∞ ≤ δ x from (3.17).The second part of (3.18) follows from an identical argument but writing f (θ) = 1 n r(θ)+ (θ) 2 , and the final conclusion follows immediately from the first part of (3.18).
We construct these bounds to rely mostly on f (θ), since this is the value which is observed by the algorithm (rather than the true value f (θ)).From the concavity of √ •, if f (θ) is larger then x i (θ) − xi (θ) must be smaller to achieve the same δ f .Lastly, we note the key reason why we require (3.11): it guarantees that our estimate ρ k of ρ k is not too inaccurate.

3.4.
Convergence and Worst-Case Complexity.We now prove the global convergence of Algorithm 3.1 and analyse its worst-case complexity (i.e. the number of iterations required to achieve ∇f (θ k ) ≤ for the first time).
Assumption 3.10.The computed trust-region step s k satisfies and there exists κ H ≥ 1 such that H k + 1 ≤ κ H for all k.Assumption 3.10 is standard and the condition (3.21) easy to achieve in practice [26,Chapter 6.3].
Firstly, we must show that the inner loops for the criticality and accuracy phases terminate.We begin with the criticality phase, and then consider the accuracy phase.Then the criticality phase terminates in finite time with where ∆ k init is the value of ∆ k before the criticality phase begins.Lemma 3.12 ([24, Lemma 3.7]).Suppose Assumption 3.6 holds.Then in all iterations we have We note that our presentation of the criticality phase here can be made more general by allowing g k ≥ C = as the entry test, setting ∆ k to ω i ∆ k for some ω ∈ (0, 1) possibly different to γ dec , and having an exit test ∆ k ≤ µ g k for some µ > 0. All the below results hold under these assumptions, with modifications as per [24].Lemma 3.13.If Assumptions 3.6 and 3.10 hold and ∇f (θ k ) ≥ > 0, then the accuracy phase terminates in finite time (i.e.line 10 of Algorithm 3.1 is eventually called) Proof.From Lemma 3.12 we have g k ≥ /(κ eg + 1), and the result then follows from [26,Lemma 10.6.1].
We now collect some key preliminary results required to establish complexity bounds.Lemma 3.14.Suppose Assumptions 3.6 and 3.10 hold, m k is fully linear in B(θ k , ∆ k ) and Proof.We compute From this and full linearity, we get and so ρ k ≥ η 2 + 2η 1 , hence ρ k ≥ η 2 from Lemma 3.9.Lemma 3.15.Suppose Assumptions 3.6 and 3.10 hold.Suppose ∇f (θ k ) ≥ for all k = 0, . . ., k and some ∈ (0, 1).Then, for all k ≤ k , Proof.As above, we let ∆ k init and m k init denote the values of ∆ k and m k before the criticality phase (i.e.∆ k init = ∆ k and m k init = m k if the criticality phase is not called).From Lemma 3.12, we know g k ≥ /(κ eg +1) for all k ≤ k .Suppose by contradiction k ≤ k is the first iteration such that ∆ k < ∆ min .Then from Lemma 3.11, That is, either the criticality phase is not called, or terminates with i = 0 (in this case, the model m k is formed simply by making If the accuracy phase loop occurs, we go back to the criticality phase, which can potentially happen multiple times.However, since the only change is that r(θ k ) is evaluated to higher accuracy, incorporating this information into the model m k can never destroy full linearity.Hence, after the accuracy phase, by the same reasoning as above, either one iteration of the criticality phase occurs (i.e.m k is made fully linear) or it is not called.If the accuracy phase is called multiple times and the criticality phase occurs multiple times, all times except the first have no effect (since the accuracy phase can never destroy full linearity).Thus ∆ k is unchanged by the accuracy phase.
Since ∆ min < ∆ 0 init , we have k ≥ 1.As k is the first iteration such that ∆ k < ∆ min and ∆ k = ∆ k init , we must have ∆ k init = γ dec ∆ k−1 (as this is the only other way ∆ k can be reduced).Therefore ∆ k−1 = ∆ k /γ dec < ∆ min /γ dec , and so We then have ∆ k−1 ≤ c 0 /(κ eg + 1) ≤ c 0 g k−1 , and so by Lemma 3.14 either ρ k ≥ η 2 or m k−1 is not fully linear.Either way, we set ∆ k init ≥ ∆ k−1 in (3.13).This contradicts ∆ k init = γ dec ∆ k−1 above, and we are done.We now bound the number of iterations of each type.Specifically, we suppose that k + 1 is the first k such that ∇f (θ k ) ≥ .Then, we define the sets of iterations: • S is the set of iterations k ∈ {0, . . ., k } which are 'successful'; i.e.
The iterate θ k is only changed on successful iterations (i.e.θ k+1 = θ k for all k / ∈ S ).Thus, as f (θ) ≥ 0 from the least-squares structure (3.1), we get and the result follows.
We are now in a position to prove our main results.
Proof of Theorem 3.2.To derive a contradiction, suppose that ∇f (θ k ) ≥ for all k ∈ {0, . . ., K}, and so g k ≥ /(κ eg + 1) and ∆ k ≥ ∆ min by Lemma 3.12 and Lemma 3.15 respectively.Since K ≤ k by definition of k , we will try to construct an upper bound on k .We already have an upper bound on |S | from Proposition 3. 16.
dec , and so noting we have changed ∆ min /∆ 0 < 1 to ∆ 0 /∆ min > 1 and used log γ dec < 0, so all terms in (3.39)  Proof of Corollary 3.3.Global convergence and the iteration bound follow directly from Theorem 3.2, noting that ∆ min = O(κ −2 ).For the evaluation bound, we also need to count the number of inner iterations of the criticality phase.Suppose ∇f (θ k ) < for k = 0, . . ., k .Similar to the above, we define: (a) C M to be the number of criticality phase iterations corresponding to the first iteration of i = 0 where m k whas not already fully linear, in iterations 0, . . ., k ; and (b) C U to be the number of criticality phase iterations corresponding to all other iterations i > 0 (where ∆ k is reduced and m k is made fully linear) in iterations 0, . . ., k .
From Lemma 3.15 we have ∆ k ≥ ∆ min for all k ≤ k .We note that ∆ k is reduced by a factor γ dec for every iteration of the criticality phase in C U .Thus by a more careful reasoning as we used to reach (3.39), we conclude Also, after every iteration k in which the first iteration of criticality phase makes m k fully linear, we have either a (very) successful or unsuccessful step, not a model-improving step.From the same reasoning as in Lemma 3.15, the accuracy phase can only cause at most one more step criticality phase in which m k is made fully linear, regardless of how many times it is called. 3Thus, where the second inequality follows from Proposition 3.16.
If < 1, we conclude that the number of times we make m k fully linear before ∇f (θ k ) < for the first time is the same as the number of iterations, O(κ 3 −2 ).Since each iteration requires one new objective evaluation (at θ k + s k ) and each time we make m k fully linear requires at most O(d) objective evaluations (corresponding to replacing the entire interpolation set), we get the stated evaluation complexity bound.

3.5.
Estimating the Lower-Level Work.We have from Corollary 3.3 that we can achieve ∇f (θ k ) < in O( −2 ) evaluations of r(θ).In this section, we use the fact that evaluations of r(θ) come from finitely terminating a linearly-convergent procedure (i.e.strongly convex optimization) to estimate the total work required in the lower-level problem.This is particularly relevant in an imaging context, where the lower-level problem can be large-scale and poorly-conditioned; this can be the dominant cost of Algorithm 3.1.Proposition 3.17.Suppose Assumptions 3.6 and 3.10 hold and ∇f (θ k ) ≥ for all k = 0, . . ., k and some ∈ (0, 1].Then for every objective evaluation in iterations k ≤ k it suffices to guarantee that x i (θ) − xi (θ) = O( 2) for all i = 1, . . ., n.
Proof.For all k ≤ k we have g k ≥ /(κ eg + 1) and ∆ k ≥ ∆ min by Lemma 3.12 and Lemma 3.15 respectively.There are two places where we require upper bounds on x i (θ) − xi (θ) in our objective evaluations: ensuring f (θ k ) and f (θ k + s k ) satisfy (3.11) and ensuring our model is fully linear using Lemma 3.7.
In the first case, we note that by Assumption 3.10 and using ∆ min < c 0 /(κ eg + 1) ≤ /[κ H (κ eg + 1)] by definition of ∆ min (3.28) and c 0 (3.23).Therefore to ensure (3.11) it suffices to guarantee From Lemma 3.8, specifically the second part of (3.18), this means to achieve (3.11) it suffices to guarantee (3.49) for all i = 1, . . ., n, where θ ∈ ∪ k≤k {θ k , θ k + s k }.From Assumption 3.6 we have f (θ) ≤ f max := r 2 max /n, and so from the fundamental theorem of calculus we have Corollary 3.3 and Proposition 3.17 say that to ensure ∇f (θ k ) < for some k, we have to perform O(dκ 3 −2 ) upper-level objective evaluations, each requiring accuracy at most x i (θ)− xi (θ) = O( 2 ) for all i.Since our lower-level evaluations correspond to using GD/FISTA to solve a strongly convex problem, the computational cost of each upper-level evaluation is O(n log( −2 )) provided we have reasonable initial iterates.From this, we conclude that the total computational cost before achieving ∇f (θ k ) < is at most O( −2 log( −1 )) iterations of the lower-level algorithm.However, this is a conservative approach to estimating the cost: many of the iterations correspond to ∇f (θ k ) , and so the work required for these is less.This suggests the question: can we more carefully estimate the work required at different accuracy levels to prove a lower -dependence on the total work?We now argue that this is not possible without further information about asymptotic convergence rates (e.g.local convergence theory).For simplicity we drop all constants and O(•) notation in the below.
Suppose we count the work required to achieve progressively higher accuracy levels 1 = for some desired accuracy 1.Since each i < 1, we assume that we require −2 i evaluations to achieve accuracy i , where each evaluation requires log( −1 i ) computational work.We may choose 0 < 1, since the cost to achieve accuracy 0 is fixed (i.e.independent of our desired accuracy ), so does not affect our asymptotic bounds.Counting the total lower-level problem work-which we denote W ( )-in this way, we get The second term of (3.51) corresponds to a right Riemann sum approximating Since x → log( √ x) = log(x)/2 is strictly increasing, the right Riemann sum overestimates the integral; hence independent of our choices of 1 , . . ., N −1 .That is, as → 0, we have W ( ) ∼ −2 log( −1 ), so our naïve estimate is tight.We further note that this naïve bound applies more generally.Suppose the work required for a single evaluation of the lower-level objective to accuracy is w( −2 ) ≥ 0 (e.g.w(x) = log(x)/2 above).Assuming w is increasing (i.e. higher accuracy evaluations require more work), we get, similarly to above, Since w is increasing and nonnegative, by does not increase too quickly.This holds in a variety of cases, such as w(x) bounded, concave or polynomial (but not if w(x) grows exponentially).In particular, this holds for w(x) ∼ log(x)/2 as above, and w(x) ∼ x 1/2 and w(x) ∼ x, which correspond to the work required (via standard sublinear complexity bounds) if the lower-level problem is a strongly convex, convex or nonconvex optimization problem respectively.

Numerical Results.
4.1.Upper-level solver (DFO-LS).We implement the dynamic accuracy algorithm (Algorithm 3.1) in DFO-LS [45], an open-source Python package which solves nonlinear leastsquares problems subject to bound constraints using model-based DFO. 4 As described in [45], DFO-LS has a number of modifications compared to the theoretical algorithm Algorithm 3.1.The most notable modifications here are that DFO-LS: • Allows for bound constraints (and internally scales variables so that the feasible region is [0, 1] for all variables); • Does not implement a criticality phase; • Uses a simplified model-improving step; • Maintains two trust-region radii to avoid decreasing ∆ k too quickly; • Implements a 'safety phase', which treats iterations with short steps s k ∆ k similarly to unsuccessful iterations.More discussion on DFO-LS can be found [24,45].
Here, we use DFO-LS v1.1.1,modified for the dynamic accuracy framework as described above.When determining the accuracy level for a given evaluation, we require accuracy level δ x = 10(∆ k ) 2 for all evaluations (c.f.Lemma 3.7), and also (3.11) when checking objective decrease (3.8).
4.2.Application: 1D Image Denoising.In this section, we consider the application of DFO-LS to the problem of learning the regularization and smoothing parameters for the image denoising model (2.5) as described in subsection 2.1.1.We use training data constructed using the method described in subsection 2.2 with N = 256 and σ = 0.1.
3-parameter case.We also consider the more complex problem of learning three parameters for the denoising problem (namely α, ν and ξ).We choose to penalize a large condition number of the lower-level problem, thus promotes efficient solution of the lower-level problem after training.To be precise we choose where L and µ are the smoothness and strong convexity constants given in subsection 2.1.1.The problem is solved using the parametrization α = 10 θ 1 , ν = 10 θ 2 and ξ = 10 θ 3 .Here, we use a training set of n = 20 randomly-generated images, and optimize over θ ∈ [−7, 7] × [−7, 0] 2 .Our default starting value is θ 0 = (0, −1, −1) and our default choice of upper-level regularization parameter is β = 10 −6 .
Solver settings.We run DFO-LS with a budget of 20 and 100 evaluations of the upper-level objective f for the 1-and 3-parameter cases respectively, and with ρ end = 10 −6 in both cases.
We compare the dynamic accuracy variant of DFO-LS (given by Algorithm 3.1) against two variants of DFO-LS (as originally implemented in [45]): 1. Low-accuracy evaluations: each value xi received by DFO-LS is inaccurately estimated via a fixed number of iterations of GD/FISTA; we use 1,000 iterations of GD and 200 iterations of FISTA.2. High-accuracy evaluations: each value z i received by DFO-LS is estimated using 10,000 iterations of GD or 1,000 iterations of FISTA.We estimate δ f in the plots below by taking δ r to be the maximum estimate of xi (θ) − x i for each i = 1, . . ., n.When running the lower-level solvers, our starting point is the final reconstruction from the previous upper-level evaluation, which we hope is a good estimate of the solution.
1-parameter denoising results.In Figure 4 we compare the six algorithm variants (low, high and dynamic accuracy versions of both GD and FISTA) on the 1-parameter denoising problem.Firstly in Figures 4a and 4b, we show the best upper-level objective value observed against 'computational cost', measured as the total GD/FISTA iterations performed (over all upper-level evaluations).For each variant, we plot the value f (θ) and the uncertainty range f (θ) ± δ f associated with that evaluation.In Figure 4c we show the best α θ found against the same measure of computational cost.
We see that both low-accuracy variants do not converge to the optimal θ.Both highaccuracy variants converge to the same objective value and θ, but take much more computational effort to do this.Indeed, we did not know a priori how many GD/FISTA iterations would be required to achieve convergence.By contrast, both dynamic accuracy variants find the optimal θ without any tuning.
Moreover, dynamic accuracy FISTA converges faster than high-accuracy FISTA, but the reverse is true for GD.In Figure 4d we show the cumulative number of GD/FISTA iterations performed after each evaluation of the upper-level objective.We see that the reason for dynamic accuracy GD converging slower than than high-accuracy GD is that the initial upperlevel evaluations require many GD iterations; the same behavior is seen in dynamic accuracy FISTA, but to a lesser degree.This behavior is entirely determined by our (arbitrary) choices of θ 0 and ∆ 0 .We also note that the number of GD/FISTA iterations required by the dynamic accuracy variants after the initial phase is much lower than both the fixed accuracy variants.Finally, in Figure 5 we show the reconstructions achieved using the α θ found by dynamic accuracy FISTA.All reconstructions are close to the ground truth, with a small loss of contrast.
To further understand the impact of the initial evaluations and the robustness of our framework, in Figure 4 we run the same problem with different choices θ 0 ∈ {−2, −1, 1} (where θ 0 = 0 before).In Figure 6 we show best α θ found for a given computational effort for these choices.When θ 0 > 0, the lower-level problem is starts more ill-conditioned, and so the first upper-level evaluations for the dynamic accuracy variants require more GD/FISTA iterations.However, when θ 0 < 0, we initially have a well-conditioned lower-level problem, and so the dynamic accuracy variants require many fewer GD/FISTA iterations initially, and they converge at the same or a faster rate than the high-accuracy variants.
These results also demonstrate that the dynamic accuracy variants give a final regularization parameter which is robust to the choice of θ 0 .In Figure 7 we plot the final learned α θ value compared to the initial choice of α θ for all variants.The low-accuracy variants do not Lower-level problem iterations Upper-level evalutions   reach a consistent minimizer for different starting values, but the dynamic and high-accuracy variants both reach the same minimizer for all starting points.Thus although our upper-level problem is nonconvex, we see that our dynamic accuracy approach can produce solutions which are robust to the choice of starting point.
3-parameter denoising results.Next, we consider the 3-parameter (α θ , ν θ and ξ θ ) denoising problem.As shown in Figure 8, both dynamic accuracy variants (GD and FISTA) achieve the best objective value at least one order of magnitude faster than the corresponding low-and high-accuracy variants.We note that (for instance) 200 FISTA iterations was insufficient to achieve convergence in the 1-parameter case, but converges here.By contrast, aside from the substantial speedup in the 3-parameter case, our approach converges in both cases without needing to select the computational effort in advance.
The final reconstructions achieved by the optimal parameters for dynamic accuracy FISTA are shown in Figure 9.We note that all variants produced very similar reconstructions (since they converged to similar parameter values), and that all training images are recovered with high accuracy.
Next, we consider the effect of the upper-level regularization parameter β.If the smaller β value of 10 −8 is chosen, all variants converge to slightly smaller values of ν θ and ξ θ as the original β = 10 −6 , but produce reconstructions of a similar quality.However, increasing the value of β yields parameters which give noticeably worse reconstructions.The reconstructions for β = 10 −4 are shown in Figure 10.
We conclude by demonstrating in Figure 11 that, aside from reducing our upper-level objective , the parameters found by DFO-LS do in fact progressively improve the quality of the reconstructions.The figure shows the reconstructions of one training image achieved by the best parameters found (by the dynamic accuracy FISTA variant) after a given number of upper-level objective evaluations.We see a clear improvement in the quality of the reconstruction as the upper-level optimization progresses.4.3.Application: 2D denoising.Next, we demonstrate the performance of dynamic accuracy DFO-LS on the same 3-parameter denoising problem from subsection 4.2, but applied to 2D images.Our training data are the 25 images from the Kodak dataset. 5 We select the central 256 × 256-pixel region of each image, convert to monochrome and add Gaussian noise N (0, σ 2 ) with σ = 0.1 to each pixel independently.We run DFO-LS for 200 upper-level evaluations with ρ end = 10 −6 .Unlike subsection 4.2, we find that there is no need to regularize the upper-level problem with the condition number of the lower-level problem (i.e.J (θ) = 0 for these results).
The resulting objective decrease, final parameter values and cumulative lower-level iterations are shown in Figure 12.All variants achieve the same (upper-level) objective value and parameter α θ , but the dynamic accuracy variants achieve this with substantially fewer GD/FISTA iterations compared to the low-and high-accuracy variants.Interestingly, despite all variants achieving the same upper-level objective value, they do not reach a consistent choice for ν θ and ξ θ .
In Figure 13 we show the reconstructions achieved by the dynamic accuracy FISTA variant for three of the training images.We see high-quality reconstructions in each case, where the piecewise-constant reconstructions favored by TV regularization are evident.
Lastly, we study the impact of changing the noise level σ on the calibrated total variational regularization parameter α θ .We run DFO-LS with dynamic accuracy FISTA for 200 upperlevel evaluations on the same training data, but corrupted with noise level σ ranging from 10 −1 (as above) to 10 −8 , see Figure 14.We see that as σ → 0, so does α θ and σ 2 /α θ .Note that this is a common assumption on the parameter choice rule in regularization theory to yield a convergent regularization method [1,5].It is remarkable that the learned optimal parameter also has this property.Lower-level problem iterations Upper-level evalutions   4.4.Application: Learning MRI Sampling Patterns.Lastly, we turn our attention to the problem of learning MRI sampling patterns.In this case, our lower-level problem is (2.3) with A(θ) = F , where F is the Fourier transform, and S(θ) is a nonnegative diagonal sampling matrix.Following [32], we aim to find sampling parameters θ ∈ [0, 1] d corresponding to the weight associated to each Fourier mode, our sampling matrix is defined as The resulting lower-level problem is µ-convex and L-smooth as per (2.4) with A * θ S θ A θ = S(θ) = max i θ i /(1 − θ i ) and λ min (A * θ S θ A θ ) = λ min (S θ ) = min i θ i /(1 − θ i ).For our testing, we fix the regularization and smoothness parameters α = 0.01, ν = 0.01 and ξ = 10 −4 in (2.3).We use n = 10 training images constructed using the method described in subsection 2.2 with N = 64 and σ = 0.05.Lastly, we add a penalty to our upper-level objective to encourage sparse sampling patterns: J (θ) := β θ 1 , where we take β = 0.1.To fit the least-squares structure (3.1), we rewrite this term as J (θ) = ( β θ 1 ) 2 .To ensure that S(θ) remains finite and J (θ) remains L-smooth, we restrict 0.001 ≤ θ i ≤ 0.99.
We run DFO-LS with a budget of 3000 evaluations of the upper-level objective and ρ end = 10 −6 .As in subsection 4.2, we compare dynamic accuracy DFO-LS against (fixed accuracy) DFO-LS with low-and high-accuracy evaluations given by a 1,000 and 10,000 iterations of GD or 200 and 1,000 iterations of FISTA.
With our 1 penalty on θ, we expect DFO-LS to find a solution where many entries of θ are at their lower bound θ i = 0.001.Our final sampling pattern is chosen by using the corresponding θ i if θ i > 0.001, otherwise we set that Fourier mode weight to zero.
In Figure 15 we show the objective decrease achieved by each variant, and the cumulative lower-level work required by each variant.All variants except low-accuracy GD achieve the best objective value with low uncertainty.However, as above, the dynamic accuracy variants achieve this value significantly earlier than the fixed accuracy variants, largely as a result of needing much fewer GD/FISTA iterations in the (lower accuracy) early upper-level evaluations.In particular dynamic accuracy GD reaches the minimum objective value about 100 times faster than high-accuracy GD.We note that FISTA with 200 iterations ends up requiring fewer lower-level iterations after a large number of upper-level evaluations, but the dynamic accuracy variant achieves is minimum objective value sooner.
We show the final pattern of sampled Fourier coefficients (after thresholding) in Figure 16.Of the five variants which found the best objective value, all reached a similar set of 'active' coefficients θ i > 0.001 with broadly similar values for θ i at all frequencies.For demonstration purposes we plot the reconstructions corresponding to the coefficients from the 'dynamic FISTA' variant in Figure 17 (the reconstructions of the other variants were all similar).All the training images are reconstructed to high accuracy, with only a small loss of contrast near the jumps.

Conclusion.
We introduce a dynamic accuracy model-based DFO algorithm for solving bilevel learning problems.This approach allows us to learn potentially large numbers of parameters, and allowing inexact upper-level objective evaluations with which we dramatically reduce the lower-level computational effort required, particularly in the early phases of the algorithm.Compared to fixed accuracy DFO methods, we achieve similar or better upperlevel objective values with much less work: in some cases up to 100 times faster.These observations can be made for both lower-level solvers GD and FISTA, with different fixed accuracy requirements, for ROF-denoising and learning MRI sampling patterns.Thus the proposed approach is robust in practice, computationally efficient and backed by convergence and worst-case complexity guarantees.Although the upper-level problem is nonconvex, our numerics do not suggest that convergence to non-global minima is a point for concern here.
Future work in this area includes relaxing the smoothness and/or strong convexity assumptions on the lower-level problem (making the upper-level problem less theoretically tractable).Our theoretical analysis would benefit from a full proof that our worst-case complexity bound on the lower-level computational work is tight.Another approach for tackling bilevel learning problems would be to consider gradient-based methods which allow inexact gradient informa-tion.Lastly, bilevel learning appears to compute a regularization parameter choice strategy which yields a convergent regularization method.Further investigation is required to back these numerical results by sound mathematical theory.

2. 3 .
Approximate Solutions.2.3.1.Gradient Descent.For simplicity we drop the dependence on i for the remainder of this section.

Definition 3 . 1 (
Fully linear model).The model m k (3.5) is a fully linear model for f
43) and (3.44) with (3.39) and (3.40), we conclude that the number of times we make m k fully linear is

Figure 11 . 3 -
Figure 11.3-parameter results (dynamic accuracy FISTA): reconstructions of first training image using best parameters θ after N evaluations of upper-level objective (reconstruction based on 1,000 FISTA iterations).

Figure 15 .
Figure 15.Results for the MRI sampling problem.Note: the low-accuracy GD variant (K = 1, 000) terminates on a small trust-region radius, as it is unable to make further progress.

Figure 16 .
Figure 16.Final (after thresholding) MRI sampling patterns found by each DFO-LS variant.This only shows which Fourier coefficients have θi > 0.001, it does not show the relative magnitudes of each θi.

Figure 17 .
Figure 17.Reconstructions using final (after thresholding) MRI sampling pattern found by the dynamic FISTA variant of DFO-LS.Results from running 2,000 FISTA iterations of the lower-level problem.