Binary optimal control by trust-region steepest descent

We present a trust-region steepest descent method for dynamic optimal control problems with binary-valued integrable control functions. Our method interprets the control function as an indicator function of a measurable set and makes set-valued adjustments derived from the sublevel sets of a topological gradient function. By combining this type of update with a trust-region framework, we prove that our method achieves asymptotic stationarity despite possible discretization errors and truncation errors during step determination. To demonstrate the practical applicability of our method, we solve two optimal control problems constrained by ordinary and partial differential equations, respectively, and one topological optimization problem.


Introduction
We consider optimization problems of the form

min_{U ∈ Σ} J(U),    (1)

where the variable U is a measurable set selected from a finite atomless measure space (Ω, Σ, μ) and J : Σ → R is differentiable with respect to its set-valued argument U. Specifically, we focus on cases where there exist a Banach space Y, a continuously Fréchet differentiable map J : Y → R, and a vector measure ν : Σ → Y of bounded variation such that J(U) = J(ν(U)) for all U ∈ Σ. Such optimization problems occur implicitly in the context of binary optimal control whenever Lebesgue-measurable control functions are considered. An instance is the Lotka-Volterra fishing problem, a test problem from ODE-constrained optimal control that we address in Sect. 4. Here, ν maps a Lebesgue-measurable control set U ⊆ [0, t_f] to its characteristic function χ_U, which serves as the control function w. Each w corresponds to an ODE solution y_w, and J(w) is given by the cost integrated along y_w. We develop our theory for the abstract problem (1); however, this more concrete example may aid the reader's understanding.
Like decomposition methods, we use the fact that, under mild assumptions, control sets form a continuum and can therefore be improved incrementally. To achieve improvement, we use local linearizations based on a gradient concept similar to the topological gradient function as defined in [28].
In topology optimization, the topological gradient is often dependent on the perturbation shape. This is the case, for instance, if new boundary conditions are added by perturbations (see, e.g., [28,36]), and it can limit the ways in which perturbations can be made. There are solution heuristics (see, e.g., [12]) that are applicable in these situations. However, more rigorous approaches based on, for instance, fixed-point iterations [9,27], gradient descent on level set functions [2], and gradient descent on binary-valued indicator functions [10] also exist. For a broader overview of shape and topology optimization, we refer to [11,13].
The method presented here is not designed to solve topology optimization problems. It differs from [2] in the sense that it does not operate on level set functions. Instead, it operates directly on sets, like in [9,27]. It differs from these fixed-point iteration schemes in that it works cumulatively by deriving steps that are "added" to the current control set U rather than completely replacing it.
It most closely resembles the method of [10]. The main difference here is that it is stated as a generic trust-region method within a metric space. Much of its theoretical framework mirrors that of existing trust-region methods. It therefore offers great potential for future extensions into constrained optimization and optimization with higher-order derivatives, both of which are difficult with the specific framework given in [10].
Finally, we note that there are optimal control methods that use measure-valued controls, e.g., [22]. Our method, in contrast, uses set-valued rather than measure-valued controls. While hybrid measure and set optimization methods may be an interesting avenue for future research, we will not discuss measure-valued optimization here.
Contribution We derive the topological gradient as a derivative with respect to changes in the control set U . We provide a framework for transforming optimal control problems into this form and deriving topological gradients. We demonstrate this for both ODE- and PDE-constrained problems.
Our main theoretical contribution lies in the use of the metric structure of measure spaces. We show that asymptotic stationarity can be guaranteed using the trust-region framework, which automatically discovers appropriate step sizes by solving a succession of subproblems. We describe a solution method for these subproblems in Procedure 1. We show that our steps are of adequate quality despite discretization and truncation errors in Lemma 9 and Theorem 3. We follow the reasoning of other trust-region convergence proofs to show that, given a small amount of groundwork, many of the steps are transferable from conventional nonlinear optimization.
We present our work with the caveat that we do not prove ultimate convergence of the control sequence and that we do not even guarantee that it is a Cauchy sequence. We do, however, guarantee an actual improvement in objective and approximate stationarity in the limit. Under certain assumptions, the latter can be used to derive bounds on the optimality gap.
Outline In Sect. 2 we introduce basic notation, definitions, and prior results. In Sect. 3, we state our trust-region algorithm and show its asymptotic behavior. In Sect. 4 we apply the algorithm to three test problems as a proof of concept. In Sect. 5 we discuss our results and address some possible criticisms of the methodology. We provide some concluding remarks and speculate on avenues of future research.

Preliminaries and notation
We denote the set of natural numbers with zero by N_0 and the set of non-negative real numbers by R_+, and we use the usual shorthands for index sets. For basic definitions and results of measure theory, we refer to [7]. By 2^Ω, we denote the power set of the universal set Ω. In addition to positive and signed measures, we use vector measures, which are σ-additive maps ν : Σ → Y, where Σ is a σ-algebra, Y is a Banach space, and ν(∅) = 0. Every vector measure ν : Σ → Y has an associated measure |ν| : Σ → R_+ ∪ {∞} given by

|ν|(A) := sup_π Σ_{B ∈ π} ‖ν(B)‖_Y,

where the supremum is taken over all finite partitions π of A into measurable sets. This measure is referred to as the total variation of ν. If ‖ν‖ := |ν|(Ω) < ∞, then ν is said to be of bounded variation.
The concept of atomlessness is particularly important to our argument. A measure space (Ω, Σ, μ) is atomless if it contains no atoms, that is, no sets A ∈ Σ with μ(A) > 0 such that for every measurable subset B ⊆ A, we have either μ(B) = μ(A) or μ(B) = 0. Given an atomless measure space, a measurable set A ∈ Σ, and any number θ ∈ [0, μ(A)], there exists a measurable set B ⊆ A with μ(B) = θ .
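The subset-selection property of atomless spaces can be illustrated on a discretization of an interval with the Lebesgue measure. The following minimal sketch (all names and the interval data are ours, not from the text) carves a set B of exactly the prescribed measure θ out of A by splitting the last cell, which is possible precisely because cells can be subdivided arbitrarily:

```python
# Sketch: on an atomless space such as [0, 1] with the Lebesgue measure,
# a subset of any prescribed measure theta <= mu(A) can be carved out of A.
# Here A is a union of intervals; we take intervals until the budget theta
# is exhausted, splitting the last one to hit theta exactly (no atoms).

def subset_of_measure(cells, theta):
    """cells: list of (left, right) intervals forming A; returns intervals B
    with total length exactly theta (assumes theta <= total length of A)."""
    picked, acc = [], 0.0
    for (a, b) in cells:
        length = b - a
        if acc + length <= theta:
            picked.append((a, b))
            acc += length
        else:
            picked.append((a, a + (theta - acc)))  # split the last cell
            break
    return picked

A = [(0.0, 0.25), (0.5, 1.0)]          # mu(A) = 0.75
B = subset_of_measure(A, 0.4)
print(sum(b - a for a, b in B))        # 0.4, up to rounding
```

On a space with atoms (e.g., a counting measure), the final split would be impossible, which is why atomlessness is essential for the step-size control used later.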
If μ is a measure and ν is a signed or vector measure over (Ω, Σ) with |ν|(A) = 0 for all A ∈ Σ with μ(A) = 0, then ν is called absolutely continuous with respect to μ, written ν ≪ μ. For finite signed measures, absolute continuity implies the existence of a density function.
Lemma 1 (Radon-Nikodym) Let ϕ be a finite signed measure over a finite measure space (Ω, Σ, μ) such that ϕ ≪ μ. Then there exists a μ-integrable function f : Ω → R such that ϕ(A) = ∫_A f dμ for all A ∈ Σ.
The function f in Lemma 1 can be seen as the density function of ϕ with respect to μ, and we subsequently refer to it as such. The average of a density function f over a given μ-measurable set D is given by (1/μ(D)) ∫_D f dμ. If μ is the Lebesgue measure in R^n, then Lebesgue's differentiation theorem shows that f(x) can be calculated almost everywhere by taking the limit of such averages over infinitesimally small balls around x. We refer to [7, Thm. 5.6.2] for a proof.
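The recovery of a density by shrinking averages, as asserted by Lebesgue's differentiation theorem, can be checked numerically. In this sketch, f is our own hypothetical density, and ϕ is the measure it induces; averages of ϕ over shrinking balls around a point approach the density value there:

```python
# Sketch of Lebesgue differentiation: the density f of a measure phi with
# respect to the Lebesgue measure is recovered a.e. as the limit of averages
# of phi over shrinking balls. phi([a, b]) = integral of f over [a, b] for a
# known f (our illustrative choice); ball averages at x approach f(x).

import math

f = lambda x: math.sin(3.0 * x)          # hypothetical density

def phi(a, b, n=2000):
    # phi([a, b]) = \int_a^b f dx, approximated by the midpoint rule
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

x = 0.7
for r in (0.1, 0.01, 0.001):
    avg = phi(x - r, x + r) / (2 * r)    # average of f over the ball B_r(x)
    print(r, abs(avg - f(x)))            # error shrinks with the radius
```

This is exactly the mechanism used later to approximate gradient densities from cell-wise integrals on refined meshes.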

The symmetric difference between two sets A, B is given by A △ B := (A\B) ∪ (B\A).
Given a finite measure space (Ω, Σ, μ), the map d(A, B) := μ(A △ B) is a pseudometric, meaning that it is symmetric and subadditive. By considering the quotient space with respect to the equivalence relation of being equal up to a μ-nullset, d can be made into a metric. We note that U △ (U △ D) = D and therefore d(U, U △ D) = μ(D). Given a measure space (Ω, Σ, μ) and a μ-measurable function g : Ω → R, we denote the various types of level sets of g by L_{g∼η} := {x ∈ Ω : g(x) ∼ η}, where "∼" ∈ { "<", "≤", "=", "≥", ">" } and η ∈ R. These level sets are μ-measurable, which implies that their symmetric difference U △ L_{g∼η} with a μ-measurable control set U ∈ Σ is μ-measurable. The same is true for unions, intersections, and differences with other μ-measurable sets.
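With sets discretized as boolean masks over cells of equal measure, the pseudometric d and the involution property of the symmetric difference reduce to bitwise operations. A minimal sketch (grid and sets are our own illustration):

```python
# Sketch: sets as boolean masks over 8 equal-measure cells. The pseudometric
# d(A, B) = mu(A symm-diff B) counts differing cells times the cell measure;
# applying the same step D twice returns to the starting set.

cell = 1.0 / 8                                   # measure of each cell
U = [1, 1, 0, 0, 1, 0, 0, 0]
D = [0, 1, 1, 0, 0, 0, 1, 0]

symm = lambda A, B: [a ^ b for a, b in zip(A, B)]
d = lambda A, B: cell * sum(symm(A, B))

step = symm(U, D)              # the updated control set: U symm-diff D
print(d(U, step))              # equals mu(D) = 3 * cell = 0.375
print(symm(U, step) == D)      # True: stepping by D twice undoes the step
```

This is the sense in which the algorithm later "adds" steps to the current control set: a step of measure μ(D) moves the iterate exactly distance μ(D) in the metric d.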

Algorithm
In this section, we state the algorithm and prove its correctness. We split this discussion into four subsections. Section 3.1 defines the topological gradient as we use it and shows how to derive it from Fréchet derivatives. Section 3.2 states a gradient-based necessary optimality criterion. Section 3.3 derives the gradient density function for two types of optimal control problems. Section 3.4 states the algorithm and its subroutines and proves that the algorithm achieves stationarity in the limit. We formulate the algorithm in a form that allows for inexactness in some steps to ensure that the procedure remains practically implementable.
Throughout this section, we use the following assumptions.

Taylor expansion
In traditional nonlinear optimization, we make use of the fact that sufficiently smooth functions are locally approximated by truncations of their Taylor series. We transfer this property from the Fréchet differentiable objective J to J using Assumption 1.4. This requires the chain rule.
Along with Assumption 1.6, this yields a bound on |ϕ_U(D)| that holds for all D ∈ Σ, which shows that ϕ_U(D) < ∞ and ϕ_U ≪ μ, assuming ϕ_U can be shown to be a signed measure.
Because T_U is linear and ν is a vector measure, ϕ_U is finitely additive, and we have ϕ_U(∅) = 0. To show σ-additivity, let (D_i)_{i∈N} ⊆ Σ consist of pairwise disjoint measurable sets, and let N ∈ N. In the estimates obtained from the finite additivity of ϕ_U, convergence to zero is guaranteed by the fact that (D_i)_{i∈N} is a sequence of pairwise disjoint subsets of Ω, which has finite measure. In total, this proves that ϕ_U is σ-additive. Absolute convergence follows from the fact that the limit is identical for all rearrangements of the sequence.
By using the Fréchet derivative DJ(ν(U)) as T_U in Lemma 2, we can prove the existence of a finite signed measure J′(U) that acts as a first-order derivative of J. With this measure, we can formulate a local first-order Taylor expansion of J around U.
Theorem 1 (Local First-Order Taylor Expansion) Let Ω, Σ, Y, J, μ, ν, and J satisfy Assumptions 1.1 to 1.6. For every U ∈ Σ, let J′(U) : Σ → R be given by (2). Then J′(U) is a finite signed measure with J′(U) ≪ μ, and the first-order Taylor expansion (3) holds in the sense that the remainder R_U(D) := J(U △ D) − J(U) − J′(U)(D) satisfies R_U(D) → 0 for μ(D) → 0.
Proof The map DJ(ν(U)) is a bounded linear operator. Lemma 2 shows that J′(U) as defined in (2) is a finite signed measure and that J′(U) ≪ μ. Let (D_i)_{i∈N} be a sequence in Σ such that μ(D_i) → 0, and let d_i := ν(D_i\U) − ν(D_i ∩ U). Without loss of generality, we consider only sequences where μ(D_i) > 0 and ‖d_i‖_Y > 0 for all i ∈ N. By combining (2), (4), and the Fréchet differentiability of J, we can conclude that R_U(D_i) → 0.
Theorem 1 can be extended to include higher-order derivatives. This requires a technical measure extension argument and does not contribute to the method under discussion.
Because μ and J′(U) are finite and J′(U) ≪ μ, J′(U) has a μ-integrable density function, which we use to construct steepest descent steps.
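The first-order expansion and the sign structure of the set gradient can be checked on a toy functional of our own choosing (not from the text): adding mass outside U and removing mass inside U flip the sign of the underlying Fréchet derivative density.

```python
# Sketch (toy functional, ours): J(w) = 0.5*(int w c dx - t)^2 on [0, 1],
# nu(U) = chi_U. The Frechet derivative is DJ(w) v = s * int v c dx with
# s = int w c dx - t, and the set-gradient density flips sign on U:
#   g_U(x) = s*c(x) for x not in U,   g_U(x) = -s*c(x) for x in U.
# We verify the Taylor expansion J(U symm D) ~ J(U) + int_D g_U dmu.

n = 1000
h = 1.0 / n
c = [1.0 + i * h for i in range(n)]              # weight c(x) = 1 + x
t = 0.3

def J(U):
    s = sum(c[i] * h for i in range(n) if U[i]) - t
    return 0.5 * s * s

U = [i < n // 4 for i in range(n)]               # U = [0, 1/4)
s = sum(c[i] * h for i in range(n) if U[i]) - t
g = [(-s if U[i] else s) * c[i] for i in range(n)]

D = [n // 2 <= i < n // 2 + 10 for i in range(n)]   # small step, mu(D) = 0.01
UD = [u ^ d for u, d in zip(U, D)]
actual = J(UD) - J(U)
predicted = sum(g[i] * h for i in range(n) if D[i])
print(actual, predicted)   # agree up to second order in mu(D)
```

The mismatch between `actual` and `predicted` is exactly the remainder term, which is quadratic in the step measure, consistent with the curvature bound of Lemma 3.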

Corollary 1
Let Ω, Σ, μ, Y, J, J, and ν satisfy Assumptions 1.1 to 1.6. For every U ∈ Σ, there exists g_U ∈ L¹(Ω, μ) such that J′(U)(D) = ∫_D g_U dμ for all D ∈ Σ.
Proof Let U ∈ Σ be given. According to Theorem 1, J′(U) is a finite signed measure with J′(U) ≪ μ. Because μ is also finite, Lemma 1 proves the existence of a g_U ∈ L¹(Ω, μ) with the stated properties.
There are different ways to calculate g U . Section 3.3 derives g U for two exemplary ODE-and PDE-constrained problems. The approximation (3) is only accurate in a small neighborhood of U . To obtain estimates for the error accumulated over larger distances, we need to cover a connecting line with such neighborhoods and use a compactness argument to extract a finite subcover. We can then choose a finite number of support points such that first-order Taylor approximations are accurate when traveling from one support point to the next, as illustrated in Fig. 1.
Sets do not have an exact equivalent of convex combinations or "connecting lines". We could construct a similar argument by using geodesics. In the interest of brevity, however, we assume instead that the derivative of J : Y → R satisfies a Lipschitz condition. We can then make the argument in the underlying vector space Y .

Lemma 3
Let Ω, Σ, Y, J, μ, ν, and J satisfy Assumptions 1.1 to 1.6. Furthermore, let C > 0 be as specified in Assumption 1.6, and let the Fréchet derivative DJ of J be such that there exists a constant L > 0 with ‖DJ(x) − DJ(y)‖ ≤ L·‖x − y‖_Y for all x, y ∈ conv(ν(Σ)). Then we have

|J(U △ D) − J(U) − J′(U)(D)| ≤ (L·C²/2)·μ(D)²  for all U, D ∈ Σ.  (6)

Proof Let U, D ∈ Σ be given and let x := ν(U) and d := ν(D\U) − ν(D ∩ U).
We then have J(U △ D) − J(U) = J(x + d) − J(x). According to (2) in Theorem 1, J′(U)(D) is given by DJ(x)d, and we have ‖d‖_Y ≤ C·μ(D) according to Assumption 1.6. Therefore, the standard estimate for Lipschitz-continuous derivatives yields |J(x + d) − J(x) − DJ(x)d| ≤ (L/2)·‖d‖²_Y ≤ (L·C²/2)·μ(D)², which proves (6).
We note that a geodesic-based argument would not require a Lipschitz condition over conv(ν(Σ)), but only over ν(Σ), because support points can be chosen exclusively from ν(Σ).

Optimality criterion
Our method works towards achieving a necessary optimality criterion based on stationarity. We have shown in Corollary 1 that the derivative measure has an integrable density function g_U. The integral of g_U over a step set D approximates the change in objective for small steps. Therefore, if

μ(L_{g_U<0}) = 0,  (7)

then all sufficiently small steps are predicted to either maintain or increase the objective. The following elementary lemma shows that negative density values on non-nullsets always translate into descent steps.
Lemma 4 Let (Ω, Σ, μ) be a finite atomless measure space, let g ∈ L¹(Ω, μ), and let λ ∈ [0, 1]. Then there exists D ∈ Σ with μ(D) = λ·μ(Ω) and ∫_D g dμ ≤ λ·∫_Ω g dμ.
Proof Let η* := inf{η ∈ R : μ(L_{g≤η}) ≥ λ·μ(Ω)}, D_1 := L_{g<η*}, and D_2 := L_{g≤η*}, so that μ(D_1) ≤ λ·μ(Ω) ≤ μ(D_2). Because (Ω, Σ, μ) is atomless, we can choose a measurable set D̃_2 ⊆ D_2\D_1 such that μ(D̃_2) = λ·μ(Ω) − μ(D_1). We define D := D_1 ∪ D̃_2 and obtain μ(D) = λ·μ(Ω) and D ⊆ L_{g≤η*}. We therefore have g ≤ η* μ-a.e. on D and g ≥ η* μ-a.e. on Ω\D. If we were to assume that ∫_D g dμ > λ·∫_Ω g dμ, it would imply that the average of g over D exceeds the average of g over Ω, and hence also the average of g over Ω\D, which is impossible by the preceding inequalities. This proves by contradiction that ∫_D g dμ ≤ λ·∫_Ω g dμ, i.e., that the average of g over D does not exceed its overall average.
It is useful to think of λ = μ(D)/μ(Ω) and g = g_U for some U ∈ Σ. We then find a step D ∈ Σ such that ∫_D g_U dμ ≤ (μ(D)/μ(Ω))·∫_Ω g_U dμ. For any measure less than or equal to μ(Ω), we can therefore choose a μ-measurable subset D ⊆ Ω with that exact measure such that the predicted decrease in objective for the step D is no worse than the average over all of Ω. This does not imply that the predicted change is negative. We can ensure a decrease by applying Lemma 4 to the finite atomless measure space restricted to L_{g_U<0}: there then exists a step D of given size that captures at least a corresponding fraction of the maximal achievable predicted decrease. We can use this to prove that (7) is a necessary criterion for local optimality.
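The core selection idea behind Lemma 4 can be sketched on a uniform grid (density and grid are our own illustration): taking the λ-fraction of cells where g is smallest, i.e., a discrete sublevel set, always captures at most the proportional share of the total integral.

```python
# Sketch of Lemma 4 on a uniform grid: D = the lam-fraction of cells where g
# is smallest (a sublevel set, up to ties) satisfies
#   int_D g dmu <= lam * int_Omega g dmu,
# i.e., D does at least as well as the overall average.

import math

n = 200
h = 1.0 / n
g = [math.cos(7 * i * h) + 0.3 for i in range(n)]   # arbitrary density

lam = 0.25
k = int(lam * n)                                    # mu(D) = lam * mu(Omega)
order = sorted(range(n), key=lambda i: g[i])
D = set(order[:k])                                  # smallest values of g

int_D = sum(g[i] * h for i in D)
int_all = sum(g[i] * h for i in range(n))
print(int_D <= lam * int_all + 1e-12)               # True
```

Restricting the same selection to the cells where g is negative yields a step with a guaranteed fraction of the maximal predicted decrease, which is how descent is ensured above.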

Lemma 5
Let Ω, Σ, Y, J, μ, ν, and J satisfy Assumptions 1.1 to 1.7. Let U ∈ Σ be a locally optimal solution in the sense that there exists R > 0 with J(U) ≤ J(U △ D) for all D ∈ Σ with μ(D) ≤ R. Then U satisfies (7).
Proof We assume the contrary, i.e., that g_U is negative on a set of positive measure. From Theorem 1, there exists R̃ > 0 such that the first-order prediction is sufficiently accurate for all steps of measure at most R̃, which implies that a sufficiently small step chosen as in Lemma 4 strictly decreases J, contradicting local optimality.
Because g_U is the density function of the gradient measure, (7) is essentially a stationarity condition. We subsequently refer to points that satisfy (7) as stationary points. Similarly, we refer to U ∈ Σ as ε-stationary for given ε > 0 if and only if −∫_{L_{g_U<0}} g_U dμ ≤ ε. We refer to the integral on the left-hand side as the instationarity of U. One cannot always find solutions that satisfy the necessary optimality criterion (7). For instance, if the vector measure ν maps a set U ∈ Σ to its characteristic function ν(U) = χ_U ∈ L¹(Ω), then the image ν(Σ) is not closed, and accumulation points of sequences may not themselves be characteristic functions of measurable sets.
A weak guarantee can be given for the degree of suboptimality of an ε-stationary point. If we make an assumption of limited curvature in the sense of Lemma 3, then the estimate J(U △ D) ≥ J(U) − ε − (L·C²/2)·μ(D)² holds for every D ∈ Σ.
Here, L > 0 is the Lipschitz constant associated with changes in the Fréchet derivative of J, and C > 0 is the constant from Assumption 1.6. This implies that within a given radius R > 0, U is suboptimal by at most ε + (L·C²/2)·R². It is further possible to infer that the sequence of control functions ν(U_i) forms a Cauchy sequence if the J(U_i) approach the infimum of J over conv(ν(Σ)) and the underlying functional J is strictly convex.

Approximating gradient densities
In the previous sections, we have shown that, under Assumption 1, there exists a gradient density function g_U : Ω → R for every control set U ∈ Σ such that J′(U)(D) = ∫_D g_U dμ for all D ∈ Σ. It may not be immediately clear how to calculate a useful representation of g_U. We now briefly discuss two cases in which g_U can be determined relatively easily.

ODE-constrained optimal control
We first consider the case where Problem (1) is derived from an ODE-constrained optimal control problem of the form

min_{w,y} ∫_0^{t_f} l(y(t), w(t)) dt  s.t.  ẏ(t) = f(y(t), w(t)) for a.a. t ∈ (0, t_f],  y(0) = y_0,  w(t) ∈ {0, 1} for a.a. t ∈ [0, t_f],

with constant t_f > 0 and y_0 ∈ R^n. This includes the Lotka-Volterra problem from Sect. 1. We only discuss autonomous ODEs here, which serves primarily to unclutter notation. To guarantee that unique solutions and derivatives exist, we make some assumptions on l and f.
Assumption 2 Let t_f > 0, D ⊆ R^n, l : D × R → R, and f : D × R → R^n satisfy conditions 2.1 to 2.5, which impose regularity for almost every t, every τ ∈ (0, t_f], and every absolutely continuous trajectory. It may be possible to further relax these assumptions. However, this formulation is sufficiently general to apply to the Lotka-Volterra test problem, which we discuss in greater depth in Sect. 4.2. While demanding affine linearity in w may seem restrictive, it is always achievable for binary-valued controls by partial outer convexification [30,31].
We begin with a stability result that allows us to extend the subsequent existence result to control functions whose values lie outside of [0, 1].
Then the deviation between the corresponding trajectories is controlled by the L¹ distance of the controls.
Proof By using the Lipschitz condition in Assumption 2.4, we obtain a bound on the trajectory deviation for all t ∈ [0, τ). According to Gronwall's inequality, the claim follows. Lemma 6 shows that the solution to the initial value problem exists not only for control functions with values in [0, 1], but also in a small L¹ environment around them.
Lemma 7 Let t_f > 0, D ⊆ R^n, and f satisfy Assumption 2, and let ε > 0 denote the constant from Assumption 2.5. Then there exists δ > 0 such that for every w ∈ L¹([0, t_f]) with w(t) ∈ [0, 1] almost everywhere and every v ∈ B_δ(w), there exists a unique absolutely continuous solution of the initial value problem.
Proof We use the generalized existence theory based on the Carathéodory condition and extend the resulting local solutions to their maximal existence interval.
We first consider the case v = w, where w(t) ∈ [0, 1] almost everywhere is guaranteed. If the maximal existence interval were to end at τ ≤ t_f, then the continuous extension of y_w would satisfy y_w(t) → ∂D for t → τ, which would contradict Assumption 2.5. We can therefore extend y_w to [0, t_f]. Let ε > 0 denote the constant given in Assumption 2.5. We can then use Lemma 6 to extend this argument to all v ∈ B_δ(w). Lemma 7 ensures that there exists a small neighborhood around each admissible control function in which the ODE solution is well-defined. This is important for the definition of derivatives, which are defined using changes over infinitesimally small neighborhoods.
As our measure space (Ω, Σ), we choose [0, t_f] with the Lebesgue σ-algebra. The Banach space Y is the space L¹(Ω), and ν maps U → χ_U, where χ_U denotes the characteristic function of U. The objective J : Y → R is given by J(w) := ∫_0^{t_f} l(y_w(t), w(t)) dt, and the function J is then given by J(U) := J(χ_U). For the measure μ, we can choose either the Lebesgue measure or an equivalent measure. In either case, there is a density function m of μ with respect to the Lebesgue measure. To show that these choices satisfy Assumption 1, we have to prove that J is Fréchet differentiable. We first show that the ODE solutions corresponding to control functions in the δ-environment established in Lemma 7 are uniformly bounded. This allows us to consider f only on a compact convex subset U ⊆ R^n on which the norms of the second derivatives of f and l can be bounded. The conjunction of Lemma 6 with the norm bounds on the second derivatives and Assumption 2.3 allows us to control residual terms that appear when deriving the Fréchet derivative of J from the Lagrangian function of the original problem.
Let δ > 0 be the constant derived in Lemma 7. There exists a convex compact set U ⊆ D that contains y_v(t) for all t ∈ [0, t_f] and all control functions v that are [0, 1]-valued or within L¹-distance δ of such a function. To see this, we bound the trajectories by a norm estimate α(t) that is increasing in t; applying Gronwall's inequality then yields a uniform bound. We note that U is the intersection of two convex closed sets and is therefore convex and closed. Since U is also bounded, it is compact.
Since f and l are twice continuously differentiable with respect to (y, w), this implies that their first and second derivatives are also bounded on U .
We briefly note that the uniform bound on all first and second derivatives of f and l also implies that J satisfies the curvature condition required by Lemma 3. This implies the suboptimality bound described at the end of Sect. 3.2. With the costate function λ, we can define the Lagrangian function Λ, using integration by parts for the second reformulation. We note that Λ(y_w, w, λ) = J(w) for all λ and all w. Therefore, the change in objective can be written as a difference of Lagrangian values. The map J is Fréchet differentiable in v, and its derivative is given by an expression in which λ_v denotes the costate function associated with the control function v and its initial value problem solution y_v.
Proof We make use of the derivative expression of the costate equation. We then perform a truncated Taylor expansion with integral residual expressions on the objective.
For the sake of brevity, we write Δw := (w − v) and Δy := (y_w − y_v). We note that the first integral in the last step is equal to DJ(v)Δw. According to Lemma 8, there exists a compact convex set U ⊆ D such that y_w(t) ∈ U and y_v(t) ∈ U for all v, w that are either [0, 1]-valued or in a δ-neighborhood of such a control function. Given that all convex combinations between y_v and y_w lie in the compact set U, there exists a constant L > 0 bounding the second-order terms. Because ‖Δy(t)‖ ≤ C·‖Δw‖_{L¹} for all t ∈ [0, t_f], the residual terms are of order ‖Δw‖²_{L¹}, from which it follows that the remainder vanishes faster than ‖Δw‖_{L¹}. This shows that J is Fréchet differentiable with respect to L¹ changes in the control function.
Let U ∈ Σ denote the current control set. The gradient measure ϕ_U at U can be derived from the bounded linear operator DJ(ν(U)) = DJ(χ_U) using Lemma 2. For a given step D ∈ Σ, the resulting expression involves the density function m of the measure μ with respect to the Lebesgue measure, and the density function of ϕ_U with respect to μ is obtained accordingly. We note the close connection of this expression to the Hamiltonian and to the maximum principle for hybrid systems, see, e.g., [35]. The basic idea of the Competing Hamiltonian approach to mixed-integer optimal control, first described in [5,6], can therefore be seen as a special case of our approach. Since the process by which this density function is derived closely mirrors the adjoint differentiation scheme commonly used to extract derivatives from numerical integrators, the density function can be extracted from off-the-shelf integrators. We use this approach in Sect. 4.2 to extract the density function from the CVODES solver of the SUNDIALS suite.
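The adjoint extraction of a per-time-step gradient can be sketched without any external integrator. The problem data below (dynamics, cost, horizon) are our own toy choices, and forward Euler stands in for CVODES; the discrete adjoint recurrence mirrors the costate ODE, and the per-step gradient, divided by the step size, approximates the gradient density at that time point.

```python
# Sketch (toy problem, ours): dynamics y' = -y + w, y(0) = 1 on [0, 1],
# objective J(w) = int y^2/2 dt, discretized by forward Euler. A reverse
# (discrete adjoint) sweep produces dJ/dw_j for every step j, checked here
# against a finite difference.

T, N = 1.0, 400
dt = T / N

def simulate(w):
    y = [1.0]
    for j in range(N):
        y.append(y[j] + dt * (-y[j] + w[j]))   # forward Euler step
    return y

def objective(w):
    y = simulate(w)
    return sum(0.5 * y[j] ** 2 * dt for j in range(N))

def gradient(w):
    y = simulate(w)
    lam = [0.0] * (N + 1)                      # discrete costate
    grad = [0.0] * N
    for j in reversed(range(N)):
        # stage-cost contribution + transpose of the Euler step
        lam[j] = lam[j + 1] * (1.0 - dt) + y[j] * dt
        grad[j] = lam[j + 1] * dt              # d y_{j+1} / d w_j = dt
    return grad

w = [1.0 if j < N // 2 else 0.0 for j in range(N)]   # bang-bang control
g = gradient(w)
j0, eps = N // 4, 1e-6
wp = list(w); wp[j0] += eps
fd = (objective(wp) - objective(w)) / eps
print(abs(fd - g[j0]))                          # tiny: adjoint matches FD
```

Dividing `grad[j]` by `dt` gives the discrete analogue of the gradient density discussed above; with an off-the-shelf adjoint-capable integrator the reverse sweep is performed internally.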

PDE-constrained optimization
Next, we consider the case where Problem (1) is derived from a PDE-constrained optimization problem of the form

min_{w,y} j(y, w)  s.t.  f(y, w) = 0,

where Ω ⊆ R^n denotes a bounded domain, X and Z are suitably chosen Banach spaces, j : X × L¹(Ω) → R, and f : X × L¹(Ω) → Z. We make additional assumptions to ensure the existence and uniqueness of solutions.
We further assume that there is a sequence of partitions (T^{(i)})_{i∈N} of Ω into mesh cells T^{(i)}_j with the properties stated in Assumptions 3.5 to 3.9. While Assumptions 3.5 to 3.9 are somewhat similar to the assumptions on "order-conserving domain dissections" stated in [26], we note that they differ in that they do not require the partition sequence to be order-conserving. Furthermore, [26] has no counterpart to Assumption 3.6, which we require because we divide by the measure of the cells. Therefore, these sets of assumptions should not be confused. Under Assumptions 3.1 to 3.4, the implicit function theorem [20, Thm. 1.41] shows that the mapping w → y_w is continuously Fréchet differentiable. Our objective functional in this case is the reduced objective J(w) := j(y_w, w), which is continuously Fréchet differentiable according to the chain rule and satisfies DJ(w) = −j_y(y_w, w) f_y^{-1}(y_w, w) f_w(y_w, w) + j_w(y_w, w).
We note that DJ(w) is a bounded linear form on L¹(Ω). Since L∞(Ω) is the dual space of L¹(Ω), there exists a function g_w ∈ L∞(Ω) such that DJ(w)v = ∫_Ω g_w v dx for all v ∈ L¹(Ω). The vector measure ν is once more given by ν(U) := χ_U, and the underlying measure space is the Lebesgue σ-algebra on Ω with the Lebesgue measure or an equivalent measure with weight function m ∈ L∞(Ω). As in the ODE-constrained case in Sect. 3.3.1, the gradient density function g_U for a given control set U ∈ Σ is then obtained from g_w with w = χ_U and the weight m. To approximate the value of g_w, we use the fact that Ω is bounded and therefore L∞(Ω) ⊂ L¹(Ω). We consider the sequence of mesh partitions described in Assumption 3. The family of all mesh cells T^{(i)}_j contracts to nullsets according to Assumption 3.7, can be used to approximate almost all points in Ω according to Assumption 3.8, and is of bounded eccentricity according to Assumption 3.9. We can therefore apply the Lebesgue differentiation theorem to obtain g_w as the limit of cell averages almost everywhere. If, for a given mesh index i, control functions on the i-th mesh are piecewise constant with one degree of freedom per cell, then the derivative of the objective function with respect to the degree of freedom on cell T^{(i)}_j equals the integral of g_w over T^{(i)}_j. Therefore, if we start with a sufficiently fine mesh to express the desired control function w and maintain the same w for all higher mesh indices, which we can do because of Assumption 3.5, then the cell averages of these derivatives converge to g_w(x) for a.a. x ∈ Ω.
Thus, the density function g U can be approximated using the gradients of the objective function J with respect to the degrees of freedom (DOFs) of a piecewise constant control function on successively refined meshes. We note that this does not take into account discretization errors in function evaluation. Controlling such errors usually requires considerations much more specific to the discretization or problem in question. We also note that the application of the bounded inverse f −1 y (y w , w) usually requires the solution of a PDE that involves the adjoint of a linearization of the original differential operator. This method of deriving the gradient is therefore sometimes known as the adjoint method.
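The cell-average approximation described above can be sketched numerically. Here g is our own stand-in for an adjoint-computed gradient, and `dof_gradient` plays the role of the derivative with respect to a piecewise-constant DOF, i.e., the integral of g over the cell; dividing by the cell measure recovers g pointwise under refinement.

```python
# Sketch: on successively refined uniform meshes, dJ/d(DOF of a cell) equals
# the integral of the density g over the cell. Dividing by the cell measure
# approximates g(x) for x in the cell, with error decaying under refinement
# (Lebesgue differentiation).

import math

g = lambda x: math.exp(-x) * math.sin(5 * x)    # hypothetical true density

def dof_gradient(a, b, n=1000):
    # = int over [a, b] of g dx : what adjoint code returns for the cell DOF
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

x = 0.4
errs = []
for cells in (10, 100, 1000):                   # refine the mesh
    w = 1.0 / cells                             # cell measure
    j = int(x / w)                              # cell containing x
    approx = dof_gradient(j * w, (j + 1) * w) / w
    errs.append(abs(approx - g(x)))
    print(cells, errs[-1])                      # error decays ~ 1/cells
```

The decay rate reflects the local Lipschitz constant of g; discretization errors of the state and adjoint solves, which the text deliberately leaves aside, would add on top of this.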
We note that the curvature condition of Lemma 3 can be translated into a Lipschitz condition on the derivatives of f and j in this setting.

Algorithm
In trust-region terminology, we determine our step using the "affine linear" model function

m_U(D) := J(U) + ∫_D g_U dμ.

Accordingly, the trust-region subproblem in a given point U ∈ Σ reads

min_{D ∈ Σ} m_U(D)  s.t.  μ(D) ≤ Δ,

where Δ > 0 is the trust-region radius. Given that U ∈ Σ is fixed for each instance of the trust-region subproblem, the term J(U) can be omitted and we can rewrite the subproblem as

min_{D ∈ Σ} ∫_D g_U dμ  s.t.  μ(D) ≤ Δ.  (8)

The proof of Lemma 4 suggests a method by which the subproblem can be solved. Because g_U is a μ-measurable function, its sublevel and level sets are measurable, and we can select D ∈ Σ to encompass exactly those x ∈ Ω where g_U(x) assumes its smallest values. In the proof, we used the reference level η*, the infimum over all levels η whose sublevel set is at least as large as the prescribed step size. In practice, we need to approximate η* and g_U. We state the solution procedure for (8) in a way that allows the use of an approximation of g_U. On discrete meshes, it may also not be possible to choose D with the exact desired measure. Therefore, we allow for some deviation. The resulting algorithm is Procedure 1. Line 15 in Procedure 1 selects a subset D_2 ⊆ L_{g≤η_2}\L_{g≤η_1} with a given size range. The existence of D_2 is guaranteed by the atomlessness of the underlying measure space. The set D_2 is used to ensure that the resulting step is close enough to the trust-region radius Δ to guarantee sufficient descent.
    2    ... ;                              // Accept full step if possible
    3    else
    4        (η_1, η_2) ← (−δ, 0);
    5        while μ(L_{g≤η_1}) > Δ do
    6            (η_1, η_2) ← (2η_1, η_1);  // Establish bisection range
    7        while η_2 − η_1 > δ/2 do       // Narrow infimum range through bisection
             ...

The selection of D_2 in Line 15 can be implemented in many ways, from selecting mesh cells in a pre-defined or random order to approximating knapsack solutions to achieve as large a step as possible. Given that the gradient function value is guaranteed to lie between η_1 and η_2 at almost all points in D_2, the approximate optimality of the step is guaranteed for all selections of D_2. A range of sizes is given to accommodate problem implementations where the boundaries of the step D need to run along mesh boundaries that cannot be moved arbitrarily. In such cases, mesh refinement may become necessary if it is impossible to find a set D_2 of suitable size at the current mesh resolution. If the gradient density function is sufficiently well approximated and is assumed to remain stable with respect to local mesh refinement, then the main goal of mesh refinement is to achieve sufficient mesh granularity in the candidate set L_{g≤η_2}\L_{g≤η_1} from which D_2 is chosen. Therefore, it is sufficient to refine all cells in this set until sufficient granularity is achieved to select D_2 within the given measure margins.
If the gradient function does change noticeably due to the refinement, then the candidate set may change on subsequent iterations. This is highly undesirable. However, the refinement loop necessarily terminates as soon as every cell in the candidate set has a measure of less than δΔ/(−2η_2), where −η_2 is bounded from above by construction of the bisection range. Therefore, this simplistic mesh refinement scheme will, in the worst case, terminate as soon as the entire mesh is refined to sufficient granularity.
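The bisection-and-top-up logic of Procedure 1 can be sketched on a uniform grid. The density g, the trust-region radius, and the tolerances below are our own illustrative choices; the structure (establish a bracket by doubling, bisect the level, then top up from the band between the two levels) follows the procedure described above.

```python
# Sketch of Procedure-1-style step determination on a uniform grid: bisect a
# level eta so that {g <= eta_1} fits inside the trust region while
# {g <= eta_2} overflows it, then top up with cells from the band
# {eta_1 < g <= eta_2} to bring mu(D) close to Delta.

import math

n = 1000
h = 1.0 / n
g = [math.sin(9 * i * h) for i in range(n)]     # approximate gradient density

def sublevel_measure(eta):
    return h * sum(1 for v in g if v <= eta)

Delta, delta = 0.2, 1e-6
if sublevel_measure(0.0) <= Delta:
    D = [i for i in range(n) if g[i] <= 0.0]    # accept full step if possible
else:
    eta1, eta2 = -delta, 0.0
    while sublevel_measure(eta1) > Delta:       # establish bisection range
        eta1, eta2 = 2 * eta1, eta1
    while eta2 - eta1 > delta / 2:              # narrow the range
        mid = 0.5 * (eta1 + eta2)
        if sublevel_measure(mid) > Delta:
            eta2 = mid
        else:
            eta1 = mid
    D = [i for i in range(n) if g[i] <= eta1]
    band = [i for i in range(n) if eta1 < g[i] <= eta2]
    for i in band:                              # top up towards Delta
        if (len(D) + 1) * h <= Delta:
            D.append(i)
print(len(D) * h, Delta)                        # mu(D) <= Delta, near Delta
```

On an actual mesh, the top-up loop is where cell-selection order, knapsack-style heuristics, or local refinement enter, as discussed above.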
The resulting step cannot be assumed to be optimal. As stated in Procedure 1, however, one can automatically determine the required levels of accuracy in order to obtain a solution with an arbitrarily small optimality gap. This claim is proven in Lemma 9.
Lemma 9 Let (Ω, Σ, μ) be an atomless measure space, let g ∈ L¹(Ω, μ), let 0 < Δ ≤ μ(Ω), let δ > 0, and let D be the set returned by Procedure 1. Then D is μ-measurable, satisfies μ(D) ≤ Δ, and is nearly a solution of Eq. (8) in the sense that its objective value exceeds the optimal value of (8) by a margin controlled by δ.
Using Procedure 1, we can state the main trust-region loop in Algorithm 2. We allow for the use of an approximation g̃ ∈ L¹(Ω, μ) of the gradient density g_U ∈ L¹(Ω, μ), assuming that it can be made arbitrarily accurate in the L¹ norm.
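The interplay of step determination, ratio test, and radius update can be sketched end to end on the toy quadratic functional used earlier (all problem data ours, with a simplified radius rule: double on acceptance, halve on rejection). The step simply takes the most negative part of the gradient density that fits the trust region, and the loop stops at ε-stationarity.

```python
# Sketch of a trust-region loop in the spirit of Algorithm 2 on the toy
# functional J(U) = 0.5*(int_U c dx - t)^2 over [0, 1]. Step = sublevel
# selection of the gradient density; acceptance by the usual ratio test.

n = 500
h = 1.0 / n
c = [1.0 + i * h for i in range(n)]
t = 0.6

def J(U):
    s = sum(c[i] * h for i in range(n) if U[i]) - t
    return 0.5 * s * s

def density(U):
    s = sum(c[i] * h for i in range(n) if U[i]) - t
    return [(-s if U[i] else s) * c[i] for i in range(n)]

U = [False] * n
Delta, Delta_max = 0.05, 0.5
sigma1, eps = 0.1, 0.01

for it in range(1000):
    g = density(U)
    negative = [i for i in range(n) if g[i] < 0]
    instat = -sum(g[i] * h for i in negative)
    if instat <= eps:
        break                                   # eps-stationary: stop
    negative.sort(key=lambda i: g[i])
    D = set(negative[: max(1, int(Delta / h))]) # most negative part, <= Delta
    pred = -sum(g[i] * h for i in D)            # predicted decrease > 0
    trial = [u ^ (i in D) for i, u in enumerate(U)]
    rho = (J(U) - J(trial)) / pred              # ratio test
    if rho >= sigma1:
        U = trial                               # accept step
        Delta = min(2 * Delta, Delta_max)
    else:
        Delta = 0.5 * Delta                     # reject and shrink
print(it, J(U))                                 # near-stationary objective
```

The radius automatically shrinks when the linear model overshoots (here: when a step would push the tracked integral past its target) and grows again when the model is trustworthy, which is exactly the mechanism exploited in the convergence proof below.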

Theorem 3
Let Ω, Σ, Y, J, μ, ν, J, and C > 0 satisfy Assumptions 1.1 to 1.7. Furthermore, let U_0 ∈ Σ, Δ_max ∈ (0, μ(Ω)), Δ_0 ∈ (0, Δ_max), ε > 0, 0 < σ_1 < σ_2 ≤ 1, and ω ∈ (0, 1) with ω < (3 − 3σ_1)/(3 − 2σ_1). Let J(Σ) be bounded from below, and let L > 0 be a constant as in Lemma 3. Then Algorithm 2 terminates in finite time and yields i ∈ N_0 and U_i ∈ Σ such that U_i is ε-stationary.
Proof For given i ∈ N_0, let S_i := L_{g_{U_i}<0} ⊆ Ω and S̃_i := L_{g̃_i<0} ⊆ Ω. If the stationarity test in Line 4 succeeds, then the instationarity of U_i is below the tolerance. We note that the initial choice of Δ_0 ≤ Δ_max ≤ μ(Ω) and the fact that Δ_i is never increased above Δ_max guarantee that Δ_i/μ(Ω) ≤ 1 for all i. If the test fails, then the true instationarity exceeds a fixed positive fraction of ε. Therefore, the stationarity test will succeed at some point as the solution becomes stationary, but it will succeed only for solutions that satisfy the tolerance ε. According to Lemma 9, D_i is such that for every D′ ∈ Σ with μ(D′) ≤ Δ_i, the predicted decrease of D_i is at least a fixed fraction of that of D′. According to Lemma 4, there exists a D* ∈ Σ with μ(D*) ≤ Δ_i whose predicted decrease is proportional to the instationarity. For every accepted step, we therefore obtain a guaranteed decrease in J. Next, we prove that there exists Δ̃ > 0 such that step D_i is accepted for all i ∈ N with Δ_i ≤ Δ̃. From (10), we can invoke Lemma 3 to bound the deviation of the actual from the predicted decrease for all i in which the stationarity test does not detect stationarity. For these iterations, we can then bound the acceptance ratio ρ_i from below. Note that 1 − ω/(3 − 2ω) > σ_1 if and only if ω < (3 − 3σ_1)/(3 − 2σ_1). Thus, we find that ρ_i ≥ σ_1 for all i with sufficiently small Δ_i. This implies that there is never an endless loop of rejected steps. In addition, because Δ_i is only halved upon rejection, it follows that Δ_i ≥ min{Δ̃, Δ_0}/2 for all i. When substituted into our prior estimate of the decrease per accepted step, this yields a uniform lower bound on the decrease per accepted step. The fact that J is bounded from below therefore implies that the number of accepted steps is finite. Because there is no endless loop of rejected steps, the algorithm must terminate, and the only manner in which it can do so is if the stationarity test succeeds after a finite number of steps.

Experiments
In this section, we present three test problems. The first two problems are optimal control problems and are intended to show the viability of our method for optimal control. The first problem, presented in Sect. 4.1, is a mesh-dependent PDE-constrained source inversion problem for the Poisson equation in two dimensions. In contrast, the second test problem, presented in Sect. 4.2, is constrained by the Lotka-Volterra ODE system. The third test problem, presented in Sect. 4.3, is a topology optimization problem based on the linearized elasticity equations and is inspired by [4].

Source inversion for the Poisson equation
Our first test problem is a source inversion problem using a weak form of the Poisson equation with Dirichlet boundary conditions. It has the form min_{y,w} where Ω := [0, 1]^2, H^1_0(Ω) denotes the Banach space of functions in L^2(Ω) that have one weak derivative in L^2(Ω) and whose trace vanishes, ȳ ∈ H^1_0(Ω) denotes a fixed reference function, and w * k denotes the convolution where σ controls the severity of the blurring effect and τ > 0 is a threshold value. We note that ∫_{R^2} k(x) dx = 1. For this test, we choose σ := 0.1, τ := 0.01, and α := 10^{-5} as problem parameters. The cutoff threshold τ ensures a degree of sparsity in the mollification operator w ↦ w * k.
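As a rough illustration of the mollifier's properties, the following sketch numerically checks that a normalized two-dimensional Gaussian kernel integrates to 1 and that the cutoff τ leaves only a small fraction of grid values nonzero. The exact kernel formula is not reproduced above, so the Gaussian form used here is an assumption; σ = 0.1 and τ = 0.01 are the parameters stated in the text.

```python
import math

def gaussian_kernel(x, y, sigma):
    """2-D Gaussian mollifier, normalized to integrate to 1 over R^2
    (assumed form; the paper's exact kernel is not reproduced here)."""
    return math.exp(-(x * x + y * y) / (2 * sigma * sigma)) \
        / (2 * math.pi * sigma * sigma)

def kernel_stats(sigma=0.1, tau=0.01, half_width=1.0, n=400):
    """Midpoint-rule integral of k and the sparsity induced by the
    cutoff tau on an n x n grid over [-half_width, half_width]^2."""
    h = 2 * half_width / n
    integral, nonzero = 0.0, 0
    for i in range(n):
        for j in range(n):
            x = -half_width + (i + 0.5) * h
            y = -half_width + (j + 0.5) * h
            v = gaussian_kernel(x, y, sigma)
            integral += v * h * h
            if v >= tau:              # values below tau are truncated to 0
                nonzero += 1
    return integral, nonzero / (n * n)
```

With σ = 0.1 and τ = 0.01, only roughly 12% of grid values survive the cutoff, which is what makes the assembled convolution operator sparse.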
In order to make the problem accessible to our method, we follow the approach laid out in Sect. 3.3.2 with X = H^1_0(Ω), Z = H^1_0(Ω)* and It is easy to verify that j is continuously Fréchet differentiable in y and w. Furthermore, it is well known that the bilinear form a satisfies the conditions of the Lax-Milgram lemma and is strongly coercive. Therefore, f is continuously Fréchet differentiable in y and the derivative f_y has a bounded inverse. It remains to show that the linear operator w ↦ L(w * k, v) = ⟨w * k, v⟩_{L^2} is bounded. This follows from Young's convolution inequality, which states that In conjunction with the Cauchy-Schwarz inequality, this shows the boundedness of w ↦ L(w * k, v). We note that the Lax-Milgram lemma also guarantees that the weak equation always has a solution, meaning that the problem itself satisfies Assumption 3. As stated in Sect. 3.3.2, we then choose the Lebesgue measure as μ, ν(U) := χ_U, Y = L^1(Ω), and J(w) := j(y_w, w). We discretize the problem using a finite element method on a triangle mesh. To generate the initial mesh, we subdivide the domain into 32 equally large slices along both axes, which yields 1024 equally sized squares. Each square is then subdivided into four triangles along both of its diagonals, yielding a triangle mesh with 4096 cells, each of which has an area of 1/4096 and is contained within a ball of radius 1/64 centered on the midpoint of its longest side. If we denote each cell of this initial mesh by T_j^{(1)} and the ball by B_j^{(1)}, then we have For local mesh refinement, we use the two-dimensional skeleton-based refinement algorithm described by Plaza and Carey in [29].
Because our initial triangles are isosceles with a height that is exactly half the length of their base, the triangles resulting from this refinement always have the same eccentricity bound, meaning that this form of refinement satisfies Assumption 3.9 irrespective of the order in which triangles are refined.
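The geometric quantities of the initial mesh can be verified with exact rational arithmetic; this small check mirrors the construction described above (32 × 32 squares, four triangles per square, containing balls centered on the midpoint of the longest side).

```python
from fractions import Fraction

def initial_mesh_stats(n=32):
    """Geometry of the initial mesh: n x n squares on [0, 1]^2, each
    split into four triangles along both diagonals."""
    side = Fraction(1, n)          # square side length
    n_cells = 4 * n * n            # four triangles per square
    # each triangle has base `side` and height `side / 2`
    cell_area = side * (side / 2) / 2
    # apex and both base endpoints lie at distance side/2 from the
    # midpoint of the base (the triangle's longest side)
    ball_radius = side / 2
    return n_cells, cell_area, ball_radius
```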
To numerically solve the PDE in (13), we use a finite element method with continuous first-order Lagrange elements. The cellwise averages of the gradient density function are determined from the gradient of the objective with respect to the cell values of a piecewise constant function as described in Sect. 3.3.2.
To determine the set D_2 in Procedure 1, we approximately solve a subset sum problem using a standard fully polynomial-time approximation scheme based on dynamic programming, described, e.g., in [21]. If we cannot reach a solution within the size margins, we refine all triangles that are contained within the candidate set L_{g̃_i ≤ η_2} \ L_{g̃_i ≤ η_1} and re-solve the PDE on the refined mesh. We briefly addressed the validity of this approach in Sect. 3.
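The subset-sum step can be sketched as follows: cell areas are scaled to integers and a dynamic program finds a subset whose total measure comes as close as possible to the target without exceeding it. The function name and the scaling factor are illustrative; the rounding step is what makes this an approximation scheme rather than an exact method.

```python
def subset_sum_select(areas, target, scale=10_000):
    """Pick a subset of cell areas whose sum is as close to `target`
    as possible without exceeding it, via subset-sum dynamic
    programming on integer-scaled areas (illustrative sketch)."""
    w = [round(a * scale) for a in areas]
    cap = round(target * scale)
    # reachable[s] = (item index, previous sum) used to reach sum s
    reachable = {0: None}
    for idx, wi in enumerate(w):
        for s in sorted(reachable, reverse=True):   # snapshot: 0/1 use
            t = s + wi
            if t <= cap and t not in reachable:
                reachable[t] = (idx, s)
    best = max(reachable)
    chosen, s = [], best
    while reachable[s] is not None:                 # backtrack
        idx, prev = reachable[s]
        chosen.append(idx)
        s = prev
    return sorted(chosen), best / scale
```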
To determine the reference state ȳ ∈ H^1_0(Ω), we approximately solve the problem where N = 12 and This reference problem is re-solved every time the mesh is refined. We use the arctangent to ensure that the pointwise value of the resulting right-hand side function remains within the interval [0, 1] and is therefore similar in magnitude to the convolutions used in the inversion problem. For the remaining algorithmic parameters of Algorithm 2, we choose σ_1 := 0.3, σ_2 := 0.7, ω := 0.01, and ε := 10^{-8}. The finite element discretization of the PDE is performed using FEniCS 2019.1.0 [1, 23-25]. Local refinement is performed using FEniCS's built-in refine function, which implements the method described in [29]. Although the algorithm is technically parallelizable if the convolution operator is sufficiently sparse, we execute it in a single thread and record the CPU times needed for five components of the solver: initial setup (init), PDE solves (solve), step determination (step), mesh refinement (refine), and state recording (record).
We note that the single-threaded implementation of the trust-region loop does not preclude multithreading within libraries such as FEniCS. The overall trust-region loop terminates after 187 iterations with an objective function value of 1.309 · 10^{-6} after having reached the optimality threshold. On a test machine with an Intel i5-10210U quad-core CPU, this takes 167.79 CPU seconds. The precise breakdown of the CPU times used by the components of the solver is displayed in Table 1. Due to measurement and rounding errors, these numbers do not always add up to the exact totals. Figure 2 shows the progression of the objective function value, runtime per iteration, instationarity, and step size over the iterations of the trust-region loop. A selection of iterates is shown in Fig. 3.
As we expect with a first-order method, we see a clear pattern of diminishing returns in the tail end of the iteration.
Step lengths and instationarity level out after a certain point. Meanwhile, the time per iteration increases over time, especially during those iterations requiring mesh refinement. Mesh refinement, which includes recalculation of the reference solution and reassembly of the convolution operator, accounts for 87% of the total runtime of the algorithm. By contrast, the step-finding procedure accounts for approximately 2.8% of the total runtime. While mesh refinement is a necessary component of our method, we can interpret this as an indication that our method would perform even better on a problem where mesh refinement can be implemented more efficiently.

Lotka-Volterra fishing problem
The Lotka-Volterra fishing problem is a test problem from ODE-constrained binary optimal control. It is described in [30, 32] and is based on the classical Lotka-Volterra predator-prey system with one predator and one prey species. The goal is to minimize the deviation of the predator and prey populations from an equilibrium state in the L^2 norm. The optimization problem has the form min_{y,w} where the right-hand side of the ODE system (14b) is given by and the integrand of the Lagrange term is given by For the purposes of this test, we use the parameter values set in [32]: The optimal solution of (14) with relaxed controls w(t) ∈ [0, 1] is known to have a singular arc whose behavior cannot be replicated exactly by binary controls. The binary solution therefore always exhibits chattering behavior. In order to apply our method to this problem, we follow the approach laid out in Sect. 3.3.1, where we had used compatible notation. Given that f and l are polynomials, both are twice differentiable. As functions of w, they are also affine-linear. In order to identify a suitable set D such that f is uniformly Lipschitz continuous with respect to y, we must first establish an a priori bound on the component values of f for [0, 1]-valued controls. We first observe that Given that y_{0,1} > 0, we have 0 < y_1(t) ≤ y_{0,1} · e^t. We also have f_2(y, w) = y_2 · (−1 + y_1 − c_2 w) ≤ y_2 · y_1 ≤ y_2 · y_{0,1} · e^{t_f}, which implies 0 < y_2(t) ≤ y_{0,2} · e^{t_f · y_{0,1} · e^{t_f}}. Thus, we can restrict the function f to the bounded set D := (−1, y_{0,1} · e^{t_f} + 1) × (−1, y_{0,2} · e^{t_f · y_{0,1} · e^{t_f}} + 1).
Since the closure of D is compact, f is uniformly Lipschitz continuous in y. We note that this choice also satisfies Assumption 2.5 with ε = 1.
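The a priori bound 0 < y_1(t) ≤ y_{0,1} · e^t can be checked numerically. Since the parameter values from [32] are not reproduced above, the values below (c_1 = 0.4, c_2 = 0.2, y_0 = (0.5, 0.7), t_f = 12) are assumptions matching the commonly used benchmark setup, with a constant control w ≡ 1 and a classical Runge-Kutta integrator standing in for CVODES.

```python
import math

def lv_rhs(t, y, w, c1=0.4, c2=0.2):
    """Lotka-Volterra fishing dynamics; parameter values are assumed
    (see [32] for the benchmark setup)."""
    y1, y2 = y
    return (y1 * (1.0 - y2 - c1 * w),
            y2 * (-1.0 + y1 - c2 * w))

def rk4(f, y0, t0, tf, n, w):
    """Classical 4th-order Runge-Kutta with a constant control w."""
    h = (tf - t0) / n
    t, y = t0, list(y0)
    traj = [(t, tuple(y))]
    for _ in range(n):
        k1 = f(t, y, w)
        k2 = f(t + h / 2, [y[i] + h / 2 * k1[i] for i in range(2)], w)
        k3 = f(t + h / 2, [y[i] + h / 2 * k2[i] for i in range(2)], w)
        k4 = f(t + h, [y[i] + h * k3[i] for i in range(2)], w)
        y = [y[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(2)]
        t += h
        traj.append((t, tuple(y)))
    return traj
```

Along the computed trajectory both populations stay positive and y_1 remains below the exponential envelope, consistent with the argument above.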
We observe that the objective function is an integral over time. Therefore, control changes will likely have a greater impact if they occur early in the domain [0, t_f]. This is not always the case: if the system is asymptotically stable, then control changes can have the same impact regardless of when they are made. The Lotka-Volterra system, absent any control input, exhibits periodic behavior. It is therefore reasonable to assume that the impact of a control change is roughly proportional to the amount of time remaining until the end of the time horizon.
To counteract this effect, we make use of the weight function m that we had allowed for in Sect. 3.3.2. Bearing in mind that m may not assume the value 0, we choose Accordingly, the measure μ is given by where Σ is still the Lebesgue σ -algebra on [0, t f ]. The vector measure ν is once more given by ν(U ) := χ U , and Y = L 1 ([0, t f ]). In these circumstances, Sect. 3.3.1 states that the gradient density function g U in U ∈ Σ can be derived from the costate function λ χ U via We use CVODES from the SUNDIALS suite 2 [19] to solve the initial value problem. CVODES supports adjoint sensitivity analysis and allows us to record the value of g U (t) using callbacks. This means that we are not bound by a fixed control grid and can always use the grid chosen by the integrator. We note that some care must be taken to accurately record the sign flips of g U that occur at the boundary of U .
To determine the set D_2 in Procedure 1, we select time points from the candidate set in decreasing order. Because there is no fixed time grid, the trust-region radius is always matched exactly.
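The grid-free step determination can be sketched as follows: the integrator's intervals, each carrying an average gradient density value, are consumed in order of most negative value, and the last interval is split so that the step's measure matches the trust-region radius exactly. The data layout and function name are illustrative assumptions.

```python
def select_step_intervals(intervals, delta):
    """intervals: list of (length, g_avg) pairs from the adaptive
    integrator's grid. Returns (index, used_length) pairs whose total
    length equals delta exactly, preferring the most negative g_avg.
    Illustrative sketch; not the paper's implementation."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][1])
    step, remaining = [], delta
    for i in order:
        length, g = intervals[i]
        if g >= 0 or remaining <= 0:
            break
        use = min(length, remaining)   # split the final interval if needed
        step.append((i, use))
        remaining -= use
    return step, delta - remaining
```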
The algorithmic parameters chosen for this test are CVODES is configured to use the Adams-Moulton method with relative and absolute tolerances fixed to 10^{-10} for both the forward and adjoint runs. In total, our algorithm requires 90 iterations of the outer trust-region loop, which are performed in 16.72 s (wall time). The final objective function value is 1.34424. In Sect. 4.1, mesh refinement contributed significantly to the algorithm's runtime. Here, mesh refinement is performed implicitly by the adaptive integrator and does not require the reassembly of large matrices. Therefore, we omit a detailed breakdown of the CPU times for this test. Figure 4 depicts the development of the objective function value, instationarity, step size, and wall time per iteration over the course of the trust-region iteration. Plots of the ODE solution and gradient density function for several iterations are given in Fig. 5.
Once more, we observe that the initial improvements are substantial and level out toward the end of the iteration. Moreover, the fast overall execution time may be on par with that of relaxation solvers used in first-discretize-then-optimize methods and demonstrates that Algorithm 2 is useful in an ODE setting, where it can benefit from adaptive solvers. As we note in our conclusions, this is one of the advantages of our algorithm over conventional enumerative MINLP methods.

Topology optimization
In this section, we discuss a topology optimization problem inspired by the problem discussed in Chapter 1 of [4]. We preface this by emphasizing that our method is not intended as a competitive topology optimization method. The goal is to distribute an isotropic material in Ω in a way that minimizes compliance if the material is fixed via a Dirichlet boundary condition on Γ_D and carries both its own weight and an external weight attached via a Neumann boundary condition on Γ_N. The structure of the domain is illustrated in Fig. 6. Let w : Ω → {0, 1} denote the control function, which specifies whether material is placed at a given point, and let y : Ω → R^2 denote the displacement function, which assigns to each point in Ω its displacement in equilibrium. The equations relating w with y are not uniquely solvable where w = 0, that is, where no material is placed at all. The gravitational force acting on the material is given by the force per unit volume The function y is then the solution of the boundary value problem in Ω, where n denotes the outer unit normal of Ω.
The weak formulation of this boundary value problem draws y from the solution space where ·|_{Γ_D} denotes the trace on Γ_D. The weak formulation of the boundary value problem then takes the form where Since p(w) is bounded away from zero for all w ∈ [0, 1], the bilinear form (y, v) ↦ a(y, v; w) is both bounded and strongly elliptic for all w ∈ L^1(Ω) ∩ L^∞(Ω), which guarantees that for every w ∈ L^1(Ω) with w(x) ∈ [0, 1] almost everywhere, there exists a unique function y_w ∈ V that satisfies (15). The objective function is a weighted sum of material cost and compliance. As stated in [4], it is more common to make either cost or compliance the objective and to limit the other using a constraint. However, our method cannot yet accommodate constraints; therefore, we choose a weighted sum. The objective functional is where the penalty parameter α is fixed to 10^6.
Following the approach laid out in Sect. 3.3.2, our problem has the form min y,w j(y, w) Once more, the Lax-Milgram lemma shows that the Fréchet derivative f y (y, w) has a bounded inverse. Given that both w → (v → a(y, v; w)) and w → (v → L(v; w)) are bounded linear operators with respect to the L 1 norm of w, the map w → y w is continuously Fréchet differentiable as a map from L 1 (Ω) to V . In conjunction with the easily verifiable continuous Fréchet differentiability of j : V × L 1 (Ω) → R, this means that the problem satisfies Assumption 3. The strict limits on Δ avoid large steps that disconnect chunks of material from the fixed boundary Γ D . Whenever this occurs, gradients escalate and require aggressive error control that is beyond the scope of this discussion.
For numerical approximation, we use the same discretization and refinement method described in Sect. 4.1. For the initial mesh generation, we subdivide the domain into 100 equally sized slices along the x axis and 50 equally sized slices along the y axis. We use FEniCS 2019.1.0 to implement the finite-element discretization and approximate the density function using the objective gradient with respect to control DOFs as described in Sect. 3.3.2.
As opposed to Sect. 4.1, we do not solve a subset sum problem to determine D_2 but simply select triangles from the candidate set L_{g̃_i ≤ η_2} \ L_{g̃_i ≤ η_1} in descending order of their area, skipping those that are too large to fit into the remaining size margin. Refinement is triggered if the result is smaller than the lower size bound for D_2. Figure 7 shows how the objective function value, time per iteration, instationarity, and the step quality measure ρ develop over the iterations of the outer trust-region loop. Because of numerical issues, the step length must be kept from becoming either too small or too large. We therefore tune the parameters such that the step is adjusted as little as possible; in the given test run, the step length was never adjusted. For this reason, we show the prediction quality ρ_i instead of the step length. This illustrates the significance of error control, because the prediction quality decays drastically when numerical errors start exceeding the actual improvements in the objective. The algorithm terminates after 26 iterations due to meeting the instationarity threshold and takes a total of 484.94 s (wall time) on a test machine with an Intel i5-10210U quad-core CPU. We depict control sets from various iterations in Fig. 8 (the plots show the control set U_i for selected iterations of the topology optimization problem; the mesh has been warped by 100 · y_{χ_{U_i}} to illustrate how the design affects compliance). When examining Fig. 8 carefully, we can see that the algorithm starts to "thin out" the structure towards the end of the iteration. This is likely an artifact of the penalty approach we have chosen: once a good balance between compliance and cost is found, the algorithm is incentivized to thin out the structure to save cost as long as the increased compliance adds less to the objective. This "thinning out" of the structure leads to escalating gradients and numerical issues and forces us to stop the iteration at a relatively high instationarity.
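The descending-area selection rule described at the start of this section can be sketched as follows; the function returns the chosen triangles, their total measure, and whether the undershoot of the lower margin would trigger refinement. Names and margins are illustrative.

```python
def select_triangles(areas, lower, upper):
    """Pick candidate triangles in descending order of area, skipping
    any that would overflow the upper size margin; report whether
    refinement is needed because the result undershoots the lower
    margin. Illustrative sketch."""
    order = sorted(range(len(areas)), key=lambda i: areas[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        if total + areas[i] <= upper:   # skip triangles that overflow
            chosen.append(i)
            total += areas[i]
    needs_refinement = total < lower
    return chosen, total, needs_refinement
```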
Another problem is the appearance of checkerboard patterns. Given that our method requires high degrees of local mesh refinement, non-physical microstructures are nearly unavoidable. While such structures are intermittently observable around joints in the structure, they do not seem to appear on a large scale. This may be due to the fact that our solutions are not optimal for the given mesh but are instead based on approximations of the density function g_U, which is derived from the underlying infinite-dimensional problem.
Checkerboard patterns are, to a certain extent, a side effect of the selected discretization method and can be mitigated by a careful choice of such methods. For instance, there exist approaches in the field of topology optimization, e.g., in [27], that allow level set boundaries to pass through the interior of a cell. In conjunction with the use of higher-order finite elements, this can mitigate or avoid the issue of checkerboard patterns. While such methods exceed the scope of this paper, we have attempted to describe our method without making overly restrictive assumptions on numerical methodology so that it can be integrated into a variety of different problems and solution approaches.
As it stands, our method should not be seen as a competitive topology optimization method. Rather, we present these results as a proof of concept to show that future extensions of this method may also be applicable to the field of topology optimization.

Conclusions and outlook
In this paper, we present a trust-region algorithm that solves binary optimal control problems by iteratively improving on existing solutions. We exploit the fact that although the controls in these problems are binary valued, they represent points in a continuum of measurable sets. Within the metric space formed by these measurable sets, some objective functions can be shown to be differentiable in a manner that allows for relatively straightforward construction of steepest-descent steps. As a result, we are able to design an algorithm that is almost completely analogous to a conventional steepest-descent trust-region method and whose asymptotic behavior can be proven in a similar way.
We have not extensively compared the performance of our method with that of other methods. Outside of the field of topology optimization, where similar methods already exist, it is generally difficult to design fair comparisons to other methods. Many optimal control methods assume fixed or uniformly refined control meshes, which makes it difficult to find "fair" parameters for comparisons. Enumerative techniques such as branch-and-bound on a fixed control mesh, for instance, suffer from an extreme disadvantage if the control mesh is too fine, whereas our method would be expected to become more accurate with refinement.
We therefore present this work as a proof of concept in the hope that, as one method among many, it may expand the scope of practically solvable optimal control problems. We have kept very closely to the theoretical framework used to validate conventional NLP methods in the hope that, in the future, it may become possible to transfer more of conventional NLP theory to this setting, which may enable constrained optimization or higher-order methods to be transferred to measure spaces. If this could be merged with continuous optimal control methods, it could give rise to a category of fast methods for mixed-integer optimal control with both ODE and PDE constraints.