In this section, we state the algorithm and prove its correctness. We split this discussion into four subsections. Section 3.1 defines the topological gradient as we use it and shows how to derive it from Fréchet derivatives. Section 3.2 states a gradient-based necessary optimality criterion. Section 3.3 derives the gradient density function for two types of optimal control problems. Section 3.4 states the algorithm and its subroutines and proves that the algorithm achieves stationarity in the limit. We formulate the algorithm in a form that allows for inexactness in some steps to ensure that the procedure remains practically implementable.
Throughout this section, we use the following assumptions.
Assumption 1
Let \(\varOmega \), Y, \(\varSigma \subseteq 2^\varOmega \), \(\mu :\varSigma \rightarrow {\mathbb {R}}_+ \cup \lbrace \infty \rbrace \), \(\nu :\varSigma \rightarrow Y\), \(J :Y \rightarrow {\mathbb {R}}\), and \({\mathcal {J}} :\varSigma \rightarrow {\mathbb {R}}\) satisfy the following assumptions:
1. \((\varOmega , \varSigma )\) is a measurable space, and Y is a Banach space;
2. \(\nu \) is a vector measure of bounded variation;
3. J is Fréchet differentiable;
4. \({\mathcal {J}} = J \circ \nu \);
5. \(\mu \) is a finite measure;
6. there exists \(C > 0\) such that \(\Vert \nu (D) \Vert _Y \le C \cdot \mu (D)\) for all \(D \in \varSigma \); and
7. \((\varOmega , \varSigma , \mu )\) is atomless.
Taylor expansion
In traditional nonlinear optimization, we make use of the fact that sufficiently smooth functions are locally approximated by truncations of their Taylor series. We transfer this property from the Fréchet differentiable objective J to \({\mathcal {J}}\) using Assumption 1.4. This requires the chain rule.
Lemma 2
Let \(\varOmega \), \(\varSigma \), Y, \(\mu \), and \(\nu \) satisfy Assumptions 1.1, 1.2, 1.5 and 1.6; let \(U \in \varSigma \); and let \(T_U :Y \rightarrow {\mathbb {R}}\) be a bounded linear form. Then \(\varphi _U :\varSigma \rightarrow {\mathbb {R}}\) with
$$\begin{aligned} \varphi _U(D) :=T_U \bigl ( \nu (D {\setminus } U) - \nu (D \cap U) \bigr ) \quad \forall D \in \varSigma \end{aligned}$$
is a finite signed measure that is absolutely continuous; i.e., \(\varphi _U \ll \mu \).
Proof
Because \(T_U\) is bounded, there exists \(M > 0\) such that \(|T_U y| \le M \Vert y\Vert _Y\) for all \(y \in Y\). Along with Assumption 1.6, this implies that for all \(D \in \varSigma \),
$$\begin{aligned} |\varphi _U(D)|&\le M \cdot \bigl \Vert \nu (D {\setminus } U) - \nu (D \cap U) \bigr \Vert _Y \\&\le M \cdot \bigl ( \Vert \nu (D {\setminus } U) \Vert _Y + \Vert \nu (D \cap U) \Vert _Y \bigr ) \\&\le M C \cdot \mu \bigl ( (D {\setminus } U) \cup (D \cap U) \bigr ), \end{aligned}$$
which shows that \(|\varphi _U(D)| < \infty \) and that \(\varphi _U \ll \mu \), once \(\varphi _U\) is shown to be a signed measure.
Because \(T_U\) is linear and \(\nu \) is a vector measure, \(\varphi _U\) is finitely additive, and we have \(\varphi _U(\emptyset ) = 0\). To show \(\sigma \)-additivity, let \((D_i)_{i \in {\mathbb {N}}} \subseteq \varSigma \) consist of pairwise disjoint measurable sets. For \(N \in {\mathbb {N}}\), the finite additivity of \(\varphi _U\) implies
$$\begin{aligned} \Bigl | \sum _{i = 1}^\infty \varphi _U(D_i) - \varphi _U\Bigl ( \bigcup _{i = 1}^\infty D_i \Bigr ) \Bigr |&= \Bigl |\mathop {\sum _{i = N + 1}^\infty } \varphi _U(D_i) - \varphi _U\Bigl ( \mathop {\bigcup _{i = N + 1}^\infty } D_i \Bigr ) \Bigr | \\&\le \Bigl | \mathop {\sum _{i = N + 1}^\infty } \varphi _U(D_i) \Bigr | + \Bigl | \varphi _U\Bigl ( \mathop {\bigcup _{i = N + 1}^\infty } D_i \Bigr ) \Bigr |. \end{aligned}$$
For every \(i \in {\mathbb {N}}\), we have \(|\varphi _U(D_i)| \le M C \cdot \mu (D_i)\). This implies
$$\begin{aligned} \Bigl | \mathop {\sum _{i = N + 1}^{\infty }} \varphi _U(D_i) \Bigr |\le & {} \mathop {\sum _{i = N + 1}^\infty } \bigl | \varphi _U(D_i) \bigr | \le M C \cdot \mathop {\sum _{i = N + 1}^\infty } \mu (D_i) \\= & {} M C \cdot \mu \Bigl ( \mathop {\bigcup _{i = N + 1}^\infty } D_i \Bigr ) \xrightarrow {N \rightarrow \infty } 0. \end{aligned}$$
Similarly, we have
$$\begin{aligned} \Bigl | \varphi _U \Bigl ( \mathop {\bigcup _{i = N + 1}^\infty } D_i \Bigr ) \Bigr | \le M C \cdot \mu \Bigl ( \mathop {\bigcup _{i = N + 1}^\infty } D_i \Bigr ) \xrightarrow {N \rightarrow \infty } 0. \end{aligned}$$
In both cases, convergence to zero is guaranteed by the fact that \((D_i)_{i \in {\mathbb {N}}}\) is a sequence of pairwise disjoint subsets of \(\varOmega \), which has finite measure. In total, this means that
$$\begin{aligned} \Bigl | \sum _{i = 1}^\infty \varphi _U(D_i) - \varphi _U\Bigl ( \bigcup _{i = 1}^\infty D_i \Bigr ) \Bigr | = 0, \end{aligned}$$
which proves that \(\varphi _U\) is \(\sigma \)-additive. The series \(\sum _{i} \varphi _U(D_i)\) converges absolutely because its limit is the same for every rearrangement of the sequence \((D_i)_{i \in {\mathbb {N}}}\). \(\square \)
By using the Fréchet derivative \(J'\bigl (\nu (U)\bigr )\) as \(T_U\) in Lemma 2, we can prove the existence of a finite signed measure \({\mathcal {J}}'(U)\) that acts as a first-order derivative of \({\mathcal {J}}\). With this measure, we can formulate a local first-order Taylor expansion of \({\mathcal {J}}\) around U.
Theorem 1
(Local First-Order Taylor Expansion) Let \(\varOmega \), \(\varSigma \), Y, J, \(\mu \), \(\nu \), and \({\mathcal {J}}\) satisfy Assumptions 1.1 to 1.6. For every \(U \in \varSigma \), let \({\mathcal {J}}'(U) :\varSigma \rightarrow {\mathbb {R}}\) be given by
$$\begin{aligned} {\mathcal {J}}'(U)(D) :=J'(\nu (U)) \bigl ( \nu (D {\setminus } U) - \nu (D \cap U) \bigr ) \quad \forall D \in \varSigma . \end{aligned}$$
(2)
Then \({\mathcal {J}}'(U)\) is a finite signed measure with \({{\mathcal {J}}'}(U) \ll \mu \), and
$$\begin{aligned} {\mathcal {J}}(U \mathbin {\triangle }D) = {\mathcal {J}}(U) + {\mathcal {J}}'(U)(D) + o\bigl ( \mu (D) \bigr ) \quad \forall D \in \varSigma . \end{aligned}$$
(3)
Proof
Let \(U \in \varSigma \). Because J is Fréchet differentiable in \(\nu (U)\), \(J'(\nu (U))\) is a bounded linear operator. Lemma 2 shows that \({\mathcal {J}}'(U)\) as defined in (2) is a finite signed measure and that \({\mathcal {J}}'(U) \ll \mu \). Let \({\mathcal {R}}_U :\varSigma \rightarrow {\mathbb {R}}\) be given by
$$\begin{aligned} {\mathcal {R}}_{U}(D) :={\left\{ \begin{array}{ll} \frac{1}{\mu (D)} \bigl ( {\mathcal {J}}(U \mathbin {\triangle }D) - {\mathcal {J}}(U) - {\mathcal {J}}'(U)(D) \bigr ), &{}\quad \mu (D) \ne 0 \\ 0, &{}\quad \mu (D) = 0 \end{array}\right. } \quad \forall D \in \varSigma . \end{aligned}$$
To establish (3), we need to show that \({\mathcal {R}}_U(D) \rightarrow 0\) for \(\mu (D) \rightarrow 0\). Let \((D_i)_{i \in {\mathbb {N}}}\) be a sequence in \(\varSigma \) such that \(\mu (D_i) \rightarrow 0\), and let
$$\begin{aligned} d_i :=\nu (U \mathbin {\triangle }D_i) - \nu (U) = \nu (D_i {\setminus } U) - \nu (D_i \cap U) \qquad \forall i \in {\mathbb {N}}. \end{aligned}$$
(4)
Then we have \(\Vert d_i\Vert _Y \le C \cdot \mu (D_i)\) for all \(i \in {\mathbb {N}}\) with \(C > 0\) from Assumption 1.6. Therefore, \(d_i \rightarrow 0\) for \(i \rightarrow \infty \). If \(d_i = 0\), then \({\mathcal {J}}(U \mathbin {\triangle }D_i) = {\mathcal {J}}(U)\) and \({\mathcal {J}}'(U)(D_i) = 0\), which implies \({\mathcal {R}}_U(D_i) = 0\). Without loss of generality, we consider only sequences where \(\mu (D_i) > 0\) and \(\Vert d_i\Vert _Y > 0\) for all \(i \in {\mathbb {N}}\). By combining (2), (4), and the Fréchet differentiability of J, we can conclude that
$$\begin{aligned} {\mathcal {R}}_U(D_i)&{\mathop {=}\limits ^{(2)}} \frac{1}{\mu (D_i)} \Bigl ( J\bigl (\nu (U \mathbin {\triangle }D_i)\bigr ) - J\bigl (\nu (U)\bigr ) - J'\bigl (\nu (U)\bigr ) d_i \Bigr ) \\&{\mathop {=}\limits ^{(4)}} \underbrace{\frac{\Vert d_i\Vert _Y}{\mu (D_i)}}_{\in (0, C]} \frac{1}{\Vert d_i\Vert _Y} \Bigl ( \underbrace{J\bigl (\nu (U) + d_i\bigr ) - J\bigl (\nu (U)\bigr ) - J'\bigl (\nu (U)\bigr ) d_i}_{= o({\Vert d_i \Vert }_Y)} \Bigr ) \xrightarrow {i \rightarrow \infty } 0, \end{aligned}$$
which proves (3). \(\square \)
Theorem 1 can be extended to include higher-order derivatives. This requires a technical measure extension argument and does not contribute to the method under discussion.
Because \(\mu \) and \({\mathcal {J}}'(U)\) are finite and \({\mathcal {J}}'(U) \ll \mu \), \({\mathcal {J}}'(U)\) has a \(\mu \)-integrable density function that we use to construct steepest descent steps.
Corollary 1
Let \(\varOmega \), \(\varSigma \), \(\mu \), Y, \({\mathcal {J}}\), J, and \(\nu \) satisfy Assumptions 1.1 to 1.6. For every \(U \in \varSigma \), there exists \(g_U \in L^1(\varOmega , \mu )\) such that
$$\begin{aligned} {\mathcal {J}}'(U)(D) = \int _D g_U \,\mathrm {d}\mu \quad \forall D \in \varSigma . \end{aligned}$$
Proof
Let \(U \in \varSigma \) be given. According to Theorem 1, \({\mathcal {J}}'(U)\) is a finite signed measure with \({\mathcal {J}}'(U) \ll \mu \). Because \(\mu \) is also finite, Lemma 1 proves the existence of a \(g_U \in L^1(\varOmega )\) with the stated properties. \(\square \)
There are different ways to calculate \(g_U\). Section 3.3 derives \(g_U\) for two exemplary ODE- and PDE-constrained problems. The approximation (3) is only accurate in a small neighborhood of U. To obtain estimates for the error accumulated over larger distances, we need to cover a connecting line with such neighborhoods and use a compactness argument to extract a finite subcover. We can then choose a finite number of support points such that first-order Taylor approximations are accurate when traveling from one support point to the next, as illustrated in Fig. 1.
Sets do not have an exact equivalent of convex combinations or “connecting lines”. We could construct a similar argument by using geodesics. In the interest of brevity, however, we assume instead that the derivative of \(J :Y \rightarrow {\mathbb {R}}\) satisfies a Lipschitz condition. We can then make the argument in the underlying vector space Y.
Lemma 3
Let \(\varOmega \), \(\varSigma \), Y, J, \(\mu \), \(\nu \), and \({\mathcal {J}}\) satisfy Assumptions 1.1 to 1.6. Furthermore, let \(C > 0\) be as specified in Assumption 1.6, and let the Fréchet derivative \(J'\) of J be such that there exists a constant \(L > 0\) with
$$\begin{aligned} {\bigl \Vert J'(x) - J'(y) \bigr \Vert }_{Y^*} \le L \cdot \Vert x - y \Vert _Y \quad \forall x, y \in {\text {conv}}\bigl (\nu (\varSigma )\bigr ) \subseteq Y. \end{aligned}$$
(5)
Then we have
$$\begin{aligned} \bigl | {\mathcal {J}}(U \mathbin {\triangle }D) - {\mathcal {J}}(U) - {\mathcal {J}}'(U)(D) \bigr | \le \frac{L C^2}{2} {\mu (D)}^2 \qquad \forall U, D \in \varSigma . \end{aligned}$$
(6)
Proof
Let \(U, D \in \varSigma \) be given and let \(x :=\nu (U)\) and \(d :=\nu (D {\setminus } U) - \nu (D \cap U)\). We have
$$\begin{aligned} x + d = \nu (U) + \nu (D {\setminus } U) - \nu (D \cap U) = \nu \bigl ( (U \cup D) {\setminus } (U \cap D) \bigr ) = \nu ( U \mathbin {\triangle }D). \end{aligned}$$
According to (2) in Theorem 1, \({\mathcal {J}}'(U)(D)\) is given by
$$\begin{aligned} {\mathcal {J}}'(U)(D) = J'(x)\bigl ( \nu (D {\setminus } U) - \nu (D \cap U) \bigr ) = J'(x) d. \end{aligned}$$
We further find that \({\Vert d \Vert }_Y \le {\bigl \Vert \nu (D {\setminus } U) \bigr \Vert }_Y + {\bigl \Vert \nu (D \cap U) \bigr \Vert }_Y \le C \cdot \mu (D)\) according to Assumption 1.6. Therefore, we have
$$\begin{aligned} \bigl | {\mathcal {J}}(U \mathbin {\triangle }D) - {\mathcal {J}}(U) - {\mathcal {J}}'(U)(D) \bigr |&= \Bigl | \int _0^1 \bigl ( J'(x + \lambda d) - J'(x) \bigr ) d \,\mathrm {d}\lambda \Bigr | \\&\le \int _0^1 \bigl \Vert J'(x + \lambda d) - J'(x) \bigr \Vert _{Y^*} \Vert d\Vert _Y \,\mathrm {d}\lambda \\&\le \Vert d \Vert _Y \cdot \int _0^1 L \cdot {\Vert x + \lambda d - x \Vert }_Y \,\mathrm {d}\lambda \\&= \frac{L}{2} \underbrace{{\Vert d \Vert }_Y^2}_{\le C^2 {\mu (D)}^2}, \end{aligned}$$
which proves (6). \(\square \)
We note that a geodesic-based argument would not require a Lipschitz condition over \({\text {conv}}\bigl ( \nu (\varSigma ) \bigr )\), but only over \(\nu (\varSigma )\), because support points can be chosen exclusively from \(\nu (\varSigma )\).
Optimality criterion
Our method works towards achieving a necessary optimality criterion based on stationarity. We have shown in Corollary 1 that the derivative measure has an integrable density function \(g_U\). The integral of \(g_U\) over a step set D approximates the change in objective for small steps. Therefore, if
$$\begin{aligned} g_U(x) \ge 0 \qquad \text {a.e. in}\ \varOmega , \end{aligned}$$
(7)
then all sufficiently small steps are predicted to either maintain or increase the objective. The following elementary lemma shows that negative density values on non-nullsets always translate into descent steps.
Lemma 4
Let \((\varOmega , \varSigma , \mu )\) be a finite atomless measure space, \(g \in L^1(\varOmega , \mu )\), and \(\lambda \in [0,1]\). Then there exists \(D \in \varSigma \) such that \(\mu (D) = \lambda \cdot \mu (\varOmega )\) and
$$\begin{aligned} \int _D g \,\mathrm {d}\mu \le \lambda \cdot \int _\varOmega g \,\mathrm {d}\mu . \end{aligned}$$
Proof
For \(\lambda = 0\), we can choose \(D :=\emptyset \). Similarly, for \(\lambda = 1\), we can choose \(D :=\varOmega \). Therefore, we consider only \(\lambda \in (0,1)\) here. Because g is \(\mu \)-integrable, we have \(\mu ({\mathcal {L}}_{g \le \eta }) \xrightarrow {\eta \rightarrow -\infty } 0\) and \(\mu ({\mathcal {L}}_{g \le \eta }) \xrightarrow {\eta \rightarrow \infty } \mu (\varOmega )\). Therefore,
$$\begin{aligned} \eta ^*:=\inf \bigl \lbrace \eta \in {\mathbb {R}}\bigm \vert \mu ({\mathcal {L}}_{g \le \eta }) > \lambda \mu (\varOmega ) \bigr \rbrace \end{aligned}$$
is a finite real number.
Let \(({\check{\eta }}_i)_{i \in {\mathbb {N}}} \subset {\mathbb {R}}\) be an ascending sequence with \({\check{\eta }}_i < \eta ^*\ \forall i \in {\mathbb {N}}\) and \({\check{\eta }}_i \rightarrow \eta ^*\) for \(i \rightarrow \infty \). The corresponding sequence of sublevel sets \({\mathcal {L}}_{g \le {\check{\eta }}_i}\) is increasing with
$$\begin{aligned} D_1 :={\mathcal {L}}_{g < \eta ^*} = \bigcup _{i = 1}^\infty {\mathcal {L}}_{g \le {\check{\eta }}_i}. \end{aligned}$$
Since \(\mu ({\mathcal {L}}_{g \le {\check{\eta }}_i}) \le \lambda \cdot \mu (\varOmega )\) for all \(i \in {\mathbb {N}}\), we have \(\mu (D_1) \le \lambda \cdot \mu (\varOmega )\).
Conversely, let \(({\hat{\eta }}_i)_{i \in {\mathbb {N}}} \subset {\mathbb {R}}\) be a descending sequence with \({\hat{\eta }}_i > \eta ^*\ \forall i \in {\mathbb {N}}\) and \({\hat{\eta }}_i \rightarrow \eta ^*\) for \(i \rightarrow \infty \). The corresponding sequence of sublevel sets \({\mathcal {L}}_{g \le {\hat{\eta }}_i}\) is decreasing with
$$\begin{aligned} {\bar{D}}_2 :={\mathcal {L}}_{g \le \eta ^*} = \bigcap _{i = 1}^\infty {\mathcal {L}}_{g \le {\hat{\eta }}_i}. \end{aligned}$$
Here, we find that \(\mu ({\bar{D}}_2) \ge \lambda \cdot \mu (\varOmega )\).
Because \((\varOmega , \varSigma , \mu )\) is atomless, we can choose a measurable set \(D_2 \subset {\bar{D}}_2 {\setminus }D_1\) such that \(\mu (D_2) = \lambda \cdot \mu (\varOmega ) - \mu (D_1)\). We define \(D :=D_1 \cup D_2\) and obtain \(\mu (D) = \lambda \cdot \mu (\varOmega )\) and \(D \subseteq {\mathcal {L}}_{g \le \eta ^*}\). We therefore have
$$\begin{aligned} \int _D g \,\mathrm {d}\mu \le \eta ^*\cdot \lambda \cdot \mu (\varOmega ). \end{aligned}$$
If we were to assume that \(\int _D g \,\mathrm {d}\mu > \lambda \cdot \int _\varOmega g \,\mathrm {d}\mu \), it would imply that
$$\begin{aligned} \eta ^*> \frac{1}{\mu (\varOmega )} \int _\varOmega g \,\mathrm {d}\mu . \end{aligned}$$
We could then conclude that
$$\begin{aligned} \int _\varOmega g \,\mathrm {d}\mu&= \int _D g \,\mathrm {d}\mu + \int _{\varOmega {\setminus } D} \underbrace{g}_{{\ge \eta ^*}} \,\mathrm {d}\mu \\&> \underbrace{\lambda }_{= \frac{\mu (D)}{\mu (\varOmega )}} \cdot \int _\varOmega g \,\mathrm {d}\mu + \eta ^*\cdot \bigl ( \mu (\varOmega ) - \mu (D) \bigr ) \\&> \frac{\mu (D)}{\mu (\varOmega )} \cdot \int _\varOmega g \,\mathrm {d}\mu + \frac{\mu (\varOmega ) - \mu (D)}{\mu (\varOmega )} \cdot \int _\varOmega g \,\mathrm {d}\mu \\&= \int _\varOmega g \,\mathrm {d}\mu , \end{aligned}$$
which proves by contradiction that
$$\begin{aligned} \int _D g \,\mathrm {d}\mu \le \lambda \cdot \int _\varOmega g \,\mathrm {d}\mu \end{aligned}$$
and therefore that the average value of g on D does not exceed its overall average value. \(\square \)
It is useful to think of \(\lambda = \frac{\mu (D)}{\mu (\varOmega )}\) and \(g = g_U\) for some \(U \in \varSigma \). We then find a step \(D \in \varSigma \) such that
$$\begin{aligned} \frac{1}{\mu (D)} \int _D g_U \,\mathrm {d}\mu \le \frac{1}{\mu (\varOmega )} \int _\varOmega g_U \,\mathrm {d}\mu . \end{aligned}$$
For any measure less than or equal to \(\mu (\varOmega )\), we can therefore choose a \(\mu \)-measurable subset \(D \subseteq \varOmega \) with that exact measure such that the predicted decrease in objective for the step D is no worse than the average over all of \(\varOmega \). This does not imply that the predicted change is negative. We can ensure a decrease by applying Lemma 4 to the finite atomless measure space \(\bigl (D_0,\, \varSigma \cap D_0,\, \mu \vert _{\varSigma \cap D_0}\bigr )\) for \(D_0 :={\mathcal {L}}_{g_U < 0} \in \varSigma \). There then exists a step D of given size that captures at least a corresponding fraction of the maximal achievable predicted decrease. We can use this to prove that (7) is a necessary criterion for local optimality.
Lemma 5
Let \(\varOmega \), \(\varSigma \), Y, J, \(\mu \), \(\nu \), and \({\mathcal {J}}\) satisfy Assumptions 1.1 to 1.7. Let \(U \in \varSigma \) be a locally optimal solution such that there exists \(R > 0\) with
$$\begin{aligned} {\mathcal {J}}(U \mathbin {\triangle }D) \ge {\mathcal {J}}(U) \quad \forall D \in \varSigma :\mu (D) \le R, \end{aligned}$$
and let \(g_U \in L^1(\varOmega , \mu )\) denote the \(\mu \)-integrable density function of \({\mathcal {J}}'(U)\). Then
$$\begin{aligned} g_U(x) \ge 0 \quad \mu \text {-a.e. in }\varOmega . \end{aligned}$$
Proof
We assume the contrary, i.e., that there exists \(D_0 \in \varSigma \) with \(\mu (D_0) > 0\) and \(g_U(x) < 0\) for all \(x \in D_0\). We have
$$\begin{aligned} \delta :=\frac{1}{\mu (D_0)} \int _{D_0} g_U \,\mathrm {d}\mu < 0. \end{aligned}$$
From Theorem 1, there exists \({\bar{R}} > 0\) such that
$$\begin{aligned} \bigl | {\mathcal {J}}(U \mathbin {\triangle }D) - {\mathcal {J}}(U) - {\mathcal {J}}'(U)(D) \bigr | \le -\frac{\delta }{2} \mu (D) \quad \forall D \in \varSigma :\mu (D) \le {\bar{R}}. \end{aligned}$$
Let \(R' :=\min \lbrace R,\, {\bar{R}},\, \mu (D_0) \rbrace > 0\). This implies that \(\lambda :=\frac{R'}{\mu (D_0)} \in (0, 1]\). According to Lemma 4, there exists \(D \in \varSigma \) with \(D \subseteq D_0\), \(\mu (D) = R' \le R\), and
$$\begin{aligned} \int _D g_U \,\mathrm {d}\mu \le \frac{\mu (D)}{\mu (D_0)} \cdot \int _{D_0} g_U \,\mathrm {d}\mu = \delta \cdot \mu (D), \end{aligned}$$
which implies that
$$\begin{aligned} {\mathcal {J}}(U \mathbin {\triangle }D) \le {\mathcal {J}}(U) + \int _{D} g_U \,\mathrm {d}\mu - \frac{\delta }{2} \mu (D) \le {\mathcal {J}}(U) + \underbrace{\frac{\delta }{2} \mu (D)}_{< 0}. \end{aligned}$$
This would contradict \({\mathcal {J}}(U \mathbin {\triangle }D) \ge {\mathcal {J}}(U)\) for all \(D \in \varSigma \) with \(\mu (D) \le R\). Therefore, no such \(D_0\) can exist. \(\square \)
Because \(g_U\) is the density function of the gradient measure, (7) is essentially a stationarity condition. We subsequently refer to points that satisfy (7) as stationary points. Similarly, we refer to \(U \in \varSigma \) as \(\varepsilon \)-stationary for given \(\varepsilon > 0\) if and only if
$$\begin{aligned} \int _\varOmega \Bigl | \min \bigl \lbrace 0, g_U(x) \bigr \rbrace \Bigr | \,\mathrm {d}\mu \le \varepsilon . \end{aligned}$$
We refer to the integral on the left hand side as the instationarity of U. One cannot always find solutions that satisfy the necessary optimality criterion (7). For instance, if the vector measure \(\nu \) maps a set \(U \in \varSigma \) to its characteristic function \(\nu (U) = \chi _U \in L^1(\varOmega )\), then the image \(\nu (\varSigma )\) is not closed, and accumulation points of sequences may not themselves be characteristic functions of measurable sets.
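For illustration, the instationarity integral is straightforward to evaluate once an approximation of \(g_U\) is available on a cell-based discretization. The following Python sketch computes \(\int _\varOmega | \min \lbrace 0, g_U \rbrace | \,\mathrm {d}\mu \) for a piecewise constant density and performs the corresponding \(\varepsilon \)-stationarity test; the function names and the array-based interface are our own assumptions and not part of the method description.

```python
import numpy as np

def instationarity(g_vals, cell_measures):
    """Approximate the instationarity integral int_Omega |min{0, g_U}| dmu for a
    piecewise constant density taking the value g_vals[j] on a cell of measure
    cell_measures[j]."""
    return float(np.sum(np.abs(np.minimum(0.0, g_vals)) * cell_measures))

def is_epsilon_stationary(g_vals, cell_measures, eps):
    """epsilon-stationarity test corresponding to the integral criterion above."""
    return instationarity(g_vals, cell_measures) <= eps

# Hypothetical example with three cells of equal measure:
print(is_epsilon_stationary(np.array([-0.2, 0.1, 0.4]), np.ones(3), eps=0.25))
```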
A weak guarantee can be given for the degree of suboptimality of an \(\varepsilon \)-stationary point. If we make an assumption of limited curvature in the sense of Lemma 3, then the following estimate holds for every \(D \in \varSigma \).
$$\begin{aligned} {\mathcal {J}}(U \mathbin {\triangle }D)&\ge {\mathcal {J}}(U) + {\mathcal {J}}'(U)(D) - \frac{L C^2}{2} {\mu (D)}^2 \\&= {\mathcal {J}}(U) + \int _D g_U \mathrm {d}\mu - \frac{L C^2}{2} {\mu (D)}^2 \\&\ge {\mathcal {J}}(U) - \varepsilon - \frac{L C^2}{2} {\mu (D)}^2 \end{aligned}$$
Here, \(L > 0\) is the Lipschitz constant associated with changes in the Fréchet derivative of J, and \(C > 0\) is the constant from Assumption 1.6. This implies that within a given radius of \(R > 0\), U is suboptimal by at most \(\varepsilon + \frac{L C^2}{2} R^2\).
It is further possible to infer that the sequence of control functions \(\nu (U_i)\) forms a Cauchy sequence if the objective values \({\mathcal {J}}(U_i)\) approach the infimum of J over \({\text {conv}}\bigl (\nu (\varSigma )\bigr )\) and the underlying functional J is strictly convex.
Approximating gradient densities
In previous sections, we have shown that, under Assumption 1, there exists a gradient density function \(g_U :\varOmega \rightarrow {\mathbb {R}}\) for all control sets \(U \in \varSigma \) such that
$$\begin{aligned} {\mathcal {J}}(U \mathbin {\triangle }D) - {\mathcal {J}}(U) \approx \int _{D} g_U \,\mathrm {d}\mu . \end{aligned}$$
It may not be immediately clear how we would calculate a useful representation of \(g_U\). We will now briefly discuss two cases in which \(g_U\) can be determined relatively easily.
ODE-constrained optimal control
We first consider a case where Problem (1) is derived from an ODE-constrained optimal control problem of the form
$$\begin{aligned} \begin{aligned}&\min _{w, y}\, \int _0^{t_f} l(y(t), w(t)) \,\mathrm {d}t \\&\text {s.t.}\, {\dot{y}}(t) = f(y(t), w(t)) \quad \text {for a.a.}\ t \in (0, t_f) \\&\,\,\quad w(t) \in \lbrace 0, 1 \rbrace \qquad \qquad \,\,\quad \text {for a.a.}\ t \in (0, t_f) \\&\,\,\,\quad y(0) = y_0 \\&\quad \quad \quad w \in L^1([0, t_f])\\ \end{aligned} \end{aligned}$$
with constant \(t_f > 0\) and \(y_0 \in {\mathbb {R}}^n\). This includes the Lotka–Volterra problem from Sect. 1. We only discuss autonomous ODEs here, which serves primarily to unclutter notation. To guarantee that unique solutions and derivatives exist, we make some assumptions on l and f.
Assumption 2
Let \(t_f > 0\), \(D \subseteq {\mathbb {R}}^n\), \(l :D \times {\mathbb {R}}\rightarrow {\mathbb {R}}\), and \(f :D \times {\mathbb {R}}\rightarrow {\mathbb {R}}^n\) satisfy the following assumptions:
1. D is a convex open set with \(y_0 \in D\),
2. f and l are twice continuously differentiable w.r.t. (y, w) on \(D \times {\mathbb {R}}\),
3. f and l are affine linear in w, i.e., \(f(y, w) = f(y, 0) + w \cdot \left( f(y, 1) - f(y, 0)\right) \) and \(l(y, w) = l(y, 0) + w \cdot \left( l(y, 1) - l(y, 0)\right) \),
4. there exists a constant \(L > 0\) such that \(\Vert f(x, v) - f(y, w) \Vert \le L \cdot \Vert x - y \Vert + L \cdot |v - w|\) for all \(x, y \in D\) and \(v, w \in {\mathbb {R}}\),
5. there exists \(\varepsilon > 0\) such that for all \(\alpha \in L^1([0, t_f])\) with \(\alpha (t) \in [0, 1]\) almost everywhere and every \(\tau \in (0, t_f]\), every absolutely continuous function \(y :[0, \tau ] \rightarrow D\) with \(y(0) = y_0\) and \({\dot{y}}(t) = f(y(t), \alpha (t))\) almost everywhere satisfies \({\text {dist}}(y(t), \partial D) \ge \varepsilon \) for all \(t \in [0, \tau ]\).
It may be possible to further relax these assumptions. However, this formulation is sufficiently general to apply to the Lotka–Volterra test problem, which we discuss in greater depth in Sect. 4.2. While demanding affine linearity in w may seem restrictive, it is always achievable for binary-valued controls by partial outer convexification [30, 31].
We begin with a stability result that allows us to extend the subsequent existence result to control functions whose values lie outside of [0, 1].
Lemma 6
Let \(t_f > 0\), D, f satisfy Assumption 2, let \(L > 0\) denote the constant from Assumption 2.4, and let \(\tau \in (0, t_f]\). Let \(v, w \in L^1([0, t_f])\), and let \(y_v :[0, \tau ) \rightarrow D\) and \(y_w :[0, \tau ) \rightarrow D\) be absolutely continuous such that
$$\begin{aligned} \begin{array}{ll} \,y_v(0) = y_0,&{} \\ \,\,{\dot{y}}_v(t) = f(y_v(t), v(t)) &{}\qquad \text {for a.a.}\ t \in [0, \tau ), \\ y_w(0) = y_0, &{}\\ \,{\dot{y}}_w(t) = f(y_w(t), w(t)) &{}\qquad \text {for a.a.}\ t \in [0, \tau ). \end{array} \end{aligned}$$
Then we have
$$\begin{aligned} \Vert y_v(t) - y_w(t) \Vert \le L \cdot e^{L t} \cdot {\Vert v - w\Vert }_{L^1} \qquad \forall t \in [0, \tau ). \end{aligned}$$
Proof
By using the Lipschitz condition in Assumption 2.4, we find that
$$\begin{aligned} \Vert y_v(t) - y_w(t) \Vert&\le \int _0^t \Vert f(y_v, v) - f(y_w, w) \Vert \,\mathrm {d}s \\&\le L \cdot \int _0^t | v - w | \,\mathrm {d}s + \int _0^t L \cdot \Vert y_v - y_w \Vert \,\mathrm {d}s \end{aligned}$$
for all \(t \in [0, \tau )\). According to Gronwall’s inequality, it follows that
$$\begin{aligned} \Vert y_v(t) - y_w(t) \Vert&\le L \cdot \int _0^t |v - w| \,\mathrm {d}s + \int _0^t L^2 e^{L (t - s)} \cdot \int _0^s |v - w| \,\mathrm {d}r \,\mathrm {d}s \\&\le L \cdot \Vert v - w \Vert _{L^1} + \int _0^t L^2 e^{L (t - s)} \cdot \Vert v - w \Vert _{L^1} \,\mathrm {d}s \\&= L \cdot \Bigl ( 1 + \underbrace{\int _0^t L e^{L (t - s)} \,\mathrm {d}s}_{= e^{L t} - 1} \Bigr ) \cdot \Vert v - w\Vert _{L^1} \\&= L \cdot e^{L t} \cdot \Vert v - w \Vert _{L^1} \end{aligned}$$
for all \(t \in [0, \tau )\). \(\square \)
Lemma 6 shows that the solution to the initial value problem exists not only for control functions with values in [0, 1], but also in a small \(L^1\) environment around them.
Lemma 7
Let \(t_f > 0\), \(D \subseteq {\mathbb {R}}^n\), and f satisfy Assumption 2, and let \(\varepsilon > 0\) denote the constant from Assumption 2.5. Then there exists \(\delta > 0\) such that for every \(w \in L^1([0, t_f])\) with \(w(t) \in [0, 1]\) almost everywhere and every \(v \in B_{\delta }(w)\), there exists a unique absolutely continuous function \(y_v :[0, t_f] \rightarrow D\) such that \(y_v(0) = y_0\), \({\text {dist}}(y_v(t), \partial D) \ge \frac{\varepsilon }{2}\) for all \(t \in [0, t_f]\), and
$$\begin{aligned} {\dot{y}}_v(t) = f\left( y_v(t), v(t)\right) \qquad \text {for a.a.}\ t \in [0, t_f]. \end{aligned}$$
Proof
We use the generalized existence theory based on the Carathéodory condition as described in [17, Sec. I.5]. Assumptions 2.1 to 2.4 imply the Carathéodory condition for all \(v \in L^1([0, t_f])\). This implies the existence of unique absolutely continuous local solutions \(y_v :[0, \tau ) \rightarrow D\) with \(y_v(0) = y_0\). We can extend these local solutions to their maximal existence interval.
We first consider the case \(v = w\), where \(w(t) \in [0, 1]\) almost everywhere is guaranteed. If the maximal existence interval were to end at \(\tau \le t_f\), then the continuous extension of \(y_w\) would satisfy \(y_w(t) \rightarrow \partial D\) for \(t \rightarrow \tau \), which would contradict Assumption 2.5. We can therefore extend \(y_w\) to \([0, t_f]\) and guarantee that \(y_w(t) \in D\) for all \(t \in [0, t_f]\).
Let \(\varepsilon > 0\) denote the constant given in Assumption 2.5. We can then use Lemma 6 to extend this argument to all \(v \in L^1([0, t_f])\) with
$$\begin{aligned} {\Vert v - w\Vert }_{L^1} \le \underbrace{\frac{\varepsilon }{2 L \cdot e^{L t_f}}}_{=:\delta }, \end{aligned}$$
as this yields \({\text {dist}}(y_v(t), \partial D) \ge \frac{\varepsilon }{2}\) for all \(t \in [0, t_f]\). \(\square \)
Lemma 7 ensures that there exists a small neighborhood around each admissible control function in which the ODE solution is well-defined. This is important for the definition of derivatives, which are defined using changes over infinitesimally small neighborhoods.
As our measure space \((\varOmega , \varSigma )\), we choose \([0, t_f]\) with the Lebesgue \(\sigma \)-algebra. The Banach space Y is the space \(L^1(\varOmega )\) and \(\nu \) maps \(U \mapsto \chi _U\) where \(\chi _U\) denotes the characteristic function of U. The objective \(J :Y \rightarrow {\mathbb {R}}\) is given by
$$\begin{aligned} J(w) :=\int _0^{t_f} l(y_w(t), w(t)) \,\mathrm {d}t. \end{aligned}$$
The function \({\mathcal {J}}\) is then given by
$$\begin{aligned} {\mathcal {J}}(U) = J(\nu (U)) = \int _0^{t_f} l(y_{\chi _U}(t), \chi _U(t)) \,\mathrm {d}t \qquad \forall U \in \varSigma . \end{aligned}$$
For the measure \(\mu \), we can choose either the Lebesgue measure or an equivalent measure. In either case, there is a density function \(m \in L^1([0, t_f]) \cap L^\infty ([0, t_f])\) with \(m > 0\) almost everywhere such that
$$\begin{aligned} \mu (A) = \int _A m(t) \,\mathrm {d}t \qquad \forall A \in \varSigma . \end{aligned}$$
To show that these choices satisfy Assumption 1, we have to prove that \(J :L^1([0, t_f]) \rightarrow {\mathbb {R}}\) is Fréchet differentiable around every \(w \in L^1([0, t_f])\) with \(w(t) \in [0, 1]\) almost everywhere. We first show that the ODE solutions corresponding to control functions in the \(\delta \)-environment established in Lemma 7 are uniformly bounded. This allows us to consider f only on a compact convex subset \(U \subseteq {\mathbb {R}}^n\) where the norms of the second derivatives of f and l can be bounded. The conjunction of Lemma 6 with the norm bounds on the second derivatives and Assumption 2.3 allows us to control residual terms that appear when deriving the Fréchet derivative of J from the Lagrangian function of the original problem.
Lemma 8
Let \(t_f > 0\), D, f satisfy Assumption 2, and let \(L > 0\) denote the constant from Assumption 2.4. Let \(w \in L^1([0, t_f])\) be such that there exists a unique absolutely continuous function \(y_w :[0, t_f] \rightarrow D\) with \(y_w(0) = y_0\) and \({\dot{y}}_w(t) = f(y_w(t), w(t))\) almost everywhere. Then we have
$$\begin{aligned} \Vert y_w(t) - y_0 \Vert \le \bigl ( \Vert f(y_0, 0)\Vert \cdot t_f + L \cdot \Vert w\Vert _{L^1} \bigr ) \cdot e^{L t_f} \qquad \forall t \in [0, t_f]. \end{aligned}$$
Let \(\delta > 0\) be the constant derived in Lemma 7. There exists a convex compact set \(U \subset D\) such that \(y_v(t) \in U\) for almost all \(t \in [0, t_f]\) whenever \(v \in L^1([0, t_f])\) satisfies \(\Vert v - w\Vert _{L^1} \le \delta \) for some \(w \in L^1([0, t_f])\) with \(w(t) \in [0, 1]\) for almost all \(t \in [0, t_f]\).
Proof
Assumption 2.4 guarantees for all \(t \in [0, t_f]\) that
$$\begin{aligned} \Vert y_w(t) - y_0\Vert \le \Bigl \Vert \int _0^t f(y_w, w) \,\mathrm {d}s \Bigr \Vert \le \int _0^t \underbrace{\Vert f(y_0, w)\Vert }_{{\le \Vert f(y_0, 0)\Vert + L \cdot |w|}} \,\mathrm {d}s + \int _0^t L \Vert y_w - y_0\Vert \,\mathrm {d}s. \end{aligned}$$
Let
$$\begin{aligned} \alpha (t) :=t \cdot \Vert f(y_0, 0) \Vert + L \cdot \int _0^t |w(s)| \,\mathrm {d}s. \end{aligned}$$
We note that \(\alpha \) is increasing in t. By applying Gronwall’s inequality, we obtain the estimate
$$\begin{aligned} \Vert y_w(t) - y_0\Vert&\le \alpha (t) + \int _0^t \alpha (s) \cdot L \cdot e^{L \cdot (t - s)} \,\mathrm {d}s \\&\le \alpha (t) \cdot \Bigl ( 1 + \int _0^t L \cdot e^{L \cdot (t - s)} \,\mathrm {d}s \Bigr ) \\&= \alpha (t) \cdot e^{L t} \\&\le \alpha (t_f) \cdot e^{L t_f} \\&= \bigl ( \Vert f(y_0, 0)\Vert \cdot t_f + L \cdot \Vert w \Vert _{L^1} \bigr ) \cdot e^{L t_f} \end{aligned}$$
for all \(t \in [0, t_f]\). Let \(\varepsilon > 0\) denote the constant from Assumption 2.5. For all \(w \in L^1([0, t_f])\) with \(w(t) \in [0, 1]\) almost everywhere, we have \(\Vert w\Vert _{L^1} \le t_f\). For all \(v \in B_{\delta }(w)\), we have \(\Vert v\Vert _{L^1} \le t_f + \delta \). For such v, we have a unique ODE solution \(y_v\) that satisfies
$$\begin{aligned} \Vert y_v(t) - y_0\Vert \le \bigl ( \Vert f(y_0, 0)\Vert \cdot t_f + L \cdot (t_f + \delta ) \bigr ) \cdot e^{L t_f} =:R \qquad \forall t \in [0, t_f]. \end{aligned}$$
We then have
$$\begin{aligned} y_v(t) \in U :=\left\{ y \in D \,\big |\, {\text {dist}}(y, \partial D) \ge \frac{\varepsilon }{2} \right\} \cap \overline{B_R(y_0)} \qquad \forall t \in [0, t_f]. \end{aligned}$$
We note that U is the intersection of two convex closed sets and is therefore convex and closed. Since U is also bounded, it is compact. \(\square \)
Since f and l are twice continuously differentiable with respect to (y, w), this implies that their first and second derivatives are also bounded on U.
We briefly note that the uniform bound of all first and second derivatives of f and l also implies that J satisfies the curvature condition required by Lemma 3. This implies the suboptimality bound described at the end of Sect. 3.2.
Let \(C :=L \cdot (1 + e^{L t_f})\); by Lemma 6, this constant satisfies \(\Vert y_v(t) - y_w(t) \Vert \le C \cdot \Vert v - w\Vert _{L^1}\) for all \(t \in [0, t_f]\) whenever both solutions exist. Let \(\lambda _v :[0, t_f] \rightarrow {\mathbb {R}}^n\) be the unique, absolutely continuous solution of the costate equations
$$\begin{aligned} {\dot{\lambda }}_v= & {} -l_y(y_v, v) - \lambda _v^T f_y(y_v, v) \qquad \text {for a.a.}\ t \in [0, t_f], \\ \lambda _v(t_f)= & {} 0. \end{aligned}$$
With the costate function \(\lambda \), we can define the Lagrangian function
$$\begin{aligned} \varLambda (y, w, \lambda )&= \int _0^{t_f} l(y, w) + \lambda ^T (f(y, w) - {\dot{y}}) \,\mathrm {d}t \\&= \lambda (0)^T y(0) - \lambda (t_f)^T y(t_f) + \int _0^{t_f} l(y, w) + \lambda ^T f(y, w) + {\dot{\lambda }}^T y \,\mathrm {d}t. \end{aligned}$$
We use integration by parts for the second reformulation. We note that we have \(\varLambda (y_w, w, \lambda ) = J(w)\) for all \(\lambda \) and all w. Therefore, the change in objective can be written as \(J(w) - J(v) = \varLambda (y_w, w, \lambda _w) - \varLambda (y_v, v, \lambda _v)\).
Theorem 2
Let \(t_f > 0\), \(D \subseteq {\mathbb {R}}^n\), l, f satisfy Assumption 2, and let \(L > 0\) denote the constant in Assumption 2.4. Let \(v \in L^1([0, t_f])\) with \(v(t) \in [0, 1]\) almost everywhere. Then the objective function \(J :L^1([0, t_f]) \rightarrow {\mathbb {R}}\) with
$$\begin{aligned} J(w) :=\int _0^{t_f} l(y_w(t), w(t)) \,\mathrm {d}t \end{aligned}$$
is Fréchet differentiable in v and its derivative is given by
$$\begin{aligned} DJ(v) d :=\int _0^{t_f} \left( l_w(y_v, v) + \lambda _v^T f_w(y_v, v) \right) \cdot d \,\mathrm {d}t \qquad \forall d \in L^1([0, t_f]) \end{aligned}$$
where \(\lambda _v\) denotes the costate function associated with the control function v and its initial value problem solution \(y_v\).
Proof
We make use of the derivative expression of the costate equation. We then perform a truncated Taylor expansion with integral residual expressions to the objective, which yields
$$\begin{aligned}&J(w) - J(v) \\&\quad = \varLambda (y_w, w, \lambda _v) - \varLambda (y_v, v, \lambda _v) \\&\quad = \int _0^{t_f} l(y_w, w) - l(y_v, v) + \lambda _v^T (f(y_w, w) - f(y_v, v))\\&\qquad + {\dot{\lambda }}_v^T (y_w - y_v) \,\mathrm {d}t \\&\quad = \int _0^{t_f} \left( l_w(y_v, v) + \lambda _v^T f_w(y_v, v) \right) (w - v) \\&\qquad + \underbrace{\left( l_y(y_v, v) + \lambda _v^T f_y(y_v, v) + {\dot{\lambda }}_v^T \right) }_{= 0} (y_w - y_v) \,\mathrm {d}t \\&\qquad + \int _0^{t_f} \int _0^1 (1 - s) \varDelta y^T \nabla ^2_{xx} l(y_v + s \varDelta y, v + s \varDelta w) \varDelta y \,\mathrm {d}s \,\mathrm {d}t \\&\qquad + 2 \int _0^{t_f} \int _0^1 (1 - s) \varDelta y^T \nabla ^2_{xw} l(y_v + s \varDelta y, v + s \varDelta w) \varDelta w \,\mathrm {d}s \,\mathrm {d}t \\&\qquad + \int _0^{t_f} \int _0^1 (1 - s) \varDelta w^T \underbrace{\nabla ^2_{ww} l(y_v + s \varDelta y, v + s \varDelta w)}_{= 0} \varDelta w \,\mathrm {d}s \,\mathrm {d}t \\&\qquad + \sum _{i = 1}^n \int _0^{t_f} \int _0^1 (1 - s) \varDelta y^T \nabla ^2_{xx} f_i(y_v + s \varDelta y, v + s \varDelta w) \varDelta y \,\mathrm {d}s \,\mathrm {d}t \\&\qquad + 2 \sum _{i = 1}^n \int _0^{t_f} \int _0^1 (1 - s) \lambda _i \varDelta y^T \nabla ^2_{xw} f_i(y_v + s \varDelta y, v + s \varDelta w) \varDelta w \,\mathrm {d}s \,\mathrm {d}t \\&\qquad + \sum _{i = 1}^n \int _0^{t_f} \int _0^1 (1 - s) \lambda _i \varDelta w^T \underbrace{\nabla ^2_{ww} f_i(y_v + s \varDelta y, v + s \varDelta w)}_{= 0} \varDelta w \,\mathrm {d}s \,\mathrm {d}t. \end{aligned}$$
Here and in the following, we write \(\varDelta w :=(w - v)\) and \(\varDelta y :=(y_w - y_v)\) for the sake of brevity. We note that the first integral in the last step is equal to \(DJ(v) \varDelta w\). According to Lemma 8, there exists a compact convex set \(U \subseteq D\) such that \(y_w(t) \in U\) and \(y_v(t) \in U\) for all v, w that are either [0, 1]-valued or in a \(\delta \)-neighborhood of such a control function. Given that all convex combinations between \(y_v\) and \(y_w\) lie in the compact set U, there exists a constant \(L' > 0\) such that
$$\begin{aligned} \Vert \nabla ^2_{xx} l \Vert&\le L', \\ \Vert \nabla ^2_{xw} l \Vert&\le L', \\ \Vert \nabla ^2_{xx} f \Vert&\le L', \\ \Vert \nabla ^2_{xw} f \Vert&\le L'. \end{aligned}$$
Because \(\Vert \varDelta y \Vert \le C \cdot \Vert \varDelta w \Vert _{L^1}\) for all \(t \in [0, t_f]\), we have
$$\begin{aligned}&\left| J(w) - J(v) - DJ(v)(w - v) \right| \qquad \\&\quad \le \Bigl | \int _0^{t_f} \int _0^1 (1 - s) \varDelta y^T \nabla ^2_{xx} l(y_v + s \varDelta y, v + s \varDelta w) \varDelta y \,\mathrm {d}s \,\mathrm {d}t \Bigr | \\&\qquad + 2 \Bigl | \int _0^{t_f} \int _0^1 (1 - s) \varDelta y^T \nabla ^2_{xw} l(y_v + s \varDelta y, v + s \varDelta w) \varDelta w \,\mathrm {d}s \,\mathrm {d}t \Bigr | \\&\qquad + \sum _{i = 1}^n \Bigl | \int _0^{t_f} \int _0^1 (1 - s) \varDelta y^T \nabla ^2_{xx} f_i(y_v + s \varDelta y, v + s \varDelta w) \varDelta y \,\mathrm {d}s \,\mathrm {d}t \Bigr | \\&\qquad + 2 \sum _{i = 1}^n \Bigl | \int _0^{t_f} \int _0^1 (1 - s) \lambda _i \varDelta y^T \nabla ^2_{xw} f_i(y_v + s \varDelta y, v + s \varDelta w) \varDelta w \,\mathrm {d}s \,\mathrm {d}t \Bigr | \\&\quad \le (n + 1) L' C^2 t_f \cdot \Vert w - v \Vert _{L^1}^2 + 2 (n + 1) L' C \cdot \Vert w - v \Vert _{L^1} \cdot \int _0^{t_f} \frac{1}{2} \varDelta w \,\mathrm {d}t \\&\quad \le (n + 1) L' t_f \cdot (C^2 + C) \cdot \Vert w - v \Vert _{L^1}^2. \end{aligned}$$
From this, it follows that
$$\begin{aligned} \frac{1}{\Vert w - v\Vert _{L^1}} \left| J(w) - J(v) - DJ(v)(w - v) \right|&\le L' t_f (n + 1) (C^2 + C) \cdot \Vert w - v \Vert _{L^1} \\&\xrightarrow {\Vert w - v\Vert _{L^1} \rightarrow 0} 0. \end{aligned}$$
This shows that J is Fréchet differentiable with respect to \(L^1\) changes in the control function. \(\square \)
Let \(U \in \varSigma \) denote the current control set. The gradient measure \(\varphi _U\) in U can be derived from the bounded linear operator \(DJ(\nu (U)) = DJ(\chi _U)\) using Lemma 2. For a given step \(D \in \varSigma \), we have
$$\begin{aligned} \varphi _U(D)&= DJ(\chi _U) \left( \chi _{D {\setminus } U} - \chi _{D \cap U} \right) \\&= \int _0^{t_f} \left( l_w(y_{\chi _U}, \chi _U) + \lambda _{\chi _U}^T f_w(y_{\chi _U}, \chi _U) \right) \cdot \left( \chi _{D {\setminus } U} - \chi _{D \cap U} \right) \,\mathrm {d}t \\&= \int _D \left( l_w(y_{\chi _U}, \chi _U) + \lambda _{\chi _U}^T f_w(y_{\chi _U}, \chi _U) \right) \cdot \left( 1 - 2 \chi _U \right) \,\mathrm {d}t \\&= \int _D \frac{1 - 2 \chi _U}{m} \cdot \left( l_w(y_{\chi _U}, \chi _U) + \lambda _{\chi _U}^T f_w(y_{\chi _U}, \chi _U) \right) \,\mathrm {d}\mu \end{aligned}$$
where m is the density function of the measure \(\mu \) with respect to the Lebesgue measure. The density function of \(\varphi _U\) is then given by
$$\begin{aligned} g_U(t) :=\frac{1 - 2 \chi _U(t)}{m(t)} \cdot \left( l_w(y_{\chi _U}(t), \chi _U(t)) + \lambda _{\chi _U}(t)^T f_w(y_{\chi _U}(t), \chi _U(t)) \right) . \end{aligned}$$
We note the close connection of this expression to the Hamiltonian
$$\begin{aligned} {\mathcal {H}}(y, w, \lambda ) = l(y, w) + \lambda ^T f(y, w) \end{aligned}$$
and to the maximum principle for hybrid systems, e.g., [35]. The basic idea of the Competing Hamiltonian approach to mixed-integer optimal control, which was first described in [5, 6], can therefore be seen as a special case of our approach.
Since the process by which this density function is derived is very close to the adjoint differentiation scheme commonly used to extract derivatives from numerical integrators, it is possible to extract the density function from off-the-shelf integrators. We use this approach in Sect. 4.2 to extract the density function from the CVODES solver of the SUNDIALS suite.
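To make this construction concrete, the following self-contained Python sketch mimics the computation of \(g_U\) for a hypothetical scalar example with \(f(y, w) = -y + w\) and \(l(y, w) = y^2 + 0.1\,w\) (both affine in w, as required by Assumption 2.3), with \(\mu \) chosen as the Lebesgue measure. It uses a generic SciPy integrator rather than the CVODES-based adjoint extraction employed in Sect. 4.2, so it should be read as an illustration of the density formula above, not as the actual implementation.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical scalar test dynamics, affine in the control w (Assumption 2.3):
#   f(y, w) = -y + w,   l(y, w) = y**2 + 0.1 * w,   t_f = 2, y0 = 1.
t_f, y0 = 2.0, np.array([1.0])
f   = lambda y, w: -y + w
f_y = lambda y, w: np.array([[-1.0]])
f_w = lambda y, w: np.array([1.0])
l_y = lambda y, w: np.array([2.0 * y[0]])
l_w = lambda y, w: 0.1

def chi_U(t):
    """Characteristic function of the current control set U (here: U = [0, 1])."""
    return 1.0 if t <= 1.0 else 0.0

# Forward solve of the state equation with the bang-bang control chi_U.
state = solve_ivp(lambda t, y: f(y, chi_U(t)), (0.0, t_f), y0, dense_output=True)

# Backward solve of the costate equation lambda' = -l_y - f_y^T lambda, lambda(t_f) = 0.
def costate_rhs(t, lam):
    y, w = state.sol(t), chi_U(t)
    return -l_y(y, w) - f_y(y, w).T @ lam

costate = solve_ivp(costate_rhs, (t_f, 0.0), np.zeros(1), dense_output=True)

def g_U(t, m=lambda t: 1.0):
    """Gradient density g_U(t) = (1 - 2 chi_U(t)) / m(t) * (l_w + lambda^T f_w),
    here with m = 1, i.e. mu equal to the Lebesgue measure."""
    y, lam, w = state.sol(t), costate.sol(t), chi_U(t)
    return (1.0 - 2.0 * w) / m(t) * (l_w(y, w) + lam @ f_w(y, w))

print([float(g_U(t)) for t in (0.25, 0.75, 1.25, 1.75)])
```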
PDE-constrained optimization
Next, we consider the case where Problem (1) is derived from a PDE constrained optimization problem of the form
$$\begin{aligned} \min _{w, y}\quad&j(y, w) \\ \text {s.t.}\quad&f(y, w) = 0_Z \\&w(x) \in \lbrace 0, 1 \rbrace \quad \text {for a.a.}\ x \in \varOmega \\&w \in L^1(\varOmega ) \\&y \in X \end{aligned}$$
where \(\varOmega \subseteq {\mathbb {R}}^n\) denotes a bounded domain, X and Z are suitably chosen Banach spaces, \(j :X \times L^1(\varOmega ) \rightarrow {\mathbb {R}}\), and \(f :X \times L^1(\varOmega ) \rightarrow Z\). We make additional assumptions to ensure the existence and uniqueness of solutions.
Assumption 3
Let \(\lambda :\varSigma \rightarrow {\mathbb {R}}\) denote the Lebesgue measure. We assume that
1. \(j :X \times L^1(\varOmega ) \rightarrow {\mathbb {R}}\) is continuously Fréchet differentiable,
2. \(f :X \times L^1(\varOmega ) \rightarrow Z\) is continuously Fréchet differentiable,
3. \(f_y(y, w)\) has a bounded inverse for all \((y, w) \in X \times L^1(\varOmega )\),
4. for every \(w \in L^1(\varOmega )\), there exists \(y_w \in X\) such that \(f(y_w, w) = 0\).
We further assume that there is a sequence of partitions \(( \lbrace T^{(i)}_1, \dots , T^{(i)}_{N_i} \rbrace )_{i \in {\mathbb {N}}}\) such that
5. for \(i > 1, j \in [N_i]\), there exists \(j' \in [N_{i - 1}]\) with \(T^{(i)}_j \subseteq T^{{(i - 1)}}_{j'}\),
6. \(\lambda (T^{(i)}_j) > 0\) for all \(i \in {\mathbb {N}}, j \in [N_i]\),
7. \(\max \lbrace \lambda (T^{(i)}_j) \mid j \in [N_i] \rbrace \xrightarrow {i \rightarrow \infty } 0\),
8. for a.a. \(x \in \varOmega \), there exist \(j_i(x) \in [N_i]\) for all \(i \in {\mathbb {N}}\) such that \(x \in T^{(i)}_{j_i(x)}\),
9. there exists \(C > 0\) such that for every \(i \in {\mathbb {N}}, j \in [N_i]\), there is a ball \(B^{(i)}_j\) with \(T^{(i)}_j \subseteq B^{(i)}_j\) and \(\lambda (T^{(i)}_j) \ge C \lambda (B^{(i)}_j)\).
While Assumptions 3.5 to 3.9 are somewhat similar to the assumptions on “order-conserving domain dissections” stated in [26], we note that they differ in that they do not require the partition sequence to be order-conserving. Furthermore, [26] has no counterpart to Assumption 3.6, which we require because we divide by the measure of the cells. Therefore, these sets of assumptions should not be confused.
Under Assumptions 3.1 to 3.4, the implicit function theorem [20, Thm. 1.41] shows that the mapping \(w \mapsto y_w\) is continuously Fréchet differentiable with
$$\begin{aligned} D_w y_w = -f_y^{-1}(y_w, w) f_w(y_w, w). \end{aligned}$$
Our objective functional in this case is the reduced objective
$$\begin{aligned} J(w) :=j(y_w, w) \qquad \forall w \in L^1(\varOmega ) \end{aligned}$$
which is continuously Fréchet differentiable according to the chain rule and satisfies
$$\begin{aligned} DJ(w) = -j_y(y_w, w) f_y^{-1}(y_w, w) f_w(y_w, w) + j_w(y_w, w). \end{aligned}$$
We note that DJ(w) is a bounded linear form in \(L^1(\varOmega )\). Since \(L^\infty (\varOmega )\) is the dual space of \(L^1(\varOmega )\), there exists a function \(g_w \in L^\infty (\varOmega )\) such that
$$\begin{aligned} DJ(w) d = \int _\varOmega g_w d \,\mathrm {d}x. \end{aligned}$$
The vector measure \(\nu \) is once more given by \(\nu (U) :=\chi _U\), and the underlying measure space is \(\varOmega \) with the Lebesgue \(\sigma \)-algebra, equipped with either the Lebesgue measure or an equivalent measure with weight function \(m \in L^\infty (\varOmega )\). As in the ODE-constrained case in Sect. 3.3.1, the gradient measure is then given by
$$\begin{aligned} \varphi _U(D)&= DJ(\chi _U) ( \chi _{D {\setminus } U} - \chi _{D \cap U} ) \\&= \int _\varOmega g_w ( \chi _{D {\setminus } U} - \chi _{D \cap U} ) \,\mathrm {d}x \\&= \int _D g_w ( 1 - 2 \chi _U ) \,\mathrm {d}x \\&= \int _D \frac{1 - 2 \chi _U}{m} g_w \,\mathrm {d}\mu . \end{aligned}$$
Accordingly, the gradient density function \(g_U\) for a given control set \(U \in \varSigma \) is given by
$$\begin{aligned} g_U(x) = \frac{1 - 2 \chi _U(x)}{m(x)} g_w(x) \qquad \forall x \in \varOmega . \end{aligned}$$
To approximate the value of \(g_w\), we use the fact that \(\varOmega \) is bounded and therefore \(L^\infty (\varOmega ) \subset L^1(\varOmega )\). We consider the sequence of mesh partitions described in Assumption 3. The family of all mesh cells \(T^{(i)}_j\) contracts to nullsets according to Assumption 3.7, can be used to approximate almost all points in \(\varOmega \) according to Assumption 3.8, and is of bounded eccentricity according to Assumption 3.9. We can therefore apply the Lebesgue differentiation theorem to obtain
$$\begin{aligned} g_w(x) = \lim _{i \rightarrow \infty } \frac{1}{\lambda (T^{(i)}_{j_i(x)})} \int _{T^{(i)}_{j_i(x)}} g_w(\xi ) \,\mathrm {d}\xi \end{aligned}$$
almost everywhere. If we assume that for a given mesh index i, control functions on the i-th mesh are given by
$$\begin{aligned} w(x) = \sum _{j = 1}^{N_i} w^{(i)}_j \chi _{T^{(i)}_j}(x), \end{aligned}$$
then the derivative of the objective function with respect to the degree of freedom \(w^{(i)}_j\) is equal to
$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}w^{(i)}_j} J(w) = \int _{T^{(i)}_j} g_w(x) \,\mathrm {d}x. \end{aligned}$$
Therefore, if we start with a sufficiently fine mesh to express the desired control function w and maintain the same w for all higher mesh indices, which we can do because of Assumption 3.5, then we find that
$$\begin{aligned} g_w(x) = \lim _{i \rightarrow \infty } \frac{\frac{\mathrm {d}}{\mathrm {d}w^{(i)}_{j_i(x)}} J(w)}{\lambda (T^{(i)}_{j_i(x)})} \end{aligned}$$
which implies that
$$\begin{aligned} g_U(x) = \frac{1 - 2 \chi _U(x)}{m(x)} \cdot \lim _{i \rightarrow \infty } \frac{\frac{\mathrm {d}}{\mathrm {d}w^{(i)}_{j_i(x)}} J(w)}{\lambda (T^{(i)}_{j_i(x)})} \qquad \text {for a.a.}\ x \in \varOmega . \end{aligned}$$
Thus, the density function \(g_U\) can be approximated using the gradients of the objective function J with respect to the degrees of freedom (DOFs) of a piecewise constant control function on successively refined meshes. We note that this does not take into account discretization errors in function evaluation. Controlling such errors usually requires considerations much more specific to the discretization or problem in question.
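As a minimal illustration, and assuming that the per-cell objective gradients and cell volumes are already available from the discretized adjoint solve (the function and variable names below are hypothetical), the cell-wise approximation of \(g_U\) can be assembled as follows.

```python
import numpy as np

def cellwise_gradient_density(dJ_dw, cell_volumes, chi_U, m=None):
    """Cell-wise approximation of the gradient density
        g_U = (1 - 2 chi_U) / m * g_w,    g_w ~ (dJ/dw_j) / lambda(T_j),
    where dJ_dw[j] is the objective gradient w.r.t. the piecewise constant
    control DOF on cell j and cell_volumes[j] is its Lebesgue measure."""
    if m is None:
        m = np.ones_like(cell_volumes)   # mu chosen as the Lebesgue measure
    g_w = dJ_dw / cell_volumes           # per-cell Lebesgue differentiation quotient
    return (1.0 - 2.0 * chi_U) / m * g_w

# Hypothetical example with three mesh cells, the second one inside U:
print(cellwise_gradient_density(np.array([0.02, -0.05, 0.01]),
                                np.array([0.1, 0.2, 0.1]),
                                np.array([0.0, 1.0, 0.0])))
```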
We also note that the application of the bounded inverse \(f_y^{-1}(y_w, w)\) usually requires the solution of a PDE that involves the adjoint of a linearization of the original differential operator. This method of deriving the gradient is therefore sometimes known as the adjoint method.
We note that the curvature condition of Lemma 3 can be translated into a Lipschitz condition on the derivatives of f and j in this setting.
Algorithm
In trust-region terminology, we determine our step using the “affine linear” model function
$$\begin{aligned} \phi _U :D \mapsto {\mathcal {J}}(U) + {\mathcal {J}}'(U)(D). \end{aligned}$$
Accordingly, the trust-region subproblem in a given point \(U \in \varSigma \) is
$$\begin{aligned} \min _{D \in \varSigma } \phi _U(D)\ \text {s.t.}\ \mu (D) \le \varDelta , \end{aligned}$$
where \(\varDelta > 0\) is the trust-region radius. Given that \(U \in \varSigma \) is fixed for each instance of the trust-region subproblem, the term \({\mathcal {J}}(U)\) can be omitted and we can rewrite the subproblem as
$$\begin{aligned} \min _{D \in \varSigma } \int _D g_U \,\mathrm {d}\mu \ \text {s.t.}\ \mu (D) \le \varDelta . \end{aligned}$$
(8)
The proof of Lemma 4 suggests a method by which the subproblem can be solved. Because \(g_U\) is a \(\mu \)-measurable function, its sublevel and level sets are measurable, and we can select \(D \in \varSigma \) to encompass exactly those \(x \in \varOmega \) where \(g_U(x)\) assumes its smallest values. In the proof, we used the reference level
$$\begin{aligned} \eta ^*:=\inf \bigl \lbrace \eta \in {\mathbb {R}}\bigm \vert \mu ({\mathcal {L}}_{g_U \le \eta }) > \varDelta \bigr \rbrace . \end{aligned}$$
In practice, we need to approximate \(\eta ^*\) and \(g_U\). We state the solution procedure for (8) in a way that allows the use of an approximation of \(g_U\). On discrete meshes, it may also not be possible to choose D with the exact desired measure. Therefore, we allow for some deviation. The resulting algorithm is Procedure 1.
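For intuition, the following simplified Python sketch performs a cell-based variant of this selection: it greedily takes whole cells in order of ascending density values until the trust-region budget is reached. It replaces the bisection on \(\eta \) and the selection of \(D_2\) in Procedure 1 by a plain sort, and the interface (arrays of cell values and measures) is our own assumption, so it only illustrates the idea behind the subproblem solve.

```python
import numpy as np

def solve_subproblem(g_vals, cell_measures, delta_tr):
    """Greedy cell-based approximation of subproblem (8): select cells with
    the most negative density values until the trust-region budget delta_tr
    is exhausted (whole cells only, so mu(D) <= delta_tr is always kept)."""
    order = np.argsort(g_vals)                   # ascending: most negative first
    selected, used = [], 0.0
    for j in order:
        if g_vals[j] >= 0.0:                     # only negative values predict descent
            break
        if used + cell_measures[j] <= delta_tr:  # respect mu(D) <= Delta
            selected.append(j)
            used += cell_measures[j]
    predicted = float(np.sum(g_vals[selected] * cell_measures[selected]))
    return set(selected), predicted
```

On a fixed mesh, this greedy choice may leave part of the budget unused; Procedure 1 instead refines cells in the candidate set until the prescribed measure margins can be met.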
Line 15 in Procedure 1 selects a subset \(D_2 \subseteq {\mathcal {L}}_{g \le \eta _2} {\setminus } {\mathcal {L}}_{g \le \eta _1}\) whose measure lies in a given range. The existence of \(D_2\) is guaranteed by the atomlessness of the underlying measure space. The set \(D_2\) is used to ensure that the resulting step is close enough to the trust-region radius \(\varDelta \) to guarantee sufficient descent.
The way in which \(D_2\) is chosen is arbitrary and can be designed in a way that is suitable and convenient for the given problem implementation. Methods can range from selecting mesh cells in a pre-defined or random order to approximating knapsack solutions to achieve as large a step as possible. Given that the gradient function value is guaranteed to be between \(\eta _1\) and \(\eta _2\) at almost all points in \(D_2\), the approximate optimality of the step is guaranteed for all selections of \(D_2\).
A range of sizes is given to accommodate problem implementations where the boundaries of the step D need to run along mesh boundaries that cannot be moved arbitrarily. In such cases, mesh refinement may become necessary if it is impossible to find a set \(D_2\) of suitable size at the current mesh resolution. If the gradient density function is sufficiently well-approximated and is assumed to remain stable with respect to local mesh refinement, then the main goal of mesh refinement is to achieve sufficient mesh granularity in the candidate set \({\mathcal {L}}_{g \le \eta _2} {\setminus } {\mathcal {L}}_{g \le \eta _1}\) from which \(D_2\) is chosen. Therefore, it is sufficient to refine all cells in this set until sufficient granularity is achieved to select \(D_2\) within the given measure margins.
If the gradient function does change noticeably due to the refinement, then the candidate set may change on subsequent iterations. This is highly undesirable. However, the refinement loop will necessarily terminate as soon as every cell in the candidate set has a measure of less than \(\frac{\delta \varDelta }{-2 \eta _2}\) where \(-\eta _2\) is bounded from above due to the fact that
$$\begin{aligned} \eta _2 \cdot \varDelta \ge \int _\varOmega \min \lbrace 0, g \rbrace \mathrm {d}\mu . \end{aligned}$$
Therefore, this simplistic mesh refinement scheme will, in the worst case, terminate as soon as the entire mesh is refined to sufficient granularity.
The resulting step cannot be assumed to be optimal. As stated in Procedure 1, however, one can automatically determine the required levels of accuracy in order to obtain a solution with an arbitrarily small optimality gap. This claim is proven in Lemma 9.
Lemma 9
Let \((\varOmega , \varSigma , \mu )\) be an atomless measure space, let \(g \in L^1(\varOmega , \mu )\), let \(0 < \varDelta \le \mu (\varOmega )\), let \(\delta > 0\), and let D be the set returned by Procedure 1. Then D is \(\mu \)-measurable, satisfies \(\mu (D) \le \varDelta \), and is nearly a solution of Eq. (8) in the sense that
$$\begin{aligned} \int _D g \,\mathrm {d}\mu \le \int _{D'} g \,\mathrm {d}\mu + \delta \varDelta \quad \forall D' \in \varSigma :\mu (D') \le \varDelta . \end{aligned}$$
(9)
Proof
First, we consider the case that \(\mu \bigl ({\mathcal {L}}_{g < 0}\bigr ) \le \varDelta \). In this case, Procedure 1 returns in Line 2 with \(D = {\mathcal {L}}_{g < 0}\). Certainly, \(D \in \varSigma \) and \(\mu (D) \le \varDelta \). For every \(D' \in \varSigma \), we have
$$\begin{aligned} \int _{D'} g \,\mathrm {d}\mu = \int _D g \,\mathrm {d}\mu - \int _{D {\setminus } D'} \underbrace{g}_{< 0} \,\mathrm {d}\mu + \int _{D' {\setminus } D} \underbrace{g}_{\ge 0} \,\mathrm {d}\mu \ge \int _D g \,\mathrm {d}\mu . \end{aligned}$$
Next, we consider the case \(\mu \bigl ({\mathcal {L}}_{g < 0}\bigr ) > \varDelta \). In Lines 5 and 6, we find bounds \(\eta _1 < \eta _2 \le 0\) such that \(\mu \bigl ({\mathcal {L}}_{g \le \eta _1}\bigr ) \le \varDelta \) and \(\mu \bigl ({\mathcal {L}}_{g \le \eta _2}\bigr ) > \varDelta \). Such \(\eta _1, \eta _2\) exist because \(\mu \bigl ({\mathcal {L}}_{g < 0}\bigr ) > \varDelta \) and \(\mu \bigl ({\mathcal {L}}_{g \le \eta }\bigr ) \rightarrow 0\) for \(\eta \rightarrow -\infty \), which follows from the \(\mu \)-integrability of g. The bisection loop in Lines 7 to 12 maintains these properties while ensuring that \(\eta _2 - \eta _1 < \frac{\delta }{2}\).
We know that \(\mu (D_1) = \mu ({\mathcal {L}}_{g \le \eta _1}) \le \varDelta \). For \(\eta _2 = 0\), we have \(\mu (D_2) = 0\). Otherwise, the existence of a measurable subset \(D_2 \subseteq {\mathcal {L}}_{g \le \eta _2} {\setminus } D_1\) with
$$\begin{aligned} \mu (D_2) \in \left[ \varDelta - \mu (D_1) + \frac{\delta \varDelta }{2 \eta _2}, \varDelta - \mu (D_1)\right] \end{aligned}$$
is guaranteed by the atomlessness of \((\varOmega , \varSigma , \mu )\). In either case, \(D_2\) is guaranteed to be disjoint from \(D_1\). Let \(D = D_1 \cup D_2 \in \varSigma \) for the selected set \(D_2\). We have
$$\begin{aligned} \mu (D) = \mu (D_1) + \mu (D_2) {\left\{ \begin{array}{ll} = \mu (D_1) &{}\quad \text {if}\ \eta _2 = 0 \\ \in \left[ \varDelta \cdot \left( 1 + \frac{\delta }{2 \eta _2} \right) , \varDelta \right] &{}\quad \text {if}\ \eta _2 < 0, \end{array}\right. } \end{aligned}$$
and therefore \(\mu (D) \le \varDelta \). For every \(D' \in \varSigma \) with \(\mu (D') \le \varDelta \), we have
$$\begin{aligned} \int _D g \,\mathrm {d}\mu - \int _{D'} g \,\mathrm {d}\mu&= \int _{D {\setminus } D'} \underbrace{g}_{\le \eta _2} \,\mathrm {d}\mu - \int _{D' {\setminus } D} \underbrace{g}_{> \eta _1} \,\mathrm {d}\mu \\&\le \eta _2 \cdot \mu (D {\setminus } D') - \eta _1 \cdot \mu (D' {\setminus }D) \\&= \eta _2 \cdot \bigl ( \mu (D) - \mu (D \cap D') \bigr ) - \eta _1 \cdot \bigl ( \mu (D') - \mu (D \cap D') \bigr ) \\&= \underbrace{\eta _2 \cdot \mu (D)}_{{\le \eta _2 \cdot \varDelta + \frac{\delta }{2} \cdot \varDelta }} - \underbrace{\eta _1 \cdot \mu (D')}_{\ge \eta _1 \cdot \varDelta } + \underbrace{(\eta _1 - \eta _2) \cdot \mu (D \cap D')}_{\le 0} \\&\le \underbrace{(\eta _2 - \eta _1)}_{\le \delta / 2} \cdot \varDelta + \frac{\delta }{2} \cdot \varDelta \\&\le \delta \varDelta , \end{aligned}$$
which proves (9). \(\square \)
Using Procedure 1, we can state the main trust region loop in Algorithm 2. We allow for the use of an approximation \({\tilde{g}} \in L^1(\varOmega , \mu )\) of the gradient density \(g_U \in L^1(\varOmega , \mu )\), assuming that it can be made arbitrarily accurate according to the \(L^1\) norm.
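Since Algorithm 2 itself is not reproduced in this excerpt, the following Python sketch shows a main loop that is consistent with the quantities used in the proof of Theorem 3: the stationarity test with tolerance scaled by \(\omega \), the inexact subproblem solve, the acceptance ratio compared against \(\sigma _1\) and \(\sigma _2\), and halving of the trust-region radius upon rejection. All interfaces and the radius-increase rule are assumptions made for this sketch; it can be combined with the subproblem sketch given after (8).

```python
import numpy as np

def trust_region_loop(U0, evaluate_J, gradient_density, solve_subproblem,
                      Delta0, Delta_max, eps, sigma1=0.1, sigma2=0.9, omega=0.5):
    """Sketch of a main loop consistent with the proof of Theorem 3.
    All interfaces are hypothetical: evaluate_J(U) returns the objective for a
    set U of cell indices, gradient_density(U) returns per-cell values and
    measures of an approximation of g_U, and solve_subproblem(g, m, Delta)
    returns a step set D with mu(D) <= Delta and the predicted decrease."""
    U, Delta = set(U0), Delta0
    while True:
        g_vals, cell_measures = gradient_density(U)
        instat = float(np.sum(np.abs(np.minimum(0.0, g_vals)) * cell_measures))
        if instat <= (1.0 - omega / 3.0) * eps:       # stationarity test
            return U
        D, predicted = solve_subproblem(g_vals, cell_measures, Delta)
        U_trial = U ^ D                               # candidate set U triangle D
        rho = (evaluate_J(U_trial) - evaluate_J(U)) / predicted
        if rho >= sigma1:                             # sufficient decrease: accept step
            U = U_trial
            if rho >= sigma2:                         # very good model agreement: widen radius
                Delta = min(2.0 * Delta, Delta_max)
        else:                                         # poor agreement: reject, halve radius
            Delta = 0.5 * Delta
```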
Theorem 3
Let \(\varOmega \), \(\varSigma \), Y, J, \(\mu \), \(\nu \), \({\mathcal {J}}\), and \(C > 0\) satisfy Assumptions 1.1 to 1.7. Furthermore, let \(U_0 \in \varSigma \), \(\varDelta _{\max } \in (0, \mu (\varOmega ))\), \(\varDelta _0 \in (0, \varDelta _{\max })\), \(\varepsilon > 0\), \(0< \sigma _1 < \sigma _2 \le 1\), and \(\omega \in (0, 1)\) with \(\omega < \frac{3 - 3 \sigma _1}{3 - 2 \sigma _1}\). Let \({\mathcal {J}}(\varSigma )\) be bounded from below, and let \(L > 0\) be a constant such that
$$\begin{aligned} {\bigl \Vert J'(x) - J'(y) \bigr \Vert }_{Y^*} \le L \cdot {\Vert x - y\Vert }_Y \qquad \forall x, y \in {\text {conv}}\bigl ( \nu (\varSigma ) \bigr ). \end{aligned}$$
(10)
Then Algorithm 2 terminates in finite time and yields \(i \in {\mathbb {N}}_0\) and \(U_i \in \varSigma \) such that \({\mathcal {J}}(U_i) \le {\mathcal {J}}(U_0)\) and
$$\begin{aligned} \int _\varOmega \min \lbrace 0, g_{U_i} \rbrace \,\mathrm {d}\mu > -\varepsilon . \end{aligned}$$
Proof
For given \(i \in {\mathbb {N}}_0\), let \(S_i :={\mathcal {L}}_{g_{U_i} < 0} \subseteq \varOmega \) and \({\tilde{S}}_i :={\mathcal {L}}_{{\tilde{g}}_i < 0} \subseteq \varOmega \). If the stationarity test in Line 4 succeeds, then
$$\begin{aligned}&\int _\varOmega \min \lbrace g_{U_i}, 0 \rbrace \,\mathrm {d}\mu \\&\quad = \int _{\varOmega } \min \lbrace {\tilde{g}}_i, 0 \rbrace \,\mathrm {d}\mu + \int _{{\tilde{S}}_i \cup S_i} \underbrace{\bigl ( \min \lbrace g_{U_i}, 0 \rbrace - \min \lbrace {\tilde{g}}_i, 0 \rbrace \bigr )}_{\ge -|g_{U_i} - {\tilde{g}}_i|} \,\mathrm {d}\mu \\&\quad > -\left( 1 - \frac{\omega }{3} \right) \varepsilon - \frac{\omega }{3} \frac{\varDelta _i}{\mu (\varOmega )} \varepsilon \\&\quad \ge -\varepsilon + \frac{\omega }{3} \varepsilon - \frac{\omega }{3} \varepsilon . \end{aligned}$$
We note that the initial choice of \(\varDelta _0 \le \varDelta _{\max } \le \mu (\varOmega )\) and the fact that \(\varDelta _i\) is never increased above \(\varDelta _{\max }\) guarantee that \(\frac{\varDelta _i}{\mu (\varOmega )} \le 1\) for all i. If the test fails, then
$$\begin{aligned} \int _\varOmega \min \lbrace g_{U_i}, 0 \rbrace \,\mathrm {d}\mu&\le -\left( 1 - \frac{\omega }{3} \right) \varepsilon + \frac{\omega }{3} \frac{\varDelta _i}{\mu (\varOmega )} \varepsilon \\&\le -\varepsilon + \frac{2 \omega }{3} \varepsilon . \end{aligned}$$
Therefore, the stationarity test succeeds only at iterates that satisfy the tolerance \(\varepsilon \), and it is guaranteed to succeed once an iterate is sufficiently close to stationarity. According to Lemma 9, \(D_i\) is such that for every \(D' \in \varSigma \) with \(\mu (D') \le \varDelta _i\),
$$\begin{aligned} \int _{D_i} {\tilde{g}}_i \,\mathrm {d}\mu \le \int _{D'} {\tilde{g}}_i \,\mathrm {d}\mu + \frac{\omega \varepsilon }{3} \frac{\varDelta _i}{\mu (\varOmega )}. \end{aligned}$$
According to Lemma 4, there exists a \(D^*\in \varSigma \) with \(\mu (D^*) \le \varDelta _i\) and
$$\begin{aligned} \int _{D^*} \min \lbrace {\tilde{g}}_i, 0 \rbrace \,\mathrm {d}\mu \le \frac{\varDelta _i}{\mu (\varOmega )} \int _\varOmega \min \lbrace {\tilde{g}}_i, 0 \rbrace \,\mathrm {d}\mu . \end{aligned}$$
With \(D' :=D^*\cap {\tilde{S}}_i\), we have
$$\begin{aligned} \int _{D_i} {\tilde{g}}_i \,\mathrm {d}\mu&\le \int _{D'} {\tilde{g}}_i \,\mathrm {d}\mu + \frac{\omega \varepsilon }{3} \frac{\varDelta _i}{\mu (\varOmega )} \nonumber \\&\le \frac{\varDelta _i}{\mu (\varOmega )} \int _\varOmega \min \lbrace {\tilde{g}}_i, 0 \rbrace \,\mathrm {d}\mu + \frac{\omega \varepsilon }{3} \frac{\varDelta _i}{\mu (\varOmega )} \nonumber \\&\le -\frac{\varDelta _i}{\mu (\varOmega )} \frac{3 - 2\omega }{3} \varepsilon < 0. \end{aligned}$$
(11)
For every accepted step, we therefore have
$$\begin{aligned} {\mathcal {J}}(U_{i + 1}) - {\mathcal {J}}(U_i) \le \sigma _1 \int _{D_i} {\tilde{g}}_i \,\mathrm {d}\mu \le -\frac{\varDelta _i}{\mu (\varOmega )} \frac{3 - 2 \omega }{3} \varepsilon \sigma _1 < 0. \end{aligned}$$
Next, we prove that there exists \({\bar{\varDelta }} > 0\) such that step \(D_i\) is accepted for all \(i \in {\mathbb {N}}\) with \(\varDelta _i \le {\bar{\varDelta }}\). From (10), we can invoke Lemma 3 to show that
$$\begin{aligned} \Bigl | {\mathcal {J}}(U_{i + 1}) - {\mathcal {J}}(U_i) - \int _{D_i} g_{U_i} \,\mathrm {d}\mu \Bigl | \le \frac{L C^2}{2} \mu (D_i)^2 \le \frac{L C^2 \varDelta _i^2}{2} \end{aligned}$$
(12)
for all i in which the stationarity test does not detect stationarity. For these iterations, we can then conclude that
$$\begin{aligned} \rho _i = \frac{{\mathcal {J}}(U_{i + 1}) - {\mathcal {J}}(U_i)}{\smash [b]{\underbrace{\int _{D_i} {\tilde{g}}_i \,\mathrm {d}\mu }_{< 0}}} \quad&{\mathop {\ge }\limits ^{(12)}}\quad \frac{\int _{D_i} {\tilde{g}}_i \,\mathrm {d}\mu + \frac{\omega \varepsilon \varDelta _i}{3 \mu (\varOmega )} + \frac{L C^2}{2} \cdot \varDelta _i^2}{\int _{D_i} {\tilde{g}}_i \,\mathrm {d}\mu } \\&{\mathop {\ge }\limits ^{(11)}}\quad 1 - \frac{\frac{\omega \epsilon \varDelta _i}{3 \mu (\varOmega )} + \frac{L C^2}{2} \cdot \varDelta _i^2}{\frac{\varDelta _i (3 - 2 \omega ) \varepsilon }{3 \mu (\varOmega )}} \\&\ge \quad 1 - \frac{\omega }{3 - 2 \omega } - \frac{3 L C^2 \mu (\varOmega )}{2 \varepsilon (3 - 2 \omega )} \cdot \varDelta _i. \end{aligned}$$
Note that \(1 - \frac{\omega }{3 - 2 \omega } > \sigma _1\) if and only if \(\omega < \frac{3 - 3 \sigma _1}{3 - 2 \sigma _1}\). Thus, we find that \(\rho _i \ge \sigma _1\) for all i with
$$\begin{aligned} \varDelta _i \le {\bar{\varDelta }} :=\frac{1 - \frac{\omega }{3 - 2 \omega } - \sigma _1}{\frac{3 L C^2 \mu (\varOmega )}{2 \varepsilon (3 - 2 \omega )}} = \frac{2 \varepsilon \cdot \bigl ( 3 - 3 \omega - (3 - 2 \omega ) \sigma _1 \bigr )}{3 L C^2 \mu (\varOmega )} > 0. \end{aligned}$$
This implies that there is never an endless loop of rejected steps. In addition, because \(\varDelta _i\) is only halved upon rejection, it follows that \(\varDelta _i \ge \frac{\min \lbrace {\bar{\varDelta }}, \varDelta _0 \rbrace }{2}\) for all i. When substituted into our prior estimate of the decrease per accepted step, this yields
$$\begin{aligned} {\mathcal {J}}(U_{i + 1}) - {\mathcal {J}}(U_i) \le -\frac{\min \lbrace {\bar{\varDelta }}, \varDelta _0 \rbrace \cdot (3 - 2 \omega ) \varepsilon \sigma _1}{6 \mu (\varOmega )} < 0. \end{aligned}$$
The fact that \({\mathcal {J}}\) is bounded from below therefore implies that the number of accepted steps is finite. Because an infinite sequence of consecutive rejected steps is impossible, the algorithm must terminate. The only manner in which it can do so is that the stationarity test succeeds after a finite number of steps. \(\square \)