Strong Stationarity for Optimal Control Problems with Non-smooth Integral Equation Constraints: Application to a Continuous DNN

Motivated by residual neural networks (ResNets), this paper studies optimal control problems constrained by a non-smooth integral equation associated with a fractional differential equation. Such non-smooth equations arise, for instance, in the continuous representation of fractional deep neural networks (DNNs). Here the underlying non-differentiable function is the ReLU or max function. The control enters in a nonlinear and multiplicative manner, and we additionally impose control constraints. Because of the presence of the non-differentiable mapping, the application of standard adjoint calculus is excluded. We derive strong stationarity conditions by relying on the limited differentiability properties of the non-smooth map. While traditional approaches smooth the non-differentiable function, no such smoothness is retained in our final strong stationarity system. Thus, this work also closes a gap which currently exists in continuous neural networks with ReLU type activation functions.

1. Introduction. In this paper, we establish strong stationary optimality conditions for the following control constrained optimization problem
\[
\text{(P)} \qquad \min_{(a,\ell)\in H^1(0,T;\mathbb{R}^{n\times n})\times H^1(0,T;\mathbb{R}^n)} J(y,a,\ell) \quad \text{s.t.} \quad \partial^\gamma y(t) = f(a(t)y(t)+\ell(t)) \ \text{a.e. in } (0,T), \quad y(0)=y_0, \quad \ell \in K,
\]
where $f:\mathbb{R}\to\mathbb{R}$ is a non-smooth non-linearity. The objective functional is given by
\[
J(y,a,\ell) := g(y(T)) + \tfrac{1}{2}\|a\|^2_{H^1(0,T;\mathbb{R}^{n\times n})} + \tfrac{1}{2}\|\ell\|^2_{H^1(0,T;\mathbb{R}^n)},
\]
where $g:\mathbb{R}^n\to\mathbb{R}$ is a continuously differentiable function. The values $\gamma\in(0,1)$ and $y_0\in\mathbb{R}^n$ are fixed, and the set $K\subset H^1(0,T;\mathbb{R}^n)$ is convex and closed. The symbol $\partial^\gamma$ denotes the fractional time derivative; more details are provided in the forthcoming sections. Notice that the entire discussion in this paper also extends (and is new) for the case $\gamma=1$, i.e., the standard time derivative. This has been substantiated with the help of several remarks throughout the paper.

Recently, optimal control of fractional ODEs/PDEs has received significant interest; we refer to the articles [1,7] and the references therein. The most generic framework is considered in [5]. However, none of these articles deals with the non-smooth setting presented in this paper. The essential feature of the problem under consideration is that the non-linearity $f$ is not necessarily Gâteaux-differentiable, so that the standard methods for the derivation of qualified optimality conditions are not applicable here. In view of our goal to establish strong stationarity, the main novelties in this paper arise from:
• the presence of the fractional time derivative;
• the fact that the controls appear in the argument of the non-smooth nonlinearity $f$;
• the presence of control constraints (in this context, we are able to prove strong stationarity without resorting to unverifiable "constraint qualifications").
All these challenges appear in applications concerned with the control of neural networks. The non-smooth and nonlinear function $f$ encompasses functions such as max or ReLU arising in deep neural networks (DNNs). The objective function $J$ encompasses a generic class of functionals such as cross entropy and least squares. In fact, the optimal control problem (P) is motivated by residual neural networks [24,33] and fractional deep neural networks [3,4,6]. The control constraints can capture the bias ordering notion recently introduced in [2]. All existing approaches in the neural network setting assume differentiability of $f$ when deriving the gradients via backpropagation. No such smoothness conditions are assumed in this paper.
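Although the analysis in this paper is purely theoretical, the mild (integral) formulation of the state equation lends itself to a simple numerical illustration of the "continuous DNN" viewpoint. The following Python sketch is our own illustration (the discretization, function names, and parameters are not taken from the paper): it propagates an input through a fractional ResNet by approximating $y(t) = y_0 + \frac{1}{\Gamma(\gamma)}\int_0^t (t-s)^{\gamma-1} f(a(s)y(s)+\ell(s))\,ds$ with $f = \mathrm{ReLU}$, integrating the weakly singular kernel exactly on each subinterval.

```python
import numpy as np
from math import gamma

def relu(z):
    return np.maximum(z, 0.0)

def fractional_resnet_forward(y0, a, ell, gam=0.5, T=1.0):
    """Approximate the mild solution
        y(t) = y0 + (1/Gamma(gam)) * int_0^t (t-s)^(gam-1) relu(a(s)y(s) + ell(s)) ds
    on a uniform grid; a has shape (N, n, n) and ell has shape (N, n)."""
    N = a.shape[0]
    t = np.linspace(0.0, T, N + 1)
    y = np.zeros((N + 1, y0.size))
    y[0] = y0
    g = np.zeros((N, y0.size))  # layer outputs relu(a_j y_j + ell_j)
    for k in range(1, N + 1):
        g[k - 1] = relu(a[k - 1] @ y[k - 1] + ell[k - 1])
        # integrate the kernel (t_k - s)^(gam-1) exactly on each [t_j, t_{j+1}]
        w = ((t[k] - t[:k]) ** gam - (t[k] - t[1:k + 1]) ** gam) / gamma(gam + 1)
        y[k] = y0 + w @ g[:k]
    return y
```

For $a \equiv 0$ and a constant positive $\ell$, the scheme reproduces the exact solution $y_0 + \ell\, t^\gamma/\Gamma(\gamma+1)$, which makes it easy to sanity-check.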
Deriving necessary optimality conditions is a challenging issue even in finite dimensions, where special attention is given to MPCCs (mathematical programs with complementarity constraints). In [34] a detailed overview of various optimality conditions of different strength was given; see also [27] for the infinite-dimensional case. The most rigorous stationarity concept is strong stationarity. Roughly speaking, the strong stationarity conditions involve an optimality system which is equivalent to the purely primal conditions saying that the directional derivative of the reduced objective in feasible directions is nonnegative (which is referred to as B-stationarity).
While there are plenty of contributions in the field of optimal control of smooth problems, see e.g. [38] and the references therein, fewer papers deal with non-smooth problems. Most of these works resort to regularization or relaxation techniques to smooth the problem, see e.g. [8,28] and the references therein. The optimality systems derived in this way are of intermediate strength and are not expected to be of strong stationary type, since one always loses information when passing to the limit in the regularization scheme. Thus, proving strong stationarity for optimal control of non-smooth problems requires direct approaches, which employ the limited differentiability properties of the control-to-state map. In this context, there are even fewer contributions. Based on the pioneering work [31] (strong stationarity for optimal control of elliptic VIs of obstacle type), most of them focus on elliptic VIs [12,18,26,32,39,40]; see also [14] (parabolic VIs of the first kind) and the more recent contribution [15] (evolutionary VIs). Regarding strong stationarity for optimal control of non-smooth PDEs, the literature is rather scarce, and the only papers known to the authors addressing this issue so far are [30] (parabolic PDE), [11,16,17] (elliptic PDEs) and [10] (coupled PDE system). We point out that, in contrast to our problem, all the above mentioned works feature controls which appear outside the non-smooth mapping. Moreover, none of these contributions deals with a fractional time derivative.
Let us give an overview of the structure and the main results of this paper. After introducing the notation, we present in section 2 some fractional calculus results which are needed throughout the paper. Then, section 3 focuses on the analysis of the state equation in (P). Here we address the existence and uniqueness of so-called mild solutions, i.e., solutions of the associated Volterra integral equation. The properties of the respective control-to-state operator are investigated. In particular, we are concerned with the directional differentiability of the solution mapping of the non-smooth integral equation associated with the fractional differential equation in (P).
While optimal control of nonlinear (and smooth) integral equations has attracted much attention, see, e.g., [13,23,41], to the best of our knowledge, the sensitivity analysis of non-smooth integral equations has not yet been investigated in the literature.
In section 4 we show that the mild solution found in the previous section is in fact strong. That is, the unique solution to the state equation in (P) is absolutely continuous, and it thus possesses a so-called Caputo derivative. We underline that the only paper known to the authors which deals with optimal control and proves the existence of strong solutions in the framework of fractional differential equations is [5]. In [5], the absolute continuity of the mild solution of a fractional in time PDE (state equation) is shown by imposing pointwise (time-dependent) bounds on the time derivative of the control, which then carry over to the time derivative of the state. We point out that we do not need such bounds in our case. Moreover, the result in this section stands on its own and is in fact not employed in the upcoming sections. However, it adds to the key novelties of the present paper.
Section 5 focuses on the main contribution of this paper, namely the strong stationarity for the optimal control of (P). Via a classical smoothening technique, we first prove an auxiliary result (Lemma 5.1) which will serve as an essential tool in the context of establishing strong stationarity. Our main Theorem 5.7 is then shown by extending the "surjectivity" trick from [10,30]. In this context, we resort to a verifiable "constraint qualification" (CQ). We underline that this is satisfied by state systems describing neural networks with the max or ReLU function. In addition, there are many other settings where the CQ can be checked a priori, as pointed out in Remark 5.4 below. In the general case, this CQ is the price to pay for imposing constraints on the control $\ell$ (and not on the control $a$), see Remark 5.12. As already emphasized in contributions where strong stationarity is investigated, CQs are to be expected when examining control constrained problems [11,39], or they may be required by the complex nature of the state system [10]. At the end of section 5 we gather some important remarks regarding the main result. A fundamental aspect resulting from the findings in this paper is that, when it comes to strong stationarity, the presence of more than one control allows us to impose control constraints without having to resort to unverifiable CQs, see Remark 5.13. Finally, we include in Appendix A the proof of Lemma 5.1, for the convenience of the reader.
Notation. Throughout the paper, $T > 0$ is a fixed final time and $n \in \mathbb{N}$ is a fixed dimension. By $\|\cdot\|$ we denote the Frobenius norm. If $X$ and $Y$ are linear normed spaces, $X \hookrightarrow\hookrightarrow Y$ means that $X$ is compactly embedded in $Y$, while $X \overset{d}{\hookrightarrow} Y$ means that $X$ is densely embedded in $Y$. The dual space of $X$ will be denoted by $X^*$. For the dual pairing between $X$ and $X^*$ we write $\langle\cdot,\cdot\rangle_X$. If $X$ is a Hilbert space, $(\cdot,\cdot)_X$ stands for the associated scalar product. The closed ball in $X$ around $x \in X$ with radius $\alpha > 0$ is denoted by $B_X(x,\alpha)$. With a little abuse of notation, the Nemytskii operators associated with the mappings considered in this paper will be denoted by the same symbol, even when considered with different domains and ranges. We sometimes use the notation $h \lesssim g$ to denote $h \leq C g$ for some constant $C > 0$, when the dependence of the constant $C$ on some physical parameters is not relevant.

2. Preliminaries.
In this section we gather some fractional calculus tools that are needed for our analysis.

Definition 2.1 (Left and right Riemann-Liouville fractional integrals). For $\varphi \in L^1(0,T;\mathbb{R}^n)$ and $\gamma \in (0,1)$, we define
\[
(I^{\gamma}_{0+}\varphi)(t) := \frac{1}{\Gamma(\gamma)} \int_0^t (t-s)^{\gamma-1} \varphi(s)\,ds, \qquad
(I^{\gamma}_{T-}\varphi)(t) := \frac{1}{\Gamma(\gamma)} \int_t^T (s-t)^{\gamma-1} \varphi(s)\,ds.
\]
Here $\Gamma$ is the Euler Gamma function.
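A discrete counterpart of Definition 2.1 may help fix ideas. The sketch below is our own illustration (helper names are hypothetical): it evaluates the left Riemann-Liouville integral on a grid by integrating the weakly singular kernel exactly against a piecewise-constant interpolant, and it can be used to confirm numerically the semigroup property $I^{\alpha}_{0+} I^{\beta}_{0+} = I^{\alpha+\beta}_{0+}$ invoked later in the paper.

```python
import numpy as np
from math import gamma

def rl_integral(phi_vals, t, gam):
    """Left Riemann-Liouville fractional integral (I^gam_{0+} phi)(t_k) on a grid.
    phi is approximated by its left-endpoint values phi_vals[j] = phi(t[j]);
    the kernel (t_k - s)^(gam - 1) is integrated exactly on each subinterval."""
    N = len(t) - 1
    out = np.zeros(N + 1)
    for k in range(1, N + 1):
        w = ((t[k] - t[:k]) ** gam - (t[k] - t[1:k + 1]) ** gam) / gamma(gam + 1)
        out[k] = w @ phi_vals[:k]
    return out
```

For $\varphi \equiv 1$ the rule is exact and returns $t^\gamma/\Gamma(\gamma+1)$; nesting $I^{0.3}_{0+} I^{0.7}_{0+}$ on $\varphi \equiv 1$ recovers $I^1_{0+}1 = t$ up to the quadrature error.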
Proof. By assumption, we have $(\gamma-1)r+1 > 0$, and the claim follows from elementary calculations. The proof is complete.
3. Control-to-state operator. In this section we address the properties of the solution operator of the state equation
\[
\partial^\gamma y(t) = f(a(t)y(t)+\ell(t)) \ \text{a.e. in } (0,T), \qquad y(0)=y_0. \tag{3.1}
\]
Throughout the paper, $\gamma \in (0,1)$, unless otherwise specified. For all $z \in \mathbb{R}^n$, the non-linearity $f:\mathbb{R}^n\to\mathbb{R}^n$ is understood componentwise, i.e., $f(z) := (f(z_1),\dots,f(z_n))^\top$, where $f:\mathbb{R}\to\mathbb{R}$ is a non-smooth nonlinear function. For convenience we will denote both non-smooth functions by $f$; from the context it will always be clear which one is meant.

Assumption 3.1. For the non-smooth mapping appearing in (P) we require:
1. The non-linearity $f:\mathbb{R}^n\to\mathbb{R}^n$ is globally Lipschitz continuous with constant $L > 0$.
2. The function $f$ is directionally differentiable at every point, i.e.,
\[
\lim_{\tau \searrow 0} \frac{f(z+\tau\,\delta z)-f(z)}{\tau} = f'(z;\delta z) \quad \forall\, z, \delta z \in \mathbb{R}^n.
\]

As a consequence of Assumption 3.1, we call $y \in C([0,T];\mathbb{R}^n)$ a mild solution of the state equation (3.1) if it satisfies the following integral equation
\[
y(t) = y_0 + \frac{1}{\Gamma(\gamma)} \int_0^t (t-s)^{\gamma-1} f(a(s)y(s)+\ell(s))\,ds \quad \forall\, t \in [0,T]. \tag{3.3}
\]

Remark 3.3. One sees immediately that if $\gamma = 1$ then $y \in W^{1,\infty}(0,T;\mathbb{R}^n)$ and the mild solution is in fact a strong solution of the following ODE
\[
y'(t) = f(a(t)y(t)+\ell(t)) \ \text{a.e. in } (0,T), \qquad y(0) = y_0.
\]

Proposition 3.4 (Existence and uniqueness of mild solutions). For every $(a,\ell) \in H^1(0,T;\mathbb{R}^{n\times n}) \times H^1(0,T;\mathbb{R}^n)$, there exists a unique mild solution $y \in C([0,T];\mathbb{R}^n)$ of the state equation (3.1).
Proof. To show the existence of a mild solution in the general case $\gamma \in (0,1)$, we define the operator
\[
F(z)(t) := y_0 + \frac{1}{\Gamma(\gamma)} \int_0^t (t-s)^{\gamma-1} f(a(s)z(s)+\ell(s))\,ds,
\]
where $t^*$ will be chosen such that $F$ is a contraction. Indeed, according to Lemma 2.5, $F$ is well-defined (since $f$ maps bounded sets to bounded sets). Moreover, by applying (2.1) and using Assumption 3.1, we see that $F$ is a contraction on $C([0,t^*];\mathbb{R}^n)$ for $t^* > 0$ small enough. If $t^* \geq T$, the proof is complete. Otherwise we fix $t^*$ as above and conclude that $z = F(z)$ admits a unique solution in $C([0,t^*];\mathbb{R}^n)$, which, for later purposes, is denoted by $y$. To prove that this solution can be extended to the whole given interval $[0,T]$, we use a concatenation argument. Using a simple coordinate transform, we can apply (2.1) again on the interval $(t^*, 2t^*)$.

The solution operator $S$ is locally Lipschitz continuous in the following sense: for every $M > 0$ there exists a constant $L_M > 0$ such that the estimate in Proposition 3.5 holds.

Proof. First we show that $S$ maps bounded sets to bounded sets. To this end, let $M > 0$ and $(a,\ell)$ with norms bounded by $M$ be given; the corresponding estimate follows from Assumption 3.1.1 and Lemma 2.7 with $r = 1$. By means of Lemma 2.6 we deduce the asserted bound. Now, let $M > 0$ be arbitrary but fixed. Subtracting the integral formulations associated with each $k$, $k = 1, 2$, and using the Lipschitz continuity of $f$ with constant $L$ yields, for $t \in [0,T]$, an estimate in which we used (3.5) and Lemma 2.7 with $r = 1$. Now, Lemma 2.6 implies the desired local Lipschitz estimate, which completes the proof.
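The fixed-point mechanism in the proof above can be mimicked numerically. The sketch below is our own illustration (not from the paper): it runs the Picard iteration $y_{m+1} = y_0 + I^\gamma_{0+} f(a\,y_m + \ell)$ on a grid with $f = \mathrm{ReLU}$ and stops once successive iterates coincide up to a tolerance, which is exactly the contraction argument in discrete form.

```python
import numpy as np
from math import gamma

def relu(z):
    return np.maximum(z, 0.0)

def picard_mild_solution(y0, a, ell, gam, t, max_iter=200, tol=1e-12):
    """Picard iteration y_{m+1} = y0 + I^gam_{0+} relu(a y_m + ell) on a grid,
    mirroring the contraction argument for the mild solution.
    a: (N+1, n, n) and ell: (N+1, n) are sampled at the grid points t."""
    N = len(t) - 1
    y = np.tile(np.atleast_1d(y0).astype(float), (N + 1, 1))
    for _ in range(max_iter):
        g = relu(np.einsum('kij,kj->ki', a, y) + ell)  # relu(a(t) y(t) + ell(t))
        y_new = np.tile(np.atleast_1d(y0).astype(float), (N + 1, 1))
        for k in range(1, N + 1):
            # exact integration of the weakly singular kernel on each subinterval
            w = ((t[k] - t[:k]) ** gam - (t[k] - t[1:k + 1]) ** gam) / gamma(gam + 1)
            y_new[k] = y_new[0] + w @ g[:k]
        if np.max(np.abs(y_new - y)) < tol:
            return y_new
        y = y_new
    return y
```

For a scalar example with $a \equiv 0.5$, $\ell \equiv 0.2$ on $[0,1]$, the discrete fixed-point map has Lipschitz constant roughly $0.5/\Gamma(\gamma+1) < 1$, so the iteration converges without any interval splitting.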
Theorem 3.6 ($S$ is directionally differentiable). The control-to-state operator $S$ is directionally differentiable, with directional derivative given by the unique solution $\delta y \in C([0,T];\mathbb{R}^n)$ of the following integral equation
\[
\delta y(t) = \frac{1}{\Gamma(\gamma)} \int_0^t (t-s)^{\gamma-1} f'\big((ay+\ell)(s); (a\,\delta y + \delta a\, y + \delta\ell)(s)\big)\,ds, \tag{3.6}
\]
i.e., $\delta y = S'((a,\ell);(\delta a,\delta\ell))$ for all $(a,\ell), (\delta a,\delta\ell) \in H^1(0,T;\mathbb{R}^{n\times n}) \times H^1(0,T;\mathbb{R}^n)$.

Proof. We first show that (3.6) is uniquely solvable. To this end, we argue as in the proof of Proposition 3.4. From Lemma 2.5 we know that the associated fixed-point operator is well defined, since $f'(ay+\ell;\cdot)$ maps bounded sets to bounded sets. By employing the Lipschitz continuity of $f'(ay+\ell;\cdot)$ with constant $L$, one obtains the exact same estimate as in the proof of Proposition 3.4, and the remaining arguments stay the same.
Now, let us take a closer look at the remaining term, where, in the last inequality, we used the Lipschitz continuity of $S$, cf. Proposition 3.5. Going back to (3.7), we obtain the desired estimate, where we relied again on Lemma 2.7 with $r = 1$. In light of (3.8), Lemma 2.6 finally implies the asserted convergence. The proof is now complete.
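For the ReLU nonlinearity, the directional derivative required in Assumption 3.1 is available in closed form, and the defining limit is easy to check numerically. The sketch below is our own illustration (the formula for $f'(z;\delta z)$ is the standard one for $\max\{0,\cdot\}$ and is not spelled out in this excerpt):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_dir_deriv(z, dz):
    """Directional derivative of the componentwise ReLU:
    f'(z; dz) = dz where z > 0, 0 where z < 0, and max(dz, 0) where z = 0.
    Positively homogeneous in dz, but not linear at z = 0."""
    return np.where(z > 0, dz, np.where(z < 0, 0.0, np.maximum(dz, 0.0)))
```

Note that $f'(0;\cdot)$ is not linear, which is precisely why standard adjoint calculus is excluded for this problem.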
Definition 4.1. We say that $y \in W^{1,1}(0,T;\mathbb{R}^n)$ is a strong solution to (4.1) if $y(0) = y_0$ and (4.1) is satisfied a.e. in $(0,T)$.

The following well known result is a consequence of the identity $I^\gamma_{0+} I^{1-\gamma}_{0+} y' = I^1_{0+} y' = y - y_0$, which is implied by the semigroup property of the fractional integrals, cf. e.g. [29, Lem. 2.3], and Definition 2.1.

Lemma 4.2. A function $y \in W^{1,1}(0,T;\mathbb{R}^n)$ is a strong solution of (4.1) if and only if it satisfies the integral formulation (3.3).

Theorem 4.3. The mild solution of (3.1) is the unique strong solution of (4.1), and it satisfies $y \in W^{1,\zeta}(0,T;\mathbb{R}^n)$ with $\zeta$ as below.

Proof. Let $t \in [0,T]$ and $h \in (0,1]$ be arbitrary but fixed. Note that the existence of a unique solution $y \in C([0,T+1];\mathbb{R}^n)$ is guaranteed by Proposition 3.4; this solution coincides with the mild solution of (3.1) on the interval $[0,T]$. From (3.3) we obtain a decomposition of the difference quotient into two terms $z_1$ and $z_2$. For $z_1$ and $z_2$ we find estimates in which we relied on the fact that $y$, $a$, and $\ell$ are essentially bounded. Altogether we arrive at (4.2) for all $t \in [0,T]$ and all $h \in (0,1]$. Since $(a,\ell) \in W^{1,2}(0,T;\mathbb{R}^{n\times n}) \times W^{1,2}(0,T;\mathbb{R}^n)$, and $\zeta = \min\{r, 2\}$, where $r$ is given by Lemma 2.7, we can estimate $B_h$ as in (4.3), where we relied on Lemmas 2.5 and 2.7,
for all $t \in [0,T]$ and all $h \in (0,1]$. Using the monotone convergence theorem, we can exchange the order of integration and summation, where we used the definition of $I^{n\gamma}_{0+}$ from Definition 2.1. Applying Lemma 2.5, we obtain a bound in terms of
\[
E_\gamma(x) := \sum_{n=0}^{\infty} \frac{x^n}{\Gamma(n\gamma+1)} < \infty,
\]
the celebrated Mittag-Leffler function; note that here we used $n\gamma\,\Gamma(n\gamma) = \Gamma(n\gamma+1)$. Since $\{B_h\}$ is uniformly bounded in $L^\zeta(0,T;\mathbb{R})$, see (4.3), we obtain that the difference quotients of $y$ are uniformly bounded in $L^\zeta(0,T;\mathbb{R})$ with respect to $h \in (0,1]$. Hence, $y$ has a weak derivative in $L^\zeta(0,T;\mathbb{R})$ by [21, Thm. 3, p. 277]. The proof is now complete.

Remark 4.4. We remark that the degree of smoothness of the right-hand sides $a, \ell$ does not necessarily carry over to the strong solution $y$ (unless a certain compatibility condition is satisfied, see Remark 4.5 below). This is in accordance with observations made in the literature, see e.g. [19, Ex. 6.4, Rem. 6.13, Thm. 6.27] (fractional ODEs) and [36, Cor. 2.2] (fractional in time PDEs). Indeed, for small values of $\gamma$ tending to 0, the strong solution satisfies $y \in W^{1,\zeta}(0,T;\mathbb{R}^n)$, where $\zeta = r \in (1, 1/(1-\gamma))$ is close to 1. However, as $\gamma$ approaches the value 1, one can expect the strong solutions to become as regular as their right-hand sides. This can be seen in the case $\gamma = 1$, where the smoothness of the strong solution improves as the smoothness of $a, \ell$ does so. Note that in this particular situation the solution of (4.1) is in fact far more regular than in the statement of Theorem 4.3, see Remark 3.3.

Remark 4.5 (Compatibility condition). If $f(a(0)y_0 + \ell(0)) = 0$, then the regularity of the strong solution to (4.1) can be improved by looking at the equation satisfied by the weak derivative $y'$ and inspecting its smoothness. Since the focus of this paper lies on the optimal control and not on the analysis of fractional equations, we do not give a proof here. We just remark that the requirement $f(a(0)y_0 + \ell(0)) = 0$ corresponds to the one in e.g. [19, Thm. 6.26], cf. also [36, Cor. 2.2], where it is proven that the smoothness of the derivative of the strong solution improves if and only if such a compatibility condition is true.
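The Mittag-Leffler function appearing in the estimates above admits a straightforward truncated-series evaluation. The sketch below is our own illustration (parameter names hypothetical) and uses the normalization $E_\gamma(x) = \sum_{k \geq 0} x^k/\Gamma(k\gamma+1)$ consistent with the identity $n\gamma\,\Gamma(n\gamma) = \Gamma(n\gamma+1)$ used in the proof:

```python
from math import gamma, exp, cosh

def mittag_leffler(x, gam, n_terms=40):
    """Truncated series E_gam(x) = sum_{k>=0} x^k / Gamma(gam*k + 1).
    Keep gam * n_terms below ~170 so math.gamma does not overflow;
    for moderate |x| the truncation error is already negligible."""
    return sum(x ** k / gamma(gam * k + 1) for k in range(n_terms))
```

Two classical sanity checks: $E_1$ recovers the exponential, and $E_2(x^2) = \cosh(x)$.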

5. Strong Stationarity.
The first result in this section will be an essential tool for establishing the strong stationarity in Theorem 5.7 below, as it guarantees the existence of a multiplier satisfying both a gradient equation and an inequality associated to a local minimizer of (P).
Proof. The technical proof can be found in Appendix A.

The next step towards the derivation of our strong stationary system is to write the first order necessary optimality conditions in primal form.
Proof. The result follows from the continuous differentiability of $g$ combined with the directional differentiability of $S$, see Theorem 3.6, and the local optimality of $(\bar a, \bar\ell)$.
Remark 5.4. Let us underline that there is a zoo of situations where the requirement in Assumption 5.3 is fulfilled. We enumerate just a few in what follows.
• If there exists some index $m \in \{1,\dots,n\}$ such that $y_{0,m} > 0$ and $f(z) \geq 0$ for all $z \in \mathbb{R}$, then the optimal state satisfies $\bar y_m(t) \geq y_{0,m} > 0$ for all $t \in [0,T]$, in view of (3.3). In particular, our 'constraint qualification' is fulfilled by continuous fractional deep neural networks (DNNs) with ReLU activation function, since $f = \max\{0,\cdot\}$ in this case, while an additional initial datum can be chosen so that $y_{0,m} > 0$.
• Similarly, if there exists some index $m \in \{1,\dots,n\}$ such that $y_{0,m} < 0$ and $f(z) \leq 0$ for all $z \in \mathbb{R}$, then the optimal state satisfies $\bar y_m(t) \leq y_{0,m} < 0$ for all $t \in [0,T]$. In both situations, the CQ in Assumption 5.3 is satisfied.
• If there exists some index $m \in \{1,\dots,n\}$ such that $y_{0,m} = 0$ and $f(\bar\ell_m(t)) = 0$ for all $t \in [0,T]$, then, according to [19, Thm. 6.14], the optimal state satisfies $\bar y_m(t) = 0$ for all $t \in [0,T]$. This is the case if, e.g., $f = \max\{0,\cdot\}$ and $\bar\ell_m(t) \leq 0$ for all $t \in [0,T]$.

Remark 5.5. We point out that Assumption 5.3 is due to the structure of the state equation and due to the fact that constraints are imposed on the control $\ell$ (and not on the control $a$), see Remark 5.12 below for more details. The claim concerning the optimal state in Assumption 5.3 is essential for deriving the strong stationary optimality system (5.6) below, and it plays the role of a 'constraint qualification' (CQ), cf. e.g. [38, Ch. 6]. This terminology has its roots in finite-dimensional nonlinear optimization, where it describes a condition on the (unknown) local optimizer which guarantees the existence of Lagrange multipliers such that a KKT-system is satisfied, see e.g. [22, Sec. 2]. In the non-smooth case, the KKT conditions correspond to the strong stationary optimality conditions, see Remark 5.9 below.
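The first scenario above can be illustrated numerically: since $f = \max\{0,\cdot\} \geq 0$ and the kernel $(t-s)^{\gamma-1}$ is positive, every contribution added to $y_{0,m}$ in (3.3) is nonnegative, so $\bar y_m(t) \geq y_{0,m}$ holds regardless of the signs of $a$ and $\ell$. The sketch below is our own illustration (scheme and parameters are not from the paper) and checks this lower bound on a discretized state equation:

```python
import numpy as np
from math import gamma

def relu(z):
    return np.maximum(z, 0.0)

def solve_state(y0, a, ell, gam, t):
    """Explicit product-rectangle scheme for the integral form (3.3) with f = ReLU."""
    N = len(t) - 1
    y = np.tile(np.atleast_1d(y0).astype(float), (N + 1, 1))
    g = np.zeros((N, y.shape[1]))
    for k in range(1, N + 1):
        g[k - 1] = relu(a[k - 1] @ y[k - 1] + ell[k - 1])
        w = ((t[k] - t[:k]) ** gam - (t[k] - t[1:k + 1]) ** gam) / gamma(gam + 1)
        y[k] = y[0] + w @ g[:k]
    return y
```

Because ReLU is nonnegative and the quadrature weights are positive, the discrete state inherits the bound $y_m(t_k) \geq y_{0,m}$ exactly, mirroring the continuous argument.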
The following result describes the density of the set of arguments into which the non-smoothness is differentiated in the "linearized" state equation (3.6). This aspect is crucial in the context of proving strong stationarity for the control of non-smooth equations, cf. [10,30] and Remark 5.12 below.

Lemma 5.6 (Density of the set of arguments of $f'((\bar a\bar y + \bar\ell)_i(t);\cdot)$). Let $(\bar a, \bar\ell)$ be a given local optimum for (P) with associated state $\bar y := S(\bar a, \bar\ell)$. Under Assumption 5.3, the set of directions into which $f$ is differentiated in the "linearized" state equation associated to $(\bar a, \bar\ell)$ is dense in a suitable Bochner space.

Proof. Let $\rho \in C([0,T];\mathbb{R}^n)$ be arbitrary but fixed, and define the function $\delta y$ accordingly. Note that $\delta y \in C([0,T];\mathbb{R}^n)$, in view of Lemma 2.5. We will now construct $\delta a$ such that the desired identity holds. This is possible due to Assumption 5.3. Indeed, for $j = 1,\dots,n$ and $t \in [0,T]$ we can define $\delta a_{jm}$ accordingly; note that $\delta a_{jm} \in C[0,T]$. Due to (5.3), $\delta y$ satisfies the integral equation from Theorem 3.6, which is equivalent to $\delta y = S'((\bar a,\bar\ell);(\delta a,\delta\ell))$. In view of Proposition 3.5 and Theorem 3.6, the mapping $S$ is locally Lipschitz continuous and directionally differentiable. Hence, the mapping $S'((\bar a,\bar\ell);\cdot)$ is continuous, where we recall (5.4). This gives in turn the desired density, in view of (5.3). Since $\rho \in C([0,T];\mathbb{R}^n)$ was arbitrary, the proof is now complete.
The main finding of this paper is stated in the following result.
The optimality system in Theorem 5.7 is indeed of strong stationary type, as the next result shows.

Theorem 5.10 (Equivalence between B- and strong stationarity). Let $(\bar a, \bar\ell) \in H^1(0,T;\mathbb{R}^{n\times n}) \times K$ be given and let $\bar y := S(\bar a, \bar\ell)$ be its associated state. If there exist a multiplier $\lambda \in L^r(0,T;\mathbb{R}^n)$ and an adjoint state $p \in L^r(0,T;\mathbb{R}^n)$, where $r$ is given by Lemma 2.7, such that (5.6) is satisfied, then $(\bar a, \bar\ell)$ also satisfies the variational inequality (5.2). Moreover, if Assumption 5.3 is satisfied, then (5.2) is equivalent to (5.6).
Remark 5.11 (Strong stationarity in the case $\gamma = 1$). If $\gamma = 1$, then the state $\bar y$ associated to the local optimum $(\bar a, \bar\ell) \in H^1(0,T;\mathbb{R}^{n\times n} \times \mathbb{R}^n)$ belongs to $W^{2,2}(0,T;\mathbb{R}^n)$; this is a consequence of the statement in Remark 3.3 combined with the fact that $f$ is globally Lipschitz continuous. Moreover, by taking a look at (5.6a), we see that the adjoint equation reduces to one involving the standard (backward in time) derivative.

Some comments regarding the main result. We end this section by collecting some important remarks concerning Theorem 5.7.
Remark 5.12 (Density of the set of arguments of $f'((\bar a\bar y + \bar\ell)_i(t);\cdot)$). The proof of Theorem 5.7 shows that it is essential that the set of directions into which the non-smooth mapping $f$ is differentiated, in the "linearized" state equation associated to $(\bar a,\bar\ell)$, is dense in a (suitable) Bochner space (which is the assertion of Lemma 5.6). This has also been pointed out in [10, Rem. 2.12], where strong stationarity for a coupled non-smooth system is proven.
Let us underline that the "constraint qualification" in Assumption 5.3 is not only due to the structure of the state equation, but also due to the presence of constraints on $\ell$. If constraints were imposed on $a$ instead of $\ell$, then there would be no need for a CQ in the sense of Assumption 5.3. An inspection of the proof of Theorem 5.7 shows that in this case one needs to show the analogous density property for the $\ell$-directions. This is done by arguing as in the proof of Lemma 5.6, where this time one defines $\delta\ell := \rho - \bar a\,\delta y$.
Thus, depending on the setting, the "constraint qualification" may vanish or may read completely differently [10, Assump. 2.6], but it should imply that the set of directions into which $f$ is differentiated, in the "linearized" state equation, is dense in an entire space [10, Lem. 2.8], see also [10, Rem. 2.12].
These observations are also consistent with the result in [30]. Therein, the direction into which one differentiates the non-smoothness, in the "linearized" state equation, is given by the "linearized" solution operator, such that the counterpart of our Lemma 5.6 is [30, Lem. 5.2]. In [30], there is no constraint qualification in the sense of Assumption 5.3; however, the density assumption [30, Assump. 2.1.6] can be regarded as such. In [16, Rem. 4.15] the authors also acknowledge the necessity of a density condition similar to that described above in order to ensure strong stationarity.

Remark 5.13 (Control constraints). We point out that we deal with controls $(a,\ell)$ mapping to $(\mathbb{R}^n)^{n+1}$, whereas the space of functions we want to cover in Lemma 5.6 consists of functions that map to $\mathbb{R}^n$ only. This allows us to restrict $n$ controls by constraints (if we look at (P) as having $n+1$ controls mapping to $\mathbb{R}^n$). Indeed, a closer inspection of the proof of Lemma 5.6 shows that one can impose control constraints on all columns of the control $a$ except the $m$-th column. This still implies that the set of directions into which $f$ is differentiated, in the "linearized" state equation, is dense in an entire space. The fact that two or more controls provide advantages in the context of strong stationarity has already been observed in [26, Sec. 4]. Therein, an additional control has to be considered on the right-hand side of the VI under consideration in order to be able to prove strong stationarity, see [26, Sec. 4] for more details.
The situation changes when, in addition to requiring $\ell \in K$, control constraints are imposed on all columns of $a$. In this case, we deal with a fully control constrained problem. By looking at the proof of Lemma 5.6, we see that the arguments cannot be applied in this case; see also [10,16,30,32], where the same observation was made. This calls for a different approach in the proof of Theorem 5.7 and additional "constraint qualifications" [11,39].
Remark 5.14 (Sign condition on the adjoint state; optimality conditions obtained via smoothening).
To see that (A.13) admits a unique solution, we argue as in the proof of Proposition 3.4. From Lemma 2.5 we know that the operator $z \mapsto I^\gamma_{T-}(a_\varepsilon f_\varepsilon'(a_\varepsilon y_\varepsilon + \ell_\varepsilon) z)$ maps continuous functions to continuous functions, since $a_\varepsilon, f_\varepsilon'(a_\varepsilon y_\varepsilon + \ell_\varepsilon) \in L^\infty(0,T;\mathbb{R}^{n\times n})$. However, the first term in (A.13) is only $L^r$-integrable, with $r$ given by Lemma 2.7. This means that (no matter how smooth $z$ is) the fixed-point operator associated with (A.13), namely
\[
z \mapsto \frac{(T-\cdot)^{\gamma-1}}{\Gamma(\gamma)}\,\nabla g(y_\varepsilon(T)) + I^\gamma_{T-}\big(a_\varepsilon f_\varepsilon'(a_\varepsilon y_\varepsilon + \ell_\varepsilon) z\big),
\]
maps only to $L^r(0,T;\mathbb{R}^n)$, with $r$ given by Lemma 2.7. Due to Lemma 2.5, we have for all $z_1, z_2 \in L^r(0,t^*;\mathbb{R}^n)$ a corresponding contraction estimate, and by arguing exactly as in the proof of Proposition 3.4 one obtains that (A.13) admits a unique solution $p_\varepsilon \in L^r(0,T;\mathbb{R}^n)$, with $r$ given by Lemma 2.7. This immediately implies that $\lambda_\varepsilon \in L^r(0,T;\mathbb{R}^n)$, since $f_\varepsilon'(a_\varepsilon y_\varepsilon + \ell_\varepsilon) \in L^\infty(0,T;\mathbb{R}^{n\times n})$.
Next, let $(a,\ell) \in H^1(0,T;\mathbb{R}^{n\times n}) \times K$ be arbitrary but fixed.

By concatenating the local solutions found on the intervals $[0,t^*]$ and $[t^*,2t^*]$, one obtains a unique continuous function on $[0,2t^*]$ which satisfies the integral equation (3.3). Proceeding further in the exact same way, one finds that (3.3) has a unique solution in $C([0,T];\mathbb{R}^n)$.

Proposition 3.5 ($S$ is locally Lipschitz). The solution operator $S$ associated to (3.1) is locally Lipschitz continuous.

4. Strong solutions. Next we prove that the state equation (4.1) admits a strong solution in the sense of Definition 4.1.