A space–time variational method for optimal control problems: well-posedness, stability and numerical solution

We consider an optimal control problem constrained by a parabolic partial differential equation with Robin boundary conditions. We use a space–time variational formulation in Lebesgue–Bochner spaces yielding a boundedly invertible solution operator. The abstract formulation of the optimal control problem yields the Lagrange function and Karush–Kuhn–Tucker conditions in a natural manner. This results in space–time variational formulations of the adjoint and gradient equation in Lebesgue–Bochner spaces, which are proven to be boundedly invertible. Necessary and sufficient optimality conditions are formulated and the optimality system is shown to be boundedly invertible. Next, we introduce a conforming uniformly stable simultaneous space–time (tensorproduct) discretization of the optimality system in these Lebesgue–Bochner spaces. Using finite elements of appropriate orders in space and time for trial and test spaces, this setting is known to be equivalent to a Crank–Nicolson time-stepping scheme for parabolic problems. Comparisons with existing methods are detailed. We show numerical comparisons with time-stepping methods. The space–time method shows good stability properties and requires fewer degrees of freedom in time to reach the same accuracy.


Introduction
The optimal control of partial differential equations (PDE) is an area of vast growing significance e.g. in fluid flows, crystal growths or medicine, see, e.g.[1,2].This explains the huge amount of literature concerning theoretical as well as numerical aspects.
The abstract form of such problems relies on a cost function J ∶ Y × U → R, where Y and U are function spaces for the state y and the control u.The constrained optimal control problem then takes the form J(y, u) → min!s.t.e(y, u) = 0, ( where the constraint e(y, u) = 0 is often termed as state equation.At this point, there is a bifurcation concerning the subsequent approach.On the one hand, the first-discretize-then-optimize approach seeks for an appropriate discretization of (1.1) and then derives optimality conditions for the discretized optimal control problem.On the other hand, first-optimize-then-discretize means that optimality conditions are derived directly w.r.t.(1.1) and then the arising optimality system is discretized.We shall follow the second approach.
First-optimize-then-discretize Within this approach, the first step is a suitable interpretation of the state equation.In case of a PDE-constrained optimal control problem, the state equation is a PDE.Here, we are interested in the case where the PDE is a parabolic problem in space and time.This offers a variety of different formulations of the state equation, e.g.
• Strong form: e(y, u) = 0 is interpreted pointwise.This, however does often not allow statements on the well-posedness of the state equation.• Semi-variational : Using a method of lines yields either an inital value problem of an ordinary differential equation or a system of elliptic boundary value problems.• Space-time variational : Space and time are both treated as variables in a variational sense.In that case, the state equation is tested by space-time test functions z ∈ Z, where Z is an appropriate Lebesgue-Bochner space, and takes the form, for a right-hand side f (⋅; u) ∈ Z ′ find y ∈ Y ∶ b(y, z) = f (z; u) for all z ∈ Z. (1.2)

Space-time variational formulations and adjoint problem
We follow the last-mentioned method in the above list.In the literature, this approach has already been studied, see e.g.[3][4][5][6][7][8], but with some (partly subtle) differences to our approach to be detailed below.The well-posedness theory for (1.2) dates back (at least) to the 1970s, see e.g.[9][10][11].In order to describe to which extent our approach differs from the mentioned papers, we need to detail the choices of the bilinear form b(⋅, ⋅) as well as the trial and test spaces Y and Z.We shall see that our subsequent choice allows to show well-posedness of (1.2) under minimal regularity assumptions, but this necessarily implies that Y = Z and these spaces need to satisfy an inf-sup condition, which is known to hold, [12,13].We postpone a detailed comparison in Section 2.2 (in particular Remark 2.8) below.
Another difference of the known approaches from the literature and our proposed framework is the derivation of an optimality system.For the details, we refer to Section 3.2, Remark 3.9 below.We use the variational form (1.2) to derive the reduced problem (w.r.t. the control), which allows us to prove the existence of a unique optimal solution.In a next step, we define the Lagrange function, again using the variational form (1.2), from which we then derive the Karush-Kuhn-Tucker (KKT) conditions.The adjoint problem arises in a natural variational form by the KKT system, see Proposition 3.3 below.

Space-time discretization
In a final step, we propose a conforming space-time discretization, which amounts to construct finite-dimensional spaces Y δ ⊂ Y and Z δ ⊂ Z for a Petrov-Galerkin discretization of (1.2) and later also the control space U δ ⊂ U. Since Y = Z, the discrete spaces Y δ and Z δ need to satisfy a discrete inf-sup condition, also known as Ladyshenskaja-Babuška-Brezzi (LBB) condition, i.e., inf uniformly in δ (where β is independent of δ).The inf-sup constant β is particularly relevant as the Xu-Zikatanov lemma [14] yields an error/residual-relation with the multiplicative factor 1 β .In some cases, one can realize optimally stable discretizations, i.e., β = 1 (in particular, the constant is independent of the final time, which is crucial for optimal control problems), [13,15,16].This is a key motivation for our approach.However, there are different stable discretizations described in the literature.For example, in [12,17,18] wavelet methods have been used to derive an LBB-stable discretization, [13,15,16] propose tensorproduct discretizations (some of them reducing to time-stepping schemes) and [8,19] introduce unstructured finite element discretizations in space and time.Here, we use a tensorproduct discretization since they allow for efficient numerical solvers and admit optimal stability, [20,21]; of course, also other schemes could be used instead.Our approach leads to a different discrete system as previous approaches, see Section 4.5, Remark 4.4 below.
Until recent it has been believed that a simultaneous discretization of time and space variables would be way too costly since problems in n + 1 dimension need to be solved, where n denotes the space dimension.This has changed somehow since it is nowadays known that space-time discretizations yield good stability properties, can efficiently be used for model reduction and can also be treated by efficient numerical solvers, see [12,15,[20][21][22][23][24][25][26], just to name a few papers in that direction.However, the issues of a suitable discretization and the question if the arising higher-dimensional problem can efficiently be solved remain.Of course, also for space-time approaches different from ours, there are also efficient numerical solvers known, see e.g.[2,27,28].

Model problem
We consider the following PDE-constrained optimal control problem.The state space Y consists of mappings y ∶ I × Ω → R, the control space U of functions u ∶ I × Ω → R. We are interested in determining a control u ∈ U and a corresponding state y ∈ Y that solve the following optimization problem: where the functions µ ∶ Ω → R, η ∶ I × Γ → R and y d ∶ Ω → R as well as a scalar λ > 0 are given.Moreover, R is a linear operator, whose role will be described below.We shall always assume that µ(x) > 0 a for all x ∈ Ω a.e..

Remark 1.2 (a)
We could easily extend to a cost function of the form The extension to inhomogeneous initial conditions y(0, x) = y 0 and other types of boundary conditions follows standard lines, e.g.[12] and Remark 2.7.(c) In the first preprint version of this paper, we considered box constraints for the control.In order to discuss the analysis concerning well-posedness and convergence in full detail, we decided to devote control constraints to future research.

Organization of the paper
The remainder of this paper is organized as follows.In Section 2, we recall and collect some preliminaries on PDE-constrained optimization problems in reflexive Banach spaces and on space-time variational formulations of parabolic PDEs.The space-time variational formulation of the optimal control problem under consideration is developed in Section 3. In particular, we derive necessary and sufficient optimality conditions.Section 4 is devoted to the space-time discretization of the PDE, the discretization of the control as well as of the adjoint problem.The latter one turns out to be much simpler in our space-time context than in the semi-discrete setting as we obtain a linear system whose matrix is just the transposed of the matrix appearing in the primal problem.The fully discretized optimal control problem is then solved numerically.We report on our numerical experiments in Section 5 and conclude by a summary, conclusions and an outlook in Section 6.

Preliminaries
Let us start by collecting some preliminaries that we will need in the sequel.

Optimal control problems
In this section, we recall the abstract functional analytic framework for optimal control problems in reflexive Banach spaces b which we will later apply within the space-time setting.The consideration of control and/or state constraints is devoted to future research.Remark 2.2 Note, that e(y, u) = 0 is an equation in the dual space Z ′ of Z. Since we consider reflexive Banach spaces, it holds Z ′′ ≅ Z.Therefore, the constraint is to be interpreted as ⟨e(y, u), z⟩ Z ′ ×Z = 0 for all z ∈ Z, where ⟨⋅, ⋅⟩ Z ′ ×Z is the duality pairing, and the adjoint state will be in Z.
A pair (y, u) ∈ Y × U is called local optimum of Problem 2.1 if for some neighborhood N (y, u) of (y, u); the pair is called global optimum of Problem 2.1 if (2.1) is satisfied for all (y, u) ∈ Y × U.
We will be investigating the well-posedness of such optimal control problems in a space-time variational setting.This requires first to study the well-posedness of the state equation e(y, u) = 0, namely the question if a unique state can be assigned to each admissible control.If so, one defines the control-to-state operator b Concerning the chosen model problem we will deal with real Hilbert spaces.However, we will not identify these Hilbert spaces with their dual spaces, which is the reason why we describe the general optimal control framework for reflexive Banach spaces.
Theorem 2.3 Let U ≠ ∅ and Ĵ ∶ U → R be a weakly lower semi-continuous function such that there exists a constant c > −∞ so that Ĵ(u) ≥ c for all u ∈ U.Then, (2.3) has at least one solution u.If Ĵ is in addition strictly convex, then the optimal solution is unique.
Necessary first order optimality conditions for optimal control problems are based upon the Euler-Lagrange equation Ĵ′ (u) = 0, e.g.[1].This, however, involves the derivative of Ĵ, which is often difficult to determine exactly.The well-known way-out is through the adjoint problem.In fact, if e y (Su, u) ∶ Y → Z ′ (the partial derivative of e(⋅, ⋅) w.r.t.y) is a bijection, then, Ĵ′ (u) = J u (Su, u) − e u (Su, u) * (e y (Su, u) * ) −1 J y (Su, u), for any u ∈ U, where e y (Su, u) * and e u (Su, u) * denote the adjoint operators of e y (Su, u) and e u (Su, u), respectively.In order to avoid the determination of the inverse of the adjoint e y (Su, u) * , one considers the adjoint equation e y (y, u) * z = −J y (y, u), (2.4) whose solution z ∈ Z is called adjoint state.Then, Theorem 2.4 (KKT system) Let u be a solution of (2.3) and y ∶= Su the related state.Then, there exists an adjoint state z ∈ Z, such that the following KKT system is satisfied for all (t, x) ∈ I × Ω a.e.: e(y, u) = 0, (2.6a) eu(y, u) * z = −Ju(y, u). (2.6c) The Lagrange function L ∶ Y × U × Z → R to Problem 2.1 reads L(y, u, z) ∶= J(y, u)+⟨z, e(y, u)⟩ Z×Z ′ Then, (2.6) can equivalently be written as ∇L(y, z, z) = 0.

Space-time variational formulation of parabolic problems
In order to detail the setting in Section 2.1 for the specific Problem 2.1 at hand, we review a variational formulation of the initial boundary value problem (1.4) in space and time, which yields the specific form of the state operator e(⋅, ⋅).
To this end, let H ∶= L 2 (Ω), G ∶= L 2 (Γ), V ∶= H 1 (Ω) and V ′ be the dual of V induced by the H-inner product.Then, we denote the Lebesgue-Bochner spaces by Moreover, denoting by ⟨⋅, ⋅⟩ V ′ ×V the duality pairing in space only, we obtain inner products and duality pairing in time and space as for the respective u and v and X ∈ {V ′ , H, V }, X ∈ {V ′ , H, V}, respectively.Then, we start by testing the first equation in (1.4) with functions z(t) ∈ V , t ∈ I a.e., integrate over time, perform integration by parts in space and insert the Robin boundary condition of (1.4).Denoting by a ∶ V × V → R the bilinear form in space, i.e., a(φ, ψ) ∶= (∇φ, ∇ψ for t ∈ I a.e.To obtain a variational formulation in space and time we integrate (2.7) in time and obtain The trial space for the the state y is a Lebesgue-Bochner space defined as As in [13,16], we choose the norms but other equivalent norms can also be considered.The test space reads For the well-posedness of (2.8) (see [12]), we need However, the definition of the cost function J in Problem 1.1 requires Thus, we can now detail the role of the linear mapping R, namely R ∶ H → V ′ (which could here also be just the identity).We introduce the bilinear form b ∶ Y × Z → R and the linear form so that (2.8) equivalently can be written as Obviously, (2.12) is a variational problem of the form (1.2), where the righthand side is a linear form in Z ′ for all u ∈ U.For later reference, it will be convenient to reformulate (2.12) in operator form.To this end, we define so that (2.12) reads By = Ru + h.If we define the differential operator in space as , δy(t)) dt, then we get the representation By = ẏ + Ay.

Well-posedness of the parabolic problem
The proof of the well-posedness of the variational form (2.12) for any given u ∈ U basically follows the lines of [11][12][13], namely by verifying the conditions of the Banach-Nečas theorem.For the Robin data we make the usual assumptions µ ∈ L ∞ (I × Γ) and η ∈ G = L 2 (I; G).The surjectivity is shown by proving the convergence of a Faedo-Galerkin approximation, [12, App.A].Infsup-condition and boundedness can be derived by detailing primal and dual supremizers.
Proposition 2.5 The problem (2.12) is well posed with Proof The proof closely follows the lines of [12, Thm.5.1], [13, Prop.1] and [16,Prop. 2.6].In fact, we can identify primal and dual supremizers for given z ∈ Z and y ∈ Y, respectively, as follows In addition, sy 2 Remark 2.6 Even though we have proven optimal stability and continuity, we will later also need the general case, in which we have Remark 2.7 (Inhomogeneous initial conditions) As already mentioned in Remark 1.2, we restrict ourselves to homogeneous initial conditions only for convenience of the presentation.In fact, for y 0 = 0, we would set Y ∶= L 2 (I; V ) ∩ H 1 (I; V ′ ), the test space would be Z ∶= L 2 (I; V ) × H and bilinear and linear forms read for y ∈ Y, H , yielding a state equation of the form (1.2).Hence, inhomogeneous initial conditions can be treated analogously, just the notation becomes a bit more heavy, [12].

Comparison with other space-time methods
As already mentioned in the introduction, our approach is somehow different as existing ones in the literature.We are now going to describe the differences concerning the formulation of the state equation in more detail.
Remark 2.8 (Differences to existing space-time methods) (a) In [3][4][5], Meidner, Neitzel and Vexler use (almost) the same trial space Y as in (2.9), namely Ỹ ∶= L 2 (I; V ) ∩ H 1 (I; V ′ ), but impose the initial condition y(0) = y 0 c in strong form.The arising problem is not of the form (1.2).In fact, the well-posedness does not follow from the Banach-Nečas theorem but with techniques from semigroup theory, [11].This requires y 0 ∈ V (y 0 ∈ H is the minimal requirement) and additional regularity, [3, Prop.2.1].In fact, the righthand side is required to be in L 2 (I; H) and the solution is in , which is significantly stronger than Y defined in (2.9).Moreover, treating the initial condition as in [3][4][5] allows to use Ỹ also as test space, which is another reason for the additional smoothness, but which yields a Galerkin discretization instead of a Petrov-Galerkin one, see below.(b) In [6,7], von Daniels, Hinze and Vierling use the same trial space Ỹ for the state equation as [3][4][5], but impose the initial condition in a weak sense, [6, (1.5)].Moreover, Ỹ is chosen also as test space (in a Galerkin spirit).(c) In the more recent paper, Langer, Steinbach, Tröltzsch and Yang use the same variational formulation as we do, [8].However, the initial condition is part of the definition of the trial space, which will be relevant for the adjoint problem.
c Recall, that we have chosen homogeneous initial conditions y0 = 0 only for simplicity of exposition, see Remark 2.7.
In all cases, there are differences to our approach in the derivation/formulation of the adjoint equation and the adjoint state to be described in the next section.

Space-Time Variational Optimal Control Problem
Next, we formulate the optimal control problem in the variational space-time setting by specifying the above abstract framework.To do so, we are now going to derive a space-time variational formulation of Problem 1.1.The state space is determined by the PDE, i.e., we choose Y defined in (2.9).In view of Remark 2.2 and recalling that Z ′′ ≅ Z, we are now in position to formulate the constraint in space-time variational form as follows ⟨e(y, u), z⟩ i.e., e(y, u) ∶= By−Ru−h ∈ Z ′ .Next, we can detail the control-to-state operator H , where λ > 0 is the regularization parameter.Proof Since Ĵ is non-negative there exists a constant c > −∞ so that Ĵ(u) ≥ c for all u ∈ U.It remains to show that Ĵ is weakly lower semi-continuous.Since Ĵ is easily seen to be strictly convex and continuous (by continuity of S and the norms), Theorem 2.3 proves the claim.

First order necessary optimality conditions
Adopting the previous notation, we start by detailing the Lagrange function Finally, for δu ∈ U, we have L u (y, u, z)[δu] = λ (u, δu) H − ⟨z, Rδu⟩ V×V ′ .Hence, we obtain the following first order optimality (KKT) system: Find (y, Let us further detail the gradient equation (3.4c).Denote by R * ∶ V → H the adjoint operator of R defined by (R * v, h) H ∶= ⟨Rh, v⟩ V ′ ×V for v ∈ V and h ∈ H.
Then, (3.4c) reads λ(u, δu) H −(R * z, δu) H = 0, which means that we can derive a relation of the optimal control ū and the optimal adjoint state z, namely Then, we can formulate the KKT conditions as follows.
Proposition 3.3 (Optimality (KKT) system) Let (y, u) ∈ Y × U be an optimal solution of Problem 1.1.Then, there exists an adjoint state z ∈ Z such that the following optimality system holds: ) or, in operator form .
Setting P ∶= RR * ∶ V → V ′ , inserting (3.5) into (3.4)yields the reduced first order optimality system for determining (y, or, in operator form From (3.4b) we see that the adjoint problem arises from the primal one by exchanging the roles of trial and test spaces -and by a different right-hand side, of course.
Remark 3.4 (Well-posedness of the adjoint problem) We recall from Section 2.2 that the primal problem (3.4a) is well-posed, see e.g.[11,12].This is a consequence of the Banach-Nečas theorem and the fact that the inf-sup condition (2.14) holds.Due to its specific form, this immediately implies well-posedness also of the adjoint problem (3.4b), even with the same inf-sup constant as for (3.4a) .
Remark 3.5 Due to the convexity of the objective function as well as the linearity of the state equation, the Problem 1.1 is convex.Hence, every solution of (3.3) is a global optimal solution of the problem, [1].
Theorem 3.6 (Well-posedness of the optimality system) The first order optimality system (3.8) is well-posed for all λ > 0 with For γ P ∶= P V→V ′ , we have Proof The fact LL −1 = L −1 L = I can be verified by straightforward calculations, as long as L −1 exists.This, in turn, boils down to the existence of C −1 .In order to show this, note that Hence, the spectrum σ(K) ⊂ R + 0 is contained in the non-negative reals.This implies (I +λ which proves the claim. Corollary 3.7 For the space-time variational formulation (2.12), we have so that the inf-sup-constant of the reduced optimality system (3.8) is at least λ 1+2λ .
Remark 3.8 For the optimal control ū, it holds that Remark 3.9 (Differences to existing space-time methods, continued) We continue Remark 2.8 with highlighting the differences to previous publications.(a) In [3][4][5], the adjoint problem is derived directly from the variational formulation, which means that the terminal condition is imposed in strong form.This necessarily implies that Z = Y is the space for the adjoint state.Moreover, the same high regularity requirements apply for the solution of the adjoint problem as for the primal one, [3, Prop.2.3].(b) In [6,7] the adjoint equation is derived by integration by parts in time, which is possible since Z = Y.As in [3][4][5], this implies high (and the same) regularity for y and z, [6, La. 3.2].(c) In [8] the adjoint problem and also the gradient equation is derived in strong form, which is then formulated in space-time variational form.This results in a coupled space-time system where primal and adjoint state have the same regularity.Moreover, since the initial condition is imposed in the primal trial space, the terminal condition is part of the definition of the adjoint trial space.In our case, it holds z ∈ Z, which allows weaker regularity for the adjoint state.The differences concerning discretization will be described in the next section.

Space-Time Discretization
In this section, we are going to describe a conforming discretization of the optimal control problem in space and time.We start by reviewing space-time Petrov-Galerkin methods for parabolic problems from [13,15,16] and will extend this to a full space-time discretization of the optimal control problem at hand.This leads us to a tensorproduct-type discretization w.r.t.time and space variables.Of course, the approach is not restricted to tensorproducts; for example, one could also use unstructured space-time finite elements as in [8].However, w.r.t.stability and efficient solution of the fully discretized problems, the tensporproduct approach turned out to be very promising, see also [18,20].

Petrov-Galerkin discretization of the PDE
We consider and construct finite-dimensional spaces Y δ ⊂ Y and Z δ ⊂ Z, where -for simplicity -we assume that n δ ∶= dim(Y δ ) = dim(Z δ ).The Petrov-Galerkin approximation to (2.12) amounts finding y δ ∈ Y δ such that (for given u ∈ U to be discretized below) We may think of δ = (∆t, h), where ∆t is the temporal and h the spatial mesh width.We recall, that there are several ways to select such discrete spaces so that the arising discrete problem is well-posed and stable in the sense of (1.3).An overview of conditionally and unconditionally stable variants can be found in [15,24,31].In [19] a finite element approach is described.Moreover, the authors of [13,16] show that linear ansatz and constant test functions w.r.t.time lead to the Crank-Nicolson time integration scheme for the special case of homogeneous Dirichlet boundary conditions if the right-hand side is approximated with the trapezoidal rule.A similar approach, but for the case of Robin boundary conditions, is briefly presented in the sequel, where we basically follow [15].It is convenient (and, as we explained above, also efficient from the numerical point of view) to choose the approximation spaces to be of tensorproduct form, with the temporal subspaces Our particular choice is as follows: The time interval I = (0, T ) is discretized according to where K ∈ N denotes the number of time steps, i.e., ∆t ∶= T K is the time step size.The temporal subspaces V ∆t , Q ∆t and the spatial subspace V h read with piecewise linear functions with the coefficient vector We are going to derive the arising linear system of equations for (4.1) with the stiffness matrix B δ ∈ R n δ ×n δ and the vectors (Ru) δ ∈ R n δ , h δ ∈ R n δ to be detailed next.To this end, we use the basis functions for the test space and obtain for = 0, ..., K − 1 and j = 1, ..., n h b(y δ , ξ ⊗ φ j ) = Moreover, it holds (Ru) δ ∶= [r j (u)] =0,...,K−1; j=1,...,n h ∈ R n δ and h δ ∶= [h j ] =0,...,K−1; j=1,...,n h ∈ R n δ , where r j (u) =∶ ⟨Ru, ξ ⊗ φ j ⟩ V ′ ×V and h j =∶ h(ξ ⊗φ j ) = (η, ξ ⊗φ j ) G .The control u will be discretized below.In order to derive a compact form, we introduce a number of matrices Based upon this, we obtain Remark 4.1 There are several uniformly inf-sup stable discretizations available, see e.g.[31].For the above case, there is even an optimal discretization, i.e., where the inf-sup constant is unity, [13,16].In any case, we have (and shall assume in the sequel) that (1.3) holds uniformly in δ → 0, possibly with discrete norms ⋅ Y δ , ⋅ Z δ .

Discretization of the control
So far, we did not yet discretize the control u ∈ U = L 2 (I; H).A natural choice seems to be U δ ∶= Q ∆t ⊗ V h = Z δ , but other choices are possible as well.Thus, we consider The next step is to detail (Ru) δ based upon this discretization.We obtain for = 0, ..., K − 1 and j = 1, ..., n h where M time ∆t ∈ R K×K is the identity (for piecewise constants) as introduced above and Putting everything together, the discretized version of the primal problem (4.4) reads For later reference, we note that Remark 4.2 We stress the fact that we could use any other suitable discretization of the control, both w.r.t.time and space, in particular including adaptive techniques or a discretization arising from implicitly utilizing the optimality conditions and the discretization of the state and adjoint equation, see e.g.[32].

Petrov-Galerkin discretization of the adjoint problem
We are now going to derive the discrete form of the adjoint problem (3.6b) or (3.4b).Since this problem involves the adjoint operator, it seems reasonable to use the same discretization, so that the (matrix-vector form of the) discrete problem amounts finding z δ ∈ R n δ such that (for given y δ ∈ Y δ ) with d δ (y δ ) ∈ R n δ , g δ ∈ R n δ , i.e., d(y δ , δy δ )+b(δy δ , z δ ) = g(δy δ ) for all δy δ ∈ Y δ .Note, that the stiffness matrix is the transposed of the stiffness matrix of the primal problem.The unknown coefficient vector Let us now detail the remaining terms .., K and j = 1, ..., n h , we get Further, we abbreviate the coefficient vector of y δ (T ) in terms of the basis Φ h as y K δ ∶= [y K 1 , ..., y K n h ] ⊺ so that by (4.3) we obtain for = 1, ..., K and j = 1, ..., n h e For the common case R = I, we get ni,j = (φi, φj ) H , i.e., N space where δ ,K denotes the discrete Kronecker delta and we introduce We note that it holds θ (T ) = δ ,K , i.e., θ (T ) = 0 for = 1, ..., K − 1 and θ K (T ) = 1.With this notation at hand, the fully discretized version of the adjoint problem (4.8) reads

Petrov-Galerkin discretization of the gradient equation
In order to obtain a discrete version of the gradient equation we test (3.4c) with the basis functions of U δ , namely λ (u δ , δu δ ) H − ⟨Rδu δ , z δ ⟩ V ′ ×V = 0 for all δu δ ∈ U δ .Recalling the discretizations (4.5) and (4.9) of u δ and z δ , respectively, we obtain for = 0, ..., K − 1 and j = 1, ..., n h Then, the discrete version of the gradient equation (3.4c) reads We note, that M δ is a square mass matrix, i.e., invertible.For R = I, we have

The discrete optimality system
We can now put all pieces together and detail the discrete version of the first order optimality system (3.4),namely Recalling (4.7), (4.10) and (4.11), the discrete first order optimality system (4.12) can be written in matrix form as , where all involved matrices have tensorproduct structure.In view of (4.11), i.e, λM δ u δ = M ⊺ δ z δ , we can easily eliminate the variable u δ and obtain the reduced system which is a discretized version of (3.8).All involved matrices are tensorproducts and for our choice, we have Theorem 4.3 (Well-posedness of the discrete optimality system) Assume that the discrete inf-sup condition (1.3) holds.Then, the discrete first order optimality system (4.12) is well-posed for all λ > 0 and Proof We can apply Theorem 3.6 for the reduced discrete optimality system (4.13).
Following the lines of its proof, (3.9) ensures that so that the reduced system is uniformly invertible.Adapting Remark 3.8 for the discrete case yields u δ ≤ 1 λ L −1 δ ( h δ + g δ ).
Remark 4.4 (Differences to existing space-time methods, continued) We continue Remarks 2.8 and 3.9 with highlighting the differences to previous publications, now concerning the discretization.
To summarize our approach, we start by an optimally stable Petrov-Galerkin discretization of the state equation based upon tensorproducts, which can be chosen to be equivalent to a Crank-Nicolson time stepping method.In a second step, we chose an appropriate discretization of the control.This automatically yields a Petrov-Galerkin discretization of the adjoint equation and the gradient equation.Putting everything together results in a stable discretization of the optimality system along with a priori and a posteriori error estimates.(a) [3][4][5] suggest semi-discretizations for the state by discontinuous Galerkin methods.Since there Z = Y, this can also be used for the adjoint state.Stability and approximation results are then given.(b) [6,7] uses a Petrov-Galerkin method with temporal discontinuous trial functions and continuous test functions for primal and adjoint problems.The control is not discretized, but treated in a variational manner.(c) In [8], the derivation yields a 2×2 saddle point problem for primal and dual state similar to (3.7), which is discretized in a similar fashion as in our approach for the primal state equation.This means that also here primal and dual states use discretizations of the same order, which is different from our approach.Moreover, [8] uses an unstructured space-time discretization, whereas we suggest a tensorproduct approach.However, the tensorproduct discretization was here mainly chosen to allow the use of efficient solvers and can easily be replaced by other discretizations as well.

Error analysis
Corollary 4.5 (A priori estimate) Applying the Xu-Zikatanov Lemma [14] yields a quasi-best approximation statement, i.e., Using the above described discretization for Y δ , Z δ and U δ , we get an error of order O(max{h, ∆t}) in the prescribed norms, which can easily be improved by using higher order discretizations (if the solution is sufficiently regular).
The latter estimate allows us to use residual-based error estimates, which are e.g.particularly relevant for the reduced basis method in the case of parameter-dependent problems, see e.g.[33].

Discretization of the cost function
Finally, we detail the space-time discretization of the cost function, i.e., where We solve the optimal control problem by numerically solving the reduced 2 × 2 block linear system arising from the optimality system (4.13).

Numerical Results
In this section, we present some results of our numerical experiments.We follow two main goals: (1) We make quantitative comparisons concerning the inf-sup-stability of the optimality system and (2) we compare the above presented space-time variational approach with the standard semi-discretization (see also [34] for such comparisons for parabolic problems).We do not compare with other state-of-the-art methods as we are mainly interested in investigating the effect of simultaneous space-time discretization.In order to make the comparison fair, we chose an all-at-once method for the semi-discrete framework so that the reduced discrete optimality system is built in a similar manner in both approaches.Moreover, we used the Crank-Nicolson scheme for the semi-discrete problem since our choice for trial and test spaces for the primal problem is equivalent to this time-stepping scheme, [13,16].Thus, in the semidiscrete setting, primal and dual problems amount for a comparable number of operations, with a stability issue for the dual problem, of course.Note that, in the semi-discrete case, the Crank-Nicolson scheme for the adjoint problem (i.e., ∂ t z + ∆z = 0; z(T ) = y(T ) − y d ) is backward in time.
All results were obtained with Matlab R2020b on a machine with a quad core with 2.7 GHz and 16 GB of RAM.

Discrete inf-sup constant
We start by computing the discrete inf-sup constant of the optimality system and compare that with the bound (3.10) in Corollary 3.7.We report the data for a 1d example on I × Ω = (0, 1) × (−1, 1) with µ(x) = x 2 + 0.1 for n h = 40, K = 80, but stress that the results are representative also for other examples.
First, we investigate the dependence of the inf-sup constant w.r.t. the regularization parameter λ.In Figure 1, we show the computed discrete inf-sup constant in comparison with the lower bound (2 + 1 λ ) −1 .We observe the same quantitative behaviors of both curves and see that our bound seems to be almost sharp for increasing values of λ.In particular, we see the optimality (inf-sup is unity) already for λ = 10 −2 and larger.For small values of λ, the bound is too pessimistic by almost two orders of magnitude.
Next, we fix λ = 10 −2 and investigate the dependence of the discretization.The results are presented in Figure 2. On the left, we fix K = 60 and vary n h , whereas on the right, we choose n h = 60 and modify K.We see that the lower f Doing so, we have that y d,h ∈ V h (i.e., a piecewise linear approximation), which is useful for our experiments.We could of course have also used a piecewise constant approximation.
namely n h = 101 and n h = 1001.First, we observe that the overall performance is independent of the choice of λ.Next, we see that both methods converge to the same value of the objective function as K increases.However, the huge benefit of the space-time setting shows off, namely that we reach an almost optimal value also for very coarse temporal discretizations, which offers significant computational savings.It is not surprising that this effect is due to the  improved stability of the space-time method as we can also see in Figure 4, where we depict the control for different values of K for λ = 10 −3 and n h = 101.We can clearly observe the stability issues for the semi-discrete approach in the left column, which do not appear in the space-time context.

Higher dimensional examples
A possible criticism of the space-approach is the fact that the size of the optimality system might significantly grow with increasing space dimension.Hence, we realized both approaches also in 2d and 3d and report the results in the 2d case here.We do not monitor CPU-time comparisons, but refer e.g. to [20,35], where such comparisons have been done for space-time variational formulations of the heat and wave equation, respectively.It was shown there, that appropriate tensorproduct solvers in fact yield competitive CPU times for the arising space-time systems.The adaptation of those approaches to the optimality system (4.13) is subject to ongoing work, see also Remark 5.1 below.
Hence, we are going to report results for I ×Ω ∶= (0, 1)×(0, 1) 2 and boundary data µ(x) ∶= 0.25 cosh(xy)+0.25 along with a compatible function η.As desired state, we choose y d (x) = tanh(10(x − 0.5)(y − 0.5)).As in the 1d case, we compare the values of the objective function, see Figure 5.The overall behavior is very similar to the 1d case, namely we get a significant improvement of the space-approach over the semi-discrete one for small number of time steps K. Note, that here we use a linear scale for the horizontal axis as opposed to Figure 3, where the results are shown in logarithmic scale.Moreover, for large values of λ, we observe the necessity of a sufficiently fine spatial discretization for both methods.
Remark 5.1 With the chosen all-at-once approach, we get very similar CPU times for both methods.As already pointed out earlier, a runtime comparison of best possible schemes is not the aim of this paper.Not using efficient tensorproduct solvers yields that the limiting factor is the memory -in both cases.

Summary, conclusions and outlook
We have considered a space-time variational formulation for a PDE-constrained optimal control problem.Our first-optimize-then-discretize approach follows the abstract functional analytic framework of such problems, which is then detailed for the space-time variational method.This can be summarized as follows: • Well-posed space-time variational formulation of the state equation.This yields different trial and test spaces (Petrov-Galerkin style) of minimal regularity;  • Formulation of the optimal control problem in the arising spaces, definition of the Lagrange function and derivation of KKT conditions.This yields the adjoint and gradient equations in natural spaces with minimal regularity requirements; • Derivation of (necessary and sufficient) optimality conditions and optimality system (still in the infinite-dimensional setting); • LBB-stable discretization of the optimality system.In special cases, this can be chosen to be equivalent to a Crank-Nicolson semi-discrete discretization, which allows quantitative numerical comparisons.Moreover, we reported on numerical experiments showing that space-time methods yield the same value of the objective function for significantly smaller number of unknowns.Since the CPU-times for the same number of unknowns turned out to be similar, this offers potential for significant speedup.
Topics for future research include control and state constraints, other types of PDEs for the constraints, improved schemes for solving the optimality system, adaptive discretization of the control, etc.Also efficient solvers that explicitly exploit the Kronecker structures of arising operators should be investigated.Finally, the above setting seems to be a very good starting for investigating model reduction, e.g.[16].

Problem 1 . 1 (
Model problem in classical form) Let I = (0, T ) ⊂ R, 0 < T < ∞ and Ω ⊂ R n be a bounded open Lipschitz domain.The normal vector of ∂Ω ∶= Γ is denoted by ν(x) ∈ R n for all x ∈ Γ.

Problem 2 . 1
Let Y, U, Z be some real reflexive Banach spaces.Given an objective function J ∶ Y × U → R and the state operator e ∶ Y × U → Z ′ , we consider the constrained optimization problem min (y,u)∈Y×U J(y, u) subject to (s.t.) the constraint e(y, u) = 0.