1 Introduction

Since the pioneering theoretical works of Russell [37] and Lions [22], the numerical resolution of controllability problems for PDEs has faced a range of challenging difficulties, which have been addressed with a number of sophisticated techniques (see, e.g., [10, 11, 15, 29, 32, 46], among many others). All of these methods require the numerical approximation of suitable PDEs, a task carried out with classical methods of numerical analysis, mainly finite differences and finite elements. As a consequence, the available methods for solving controllability problems for PDEs numerically suffer from the well-known curse of dimensionality. In plain words, the curse of dimensionality refers to the fact that doubling the number of degrees of freedom in each spatial direction increases the problem size by a factor of \(2^d\), with d being the spatial dimension. This makes classical numerical methods for solving PDEs impractical when the spatial dimension is large.

On the other hand, in recent years there has been an intense research effort in developing numerical methods for solving PDEs with techniques from machine learning (ML) and artificial intelligence (AI). The main motivation for exploring these new techniques for approximating PDEs is not to find methods that outperform classical methods (finite differences, finite elements, finite volumes, etc.) in low spatial dimensions (\(d=1,2,3\)), but to solve numerically high-dimensional PDEs (\(d>3\)), where the above-mentioned classical methods are blocked by the curse of dimensionality.

Examples of fields where high-dimensional PDEs arise include, among others: the radiative transport equation (\(d\ge 5\)); kinetic models, e.g., the Boltzmann kinetic equations (\(d=6\)); computational finance, e.g., the nonlinear Black–Scholes equation for pricing derivatives (\(d\gg 1\)); computational quantum chemistry, e.g., the nonlinear Schrödinger equation in the quantum many-body problem (\(d\gg 1\)); and game theory, e.g., the Hamilton–Jacobi–Bellman equation in dynamic programming [17]. Control problems for parametric PDEs are another field where high dimension plays a crucial role [20, 25, 26, 47].

Among the different deep-learning-based methods that have been recently proposed for the numerical approximation of PDEs, it is worth mentioning the following: physics-informed neural networks (PINNs) [34], the deep Ritz method [41], methods based on the Feynman–Kac formula [5], and methods based on the solution of backward stochastic differential equations [17]. See also [4, 6] for recent reviews. Although these methods have shown excellent performance at the level of numerical simulation, an error analysis theory for them is essentially missing.

The above deep-learning-based numerical schemes can be adapted to solve numerically not only forward problems for PDEs, but also a number of related problems involving PDEs, such as inverse problems [27, 34] or random PDEs [43], among others.

To the best of the authors' knowledge, the numerical resolution of controllability problems for PDEs by means of ML has not been addressed so far.

The main goal of this paper is to explore the use of PINNs to approximate numerically the solution of controllability problems for PDEs. More precisely, a PINNs-based algorithm that applies to both linear and nonlinear PDEs is proposed in Sect. 2. Then, inspired by [27], error estimates for the so-called generalization error are provided (Theorem 3.1). The proof of this result is based on energy estimates for the solution of the PDE under consideration and on observability inequalities for its associated adjoint system. From these error bounds, convergence of the control and state obtained with the proposed method to the control and state of the continuous problem is established (Corollary 3.1). For the sake of clarity, in this preliminary work we focus on the case of Dirichlet boundary control; the ideas and methods proposed here extend in a straightforward manner to other types of control actions. Also, for pedagogical reasons, instead of presenting an abstract general framework, the methods and proofs are first described for the two emblematic systems of the wave and heat equations and then extended to more general PDE systems. Preliminary numerical experiments on three different PDEs illustrate the performance of the proposed method. More precisely, the accuracy of the method is first tested on a simple model for the wave equation for which an analytical solution is available. In a second experiment, a high-dimensional controllability problem for the heat equation is considered. The third experiment concerns a semilinear PDE.

As the title indicates, this work is just a first step toward the numerical resolution of controllability problems for high-dimensional PDEs and so further research is needed to achieve a deeper understanding of the type of problems considered here.

Finally, for the sake of completeness, it is worth pointing out that the connection between different deep neural network architectures and controlled ordinary differential equations has recently been studied in [9]. This is a novel research line that also includes the analysis of the so-called neural differential equations with techniques coming from continuous control theory [1, 12, 13, 35, 36].

2 Problem Setup and Description of the PINNs Algorithm

From now on in this paper, \(\varOmega \subset {\mathbb {R}}^d\), \(d\in {\mathbb {N}}\), denotes a bounded domain with a smooth boundary decomposed into two disjoint parts \(\varGamma _D \) and \(\varGamma _C\). For any positive time T, we set \(Q_T:=\varOmega \times \left( 0,T\right) \). As usual, \(\varDelta :=\sum _{j=1}^d\frac{\partial ^2}{\partial x_j^2}\) stands for the Laplace operator.

2.1 Wave Equation

Given initial data \((y^0,y^1)\) in suitable function spaces, the null controllability problem for the wave equation amounts to finding a positive time \(T>0\) and a control function \(u(x,t)\) such that the solution \(y(x,t)\) of the system

$$\begin{aligned} \begin{array}{ll} y_{tt}-\varDelta y=0, &\quad {\text {in}}\; Q_T, \\ y(x,0)=y^0(x), &\quad {\text {in}}\; \varOmega, \\ y_t(x,0)=y^1(x), &\quad {\text {in}}\; \varOmega, \\ y(x,t)=0, &\quad {\text {on}}\; \varGamma _D\times \left( 0,T\right), \\ y(x,t)=u(x,t), &\quad {\text {on}}\; \varGamma _C\times \left( 0,T\right) \end{array} \end{aligned}$$
(1)

satisfies

$$\begin{aligned} y(x,T)=y_t(x,T)=0,\quad x\in \varOmega . \end{aligned}$$
(2)

It is well known [22] that if the domain \(\varOmega \) satisfies the geometric control condition (GCC) introduced by Bardos, Lebeau, and Rauch [3] and if \(\left( y^0, y^1\right) \in L^2(\varOmega )\times H^{-1}(\varOmega )\), then, for T large enough, problem (1)–(2) has a solution \(u\in L^2\left( \varGamma _C\times (0, T)\right) \).

The PINNs approach for solving direct and inverse problems for PDEs [34] is next adapted to approximate numerically the control \(u(x,t)\) of problem (1)–(2). Roughly speaking, in the PINNs approach the solution is approximated by a neural network, and the equations are imposed, in a least-squares sense, at a collection of nodal points. In machine-learning language, the PINNs approach is composed of the following four main steps: (1) design an artificial neural network \({\hat{y}}\left( x,t;\varvec{\theta }\right) \) as a surrogate of the true solution y(x,t); (2) select a training set on which the neural network is trained; (3) define an appropriate loss function that accounts for the residuals of the PDE and of the initial, boundary, and final conditions; and (4) train the network by minimizing the loss function defined in the previous step. The training process yields optimal parameters of the neural network \({\hat{y}}\left( x,t;\varvec{\theta }\right) \), which are eventually used to produce predictions of the state y(x,t) and of the control u(x,t); the latter is approximated by the trace of \({\hat{y}}\left( x,t;\varvec{\theta }\right) \) on the boundary \(\varGamma _C\). Next, we give details on these steps:

Step 1: Neural Network. Among different possibilities, we consider a deep feedforward neural network (also known in the literature as a multilayer perceptron (MLP)) with \(d+1\) input channels \(\varvec{x}=(x,t)\in {\mathbb {R}}^{d+1}\) and a scalar output \({\hat{y}}\) (see Fig. 1). More specifically, \({\hat{y}}\left( x,t;\varvec{\theta }\right) \) is constructed as

$$\begin{aligned} \begin{array}{ll} \text {input layer:} & {\mathcal {N}}^0(\varvec{x})=\varvec{x}=(x,t)\in {\mathbb {R}}^{d+1}, \\ \text {hidden layers:} & {\mathcal {N}}^{\ell }(\varvec{x})=\sigma \left( \varvec{W}^{\ell }{\mathcal {N}}^{\ell -1}(\varvec{x})+\varvec{b}^{\ell }\right) \in {\mathbb {R}}^{N_{\ell }}, \quad \ell = 1, \dots , L-1, \\ \text {output layer:} & {\hat{y}}\left( \varvec{x};\varvec{\theta }\right) = {\mathcal {N}}^L(\varvec{x})=\varvec{W}^{L}{\mathcal {N}}^{L -1}(\varvec{x})+\varvec{b}^{L}\in {\mathbb {R}}, \end{array} \end{aligned}$$
(3)

where

  • \({\mathcal {N}}^{\ell }:{\mathbb {R}}^{N_{\ell -1}}\rightarrow {\mathbb {R}}^{N_{\ell }}\) is the \(\ell \)-th layer, with \(N_{\ell }\) neurons,

  • \(\varvec{W}^{\ell }\in {\mathbb {R}}^{N_{\ell }\times N_{\ell -1}}\) and \(\varvec{b}^{\ell }\in {\mathbb {R}}^{N_{\ell }}\) are, respectively, the weights and biases so that \(\varvec{\theta }=\left\{ \varvec{W}^{\ell }, \varvec{b}^{\ell }\right\} _{1\le \ell \le L}\) are the parameters of the neural network, and

  • \(\sigma \) is an activation function, which acts component-wise. Throughout this paper, we consider smooth activation functions such as hyperbolic tangent \(\sigma (s)=\tanh (s)\), with \(s\in {\mathbb {R}}\).

Fig. 1
figure 1

Illustration of a fully connected deep feedforward neural network with two input channels \(\varvec{x}=(x,t)\), two hidden layers (each one with 3 neurons) and a scalar output \({\hat{y}}(x,t;\varvec{\theta })\). Input data pass through the net by following (3). The set of free parameters of the network is denoted by \(\varvec{\theta }\)
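For concreteness, the following is a minimal sketch of the architecture (3) in PyTorch (an implementation choice made here purely for illustration; the experiments of Sect. 4 rely on the DeepXDE library, see Sect. 6). The default width and depth match the networks used in Sect. 4; all names are ours.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Feedforward network (3): input (x, t) in R^{d+1}, scalar output y_hat."""

    def __init__(self, d: int, width: int = 50, depth: int = 4):
        super().__init__()
        sizes = [d + 1] + [width] * depth + [1]
        self.linears = nn.ModuleList(
            nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)
        )
        for lin in self.linears:
            nn.init.xavier_uniform_(lin.weight)  # Glorot uniform, as in Sect. 4
            nn.init.zeros_(lin.bias)

    def forward(self, xt: torch.Tensor) -> torch.Tensor:
        z = xt                          # input layer: N^0(x) = (x, t)
        for lin in self.linears[:-1]:
            z = torch.tanh(lin(z))      # hidden layers: sigma(W^l N^{l-1} + b^l)
        return self.linears[-1](z)      # output layer: affine, no activation
```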

Step 2: Training Dataset. A dataset \({\mathcal {T}}\) of scattered data points is selected in the interior of the domain, \({\mathcal {T}}_{\text {int}}\subset Q_T\), and on the boundaries, \({\mathcal {T}}_{\varGamma _D} \subset \varGamma _D\times (0, T)\), \({\mathcal {T}}_{t=0} \subset \varOmega \times \left\{ 0\right\} \), \({\mathcal {T}}_{t=T}\subset \varOmega \times \left\{ T\right\} \). Thus, \({\mathcal {T}}={\mathcal {T}}_{\text {int}}\cup {\mathcal {T}}_{\varGamma _D}\cup {\mathcal {T}}_{t=0}\cup {\mathcal {T}}_{t=T}\) (see Fig. 2 for an illustration). The number of points in \({\mathcal {T}}_{\text {int}}\) is denoted by \(N_{\text {int}}\). Analogously, \(N_{b}\) is the number of points on the boundary \(\varGamma _D\times (0,T)\), and \(N_{0}\) and \(N_{T}\) stand for the numbers of points in \({\mathcal {T}}_{t=0}\) and \({\mathcal {T}}_{t=T}\), respectively. The total number of collocation nodes is denoted by N, and hereafter we write \({\mathcal {T}}_N\) instead of \({\mathcal {T}}\) to indicate clearly the number of points N used.

Fig. 2
figure 2

Illustration of a training dataset (based on Sobol points) in the domain \(Q_2=(0,1)\times (0,2)\). Interior points are marked with circles; boundary points are shown in blue
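A possible construction of such a training set is sketched below with SciPy's quasi-Monte Carlo module, for the one-dimensional wave problem of Sect. 4.1 (the function name and point counts are illustrative assumptions):

```python
import numpy as np
from scipy.stats import qmc

def training_set(T=2.0, n_int=10000, n_b=100, n_0=100, n_T=100, seed=0):
    """Scattered collocation points for Q_T = (0,1) x (0,T), as in Fig. 2.
    Returns rows (x, t) for the interior, Gamma_D, {t=0} and {t=T} sets."""
    # interior points: a scrambled Sobol sequence scaled to (0,1) x (0,T)
    pts = qmc.Sobol(d=2, scramble=True, seed=seed).random(n_int)
    T_int = qmc.scale(pts, [0.0, 0.0], [1.0, T])

    # initial and final time slices {t = 0} and {t = T}
    xs = qmc.Sobol(d=1, scramble=True, seed=seed + 1)
    T_t0 = np.hstack([xs.random(n_0), np.zeros((n_0, 1))])
    T_tT = np.hstack([xs.random(n_T), np.full((n_T, 1), T)])

    # Dirichlet boundary Gamma_D x (0,T); here Gamma_D = {x = 0}.
    # No points are placed on the control boundary Gamma_C = {x = 1}.
    t_b = T * qmc.Sobol(d=1, scramble=True, seed=seed + 2).random(n_b)
    T_GammaD = np.hstack([np.zeros((n_b, 1)), t_b])
    return T_int, T_GammaD, T_t0, T_tT
```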

Step 3: Loss Function. A weighted sum of the squared \(L^2\) norms of the residuals of the equation and of the boundary, initial, and final conditions is used as the loss function to be minimized during the training process. It is composed of the following six terms: given a neural network approximation \({\hat{y}}\) (as constructed in (3)), define

$$\begin{aligned} {\mathcal {L}}_{\text {int}}\left( \varvec{\theta };{\mathcal {T}}_{\text {int}}\right)&= \sum _{j=1}^{N_{\text {int}}} w_{j,\text {int}}\vert {\hat{y}}_{tt} (\varvec{x}_j;\varvec{\theta })- \varDelta {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{\text {int}}, \\ {\mathcal {L}}_{\varGamma _D}\left( \varvec{\theta };{\mathcal {T}}_{\varGamma _D}\right)&= \sum _{j=1}^{N_{b}} w_{j, b}\vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{\varGamma _D}, \\ {\mathcal {L}}_{t=0}^{\text {pos}}\left( \varvec{\theta };{\mathcal {T}}_{t=0}\right)&= \sum _{j=1}^{N_{0}} w_{j, 0}\vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) - y^0(\varvec{x}_j) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=0}, \\ {\mathcal {L}}_{t=0}^{\text {vel}}\left( \varvec{\theta };{\mathcal {T}}_{t=0}\right)&= \sum _{j=1}^{N_{0}} w_{j, 0}\vert {\hat{y}}_t(\varvec{x}_j;\varvec{\theta }) - y^1(\varvec{x}_j) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=0}, \\ {\mathcal {L}}_{t=T}^{\text {pos}}\left( \varvec{\theta };{\mathcal {T}}_{t=T}\right)&= \sum _{j=1}^{N_{T}} w_{j, T}\vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=T}, \\ {\mathcal {L}}_{t=T}^{\text {vel}}\left( \varvec{\theta };{\mathcal {T}}_{t=T}\right)&= \sum _{j=1}^{N_{T}} w_{j, T}\vert {\hat{y}}_t(\varvec{x}_j;\varvec{\theta }) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=T}, \end{aligned}$$

where \( w_{j,\text {int}}\), \(w_{j, b}\), \(w_{j, 0}\), and \(w_{j, T}\) are the weights of the quadrature rules.

The loss function used for training is

$$\begin{aligned} {\mathcal {L}}\left( \varvec{\theta };{\mathcal {T}}\right) ={}& {\mathcal {L}}_{\text {int}}\left( \varvec{\theta };{\mathcal {T}}_{\text {int}}\right) + {\mathcal {L}}_{\varGamma _D}\left( \varvec{\theta };{\mathcal {T}}_{\varGamma _D}\right) \\ &+ {\mathcal {L}}_{t=0}^{\text {pos}}\left( \varvec{\theta };{\mathcal {T}}_{t=0}\right) + {\mathcal {L}}_{t=0}^{\text {vel}}\left( \varvec{\theta };{\mathcal {T}}_{t=0}\right) + {\mathcal {L}}_{t=T}^{\text {pos}}\left( \varvec{\theta };{\mathcal {T}}_{t=T}\right) + {\mathcal {L}}_{t=T}^{\text {vel}}\left( \varvec{\theta };{\mathcal {T}}_{t=T}\right) . \end{aligned}$$
(4)

Notice that no boundary condition is imposed on \(\varGamma _C\). As is usual in the field of machine learning, all the derivatives involved in the loss function are computed by automatic differentiation [2].
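Continuing the PyTorch illustration of Step 1, the assembly of (4) and of the derivatives it involves can be sketched as follows (the collocation tensors, the data callables `y0`, `y1`, and the quadrature weights passed in are assumptions of this sketch):

```python
import torch

def dt(y, xt):
    """Time derivative by automatic differentiation (t is the last column)."""
    g = torch.autograd.grad(y, xt, torch.ones_like(y), create_graph=True)[0]
    return g[:, -1:]

def laplacian(y, xt, d):
    """Sum of second derivatives in the d spatial directions."""
    g = torch.autograd.grad(y, xt, torch.ones_like(y), create_graph=True)[0]
    return sum(
        torch.autograd.grad(g[:, i:i + 1], xt, torch.ones_like(y),
                            create_graph=True)[0][:, i:i + 1]
        for i in range(d)
    )

def loss(model, T_int, T_D, T_0, T_T, y0, y1, w_int, w_b, w_0, w_T, d=1):
    """Loss (4) for the wave equation on collocation tensors of rows (x, t)."""
    xt = T_int.requires_grad_(True)
    y = model(xt)
    res = dt(dt(y, xt), xt) - laplacian(y, xt, d)      # y_tt - Laplacian(y)
    L_int = (w_int * res ** 2).sum()
    L_D = (w_b * model(T_D) ** 2).sum()                # y = 0 on Gamma_D
    xt0 = T_0.requires_grad_(True)
    yi = model(xt0)
    L_0p = (w_0 * (yi - y0(xt0[:, :d])) ** 2).sum()           # y(., 0) = y^0
    L_0v = (w_0 * (dt(yi, xt0) - y1(xt0[:, :d])) ** 2).sum()  # y_t(., 0) = y^1
    xtT = T_T.requires_grad_(True)
    yf = model(xtT)
    L_Tp = (w_T * yf ** 2).sum()                       # y(., T) = 0
    L_Tv = (w_T * dt(yf, xtT) ** 2).sum()              # y_t(., T) = 0
    return L_int + L_D + L_0p + L_0v + L_Tp + L_Tv
```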

Step 4: Training Process. The final step of the PINN algorithm amounts to minimizing (4), i.e.,

$$\begin{aligned} \varvec{\theta }^*= \arg \min _{\varvec{\theta }} {\mathcal {L}}\left( \varvec{\theta };{\mathcal {T}}\right) . \end{aligned}$$
(5)

The approximation \({\hat{u}}(x,t;\varvec{\theta }^*)\) of the control \(u(x,t)\) is then obtained as the restriction of \({\hat{y}}(x,t;\varvec{\theta }^*)\) to the boundary \(\varGamma _C\), i.e.,

$$\begin{aligned} {\hat{u}}(x,t;\varvec{\theta }^*)= {\hat{y}}(x,t;\varvec{\theta }^*),\quad x\in \varGamma _C, \text { } 0\le t\le T. \end{aligned}$$
(6)

See Fig. 3 for a schematic diagram of the proposed algorithm.

Fig. 3
figure 3

PINN algorithm for approximating the exact state and control of the wave equation. The neural network \({\hat{y}}\left( x,t;\varvec{\theta }\right) \) is required to satisfy, in the least-squares sense, the PDE, the initial conditions, the boundary condition, and the exact controllability conditions. Then, the residual on the training points, \({\mathcal {L}}\left( \varvec{\theta };{\mathcal {T}}\right) \), is minimized to obtain the optimal set of parameters \(\varvec{\theta }^*\) of the neural network. This yields the PINN exact state \({\hat{y}}\left( x,t;\varvec{\theta }^*\right) \). Finally, the PINN exact control \({\hat{u}}\left( x,t;\varvec{\theta }^*\right) \) is obtained as the trace of the PINN state on the boundary control region \(\varGamma _C\)

Remark 2.1

Notice that the PINN algorithm proposed above is, in spirit, along the same lines as the one considered in [30], where an error functional is minimized in the least-squares sense. As a consequence, if this error functional reaches the value zero, then the controllability condition is satisfied. A major difference with respect to classical numerical methods for the control of PDEs is that the PINN-based approach is mesh-free, as it does not require a (finite element) mesh for the numerical approximation. Moreover, the function used for the numerical approximation is a neural network, as opposed to the (piecewise) polynomials that are the usual model of choice.

Remark 2.2

As is well known, the different terms that appear in the loss function (4) do not, in general, have the same strength. At the practical level, this difficulty may be overcome by introducing additional weighting parameters in front of those terms. These would be new hyperparameters that the machine-learning-based algorithm has to tune. It is clear that the introduction of these parameters does not affect the convergence results of the next section.

2.2 Heat Equation

Similar to the case of the wave equation, given an initial datum \(y^0\) in a suitable function space, the null controllability problem for the heat equation amounts to finding a positive time \(T>0\) and a control function \(u(x,t)\) such that the solution \(y(x,t)\) of the system

$$\begin{aligned} \begin{array}{ll} y_{t}-\varDelta y=0, &\quad {\text {in}}\; Q_T, \\ y(x,0)=y^0(x), &\quad {\text {in}}\; \varOmega, \\ y(x,t)=0, &\quad {\text {on}}\; \varGamma _D\times \left( 0,T\right), \\ y(x,t)=u(x,t), &\quad {\text {on}}\; \varGamma _C\times \left( 0,T\right) \end{array} \end{aligned}$$
(7)

satisfies

$$\begin{aligned} y(x,T)=0,\quad x\in \varOmega . \end{aligned}$$
(8)

It is well known [21] that if \(y^0\in L^2(\varOmega )\), then, for any \(T>0\), problem (7)–(8) has a solution \(u\in L^2\left( \varGamma _C\times (0, T)\right) \).

The numerical approximation of problem (7)–(8) follows the same steps 1–4 as in the case of the wave equation. The only element to be modified is the loss function, which in this case is defined as the sum of

$$\begin{aligned} {\mathcal {L}}_{\text {int}}\left( \varvec{\theta };{\mathcal {T}}_{\text {int}}\right)&= \sum _{j=1}^{N_{\text {int}}} w_{j,\text {int}}\vert {\hat{y}}_t (\varvec{x}_j;\varvec{\theta })- \varDelta {\hat{y}} (\varvec{x}_j;\varvec{\theta })\vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{\text {int}}, \\ {\mathcal {L}}_{\varGamma _D}\left( \varvec{\theta };{\mathcal {T}}_{\varGamma _D}\right)&= \sum _{j=1}^{N_{b}} w_{j, b}\vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{\varGamma _D}, \\ {\mathcal {L}}_{t=0}\left( \varvec{\theta };{\mathcal {T}}_{t=0}\right)&= \sum _{j=1}^{N_{0}} w_{j, 0}\vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) - y^0(\varvec{x}_j) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=0}, \\ {\mathcal {L}}_{t=T}\left( \varvec{\theta };{\mathcal {T}}_{t=T}\right)&= \sum _{j=1}^{N_{T}} w_{j, T}\vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=T}. \end{aligned}$$

2.3 Extension to General Evolution PDE Systems

Consider now a general evolution system of the form

$$\begin{aligned} \begin{array}{ll} y_{t} + A y=0, &\quad {\text {in}}\; Q_T, \\ y(x,0)=y^0(x), &\quad {\text {in}}\; \varOmega, \\ y(x,t)=0, &\quad {\text {on}}\; \varGamma _D\times \left( 0,T\right), \\ y(x,t)=u(x,t), &\quad {\text {on}}\; \varGamma _C\times \left( 0,T\right) , \end{array} \end{aligned}$$
(9)

where A is a generic (linear or nonlinear) operator, and the state \(y=y(x,t)\) is, in general, a vector function.

As in the preceding two cases, the goal is to find a positive time T and a control function \(u(x,t)\) such that the solution to (9) satisfies

$$\begin{aligned} y(x,T)=0,\quad x\in \varOmega . \end{aligned}$$
(10)

The PINNs algorithm described above for the wave and heat equations also applies in this general framework with only a few changes. Actually, the only step that must be updated is the definition of the loss function, which now takes the form:

$$\begin{aligned} {\mathcal {L}}_{\text {int}}\left( \varvec{\theta };{\mathcal {T}}_{\text {int}}\right)&= \sum _{j=1}^{N_{\text {int}}} w_{j,\text {int}}\Vert {\hat{y}}_t (\varvec{x}_j;\varvec{\theta }) + A {\hat{y}} (\varvec{x}_j;\varvec{\theta })\Vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{\text {int}}, \\ {\mathcal {L}}_{\varGamma _D}\left( \varvec{\theta };{\mathcal {T}}_{\varGamma _D}\right)&= \sum _{j=1}^{N_{b}} w_{j, b}\Vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \Vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{\varGamma _D},\\ {\mathcal {L}}_{t=0}\left( \varvec{\theta };{\mathcal {T}}_{t=0}\right)&= \sum _{j=1}^{N_{0}} w_{j, 0}\Vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) - y^0(\varvec{x}_j) \Vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=0}, \\ {\mathcal {L}}_{t=T}\left( \varvec{\theta };{\mathcal {T}}_{t=T}\right)&= \sum _{j=1}^{N_{T}} w_{j, T}\Vert {\hat{y}}(\varvec{x}_j;\varvec{\theta }) \Vert ^2, \quad \varvec{x}_j\in {\mathcal {T}}_{t=T}, \end{aligned}$$

where \(\Vert \cdot \Vert \) stands for the Euclidean norm.

3 Estimates on Generalization Error

This section aims at obtaining error estimates for the so-called generalization error for both control and state. The generalization error for the control variable u is defined by

$$\begin{aligned} {\mathcal {E}}_{\text {gener}}\left( u\right) :=\Vert u- {\hat{u}} \Vert , \end{aligned}$$
(11)

where \(u=u(x,t)\) is the exact control of minimal \(L^2\)-norm of the continuous problem, \({\hat{u}}={\hat{u}}\left( x,t;\varvec{\theta }^*\right) \) is its numerical approximation obtained from the algorithm proposed above, and \(\Vert \cdot \Vert \) is an appropriate norm. The generalization error for the state variable is similarly defined.

The generalization error (11) is typically decomposed into the approximation error, which is due to the choice of the hypothesis space (two-layer, multilayer, residual, convolutional neural networks, etc.), and the estimation error, due to the fact that the surrogate control \({\hat{u}}\) is computed from a finite dataset. Of course, the generalization error also depends in a crucial way on the specific algorithm used for training. In particular, PINN solutions are obtained by solving highly nonconvex optimization problems whose iterates typically get stuck in local minima. Estimating this optimization error is a very challenging open problem.

Error estimates for the approximation error of some hypothesis spaces are by now well known. For instance, for the Barron space of two-layer neural networks, the approximation error in the \(L^2\)-norm scales as \({\mathcal {O}}\left( m^{-1/2}\right) \), with m being the number of neurons in the network, independently of the dimension d. As for the estimation error, it is also known that the Rademacher complexity of the Barron space, which controls the estimation error, decays at the Monte Carlo rate \({\mathcal {O}}\left( N^{-1/2}\right) \), where N is the number of sampling points used for training. We refer the reader to [42] and the references therein for more details on this issue. In particular, these results support the choice of the multilayer neural networks of Sect. 2.

Regarding the PINNs algorithm for solving PDEs, convergence results with respect to the number of sampling points used for training have recently been obtained in [38] for second-order linear elliptic and parabolic equations with smooth solutions. It is also worth mentioning the article [27], where error estimates, in terms of the training error and the number of sampling points, are derived for the generalization error of a class of data assimilation problems.

Following [27], we next prove error estimates for the control and the state for the class of controllability problems considered here. The two key ingredients to obtain such error bounds are observability inequalities and error estimates for quadrature rules. The precise observability inequalities needed in our setting are detailed in the next subsections. Quadrature error estimates are very well known in the literature, but for the sake of completeness we now recall some basic concepts and results on this issue.

3.1 Error Estimates for Quadrature Rules

For a given function \(f:{\mathcal {D}}\subset {\mathbb {R}}^d\rightarrow {\mathbb {R}}\), a quadrature rule approximating the integral

$$\begin{aligned} {\overline{f}}:=\int _{{\mathcal {D}}}f(x)\, dx \end{aligned}$$

is defined by

$$\begin{aligned} {\overline{f}}_N:=\sum _{j=1}^N w_jf(x_j), \end{aligned}$$

where \((x_j, w_j)\), \(1\le j\le N\), are the nodes and weights of the quadrature rule. Quadrature errors depend on the specific rule used, on the smoothness of the integrand f, and on the dimension d. For regular functions in low dimensions, one typically uses Gauss or Clenshaw–Curtis rules. Rules based on low-discrepancy sequences, such as Sobol sequences, are the rules of choice in intermediate dimensions [39]. In both cases, error estimates for these quadrature rules take the general form

$$\begin{aligned} \vert {\overline{f}} - {\overline{f}}_N\vert \le C_qN^{-\alpha },\quad \alpha >0, \end{aligned}$$
(12)

where \(\alpha \) depends on the regularity of f, and the constant \(C_q=C_q(d)\), which also depends on f and its derivatives, blows up as \(d\rightarrow \infty \). Monte Carlo integration, by contrast, is immune to the curse of dimensionality and applies to non-smooth integrands. As is well known, the error estimate in that case is of the form (12) with \(C_q\) independent of the dimension d and \(\alpha =1/2\).
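The two regimes can be observed empirically with a few lines of code. The following small script, an illustration under stated assumptions (equal weights \(w_j=1/N\) and the smooth integrand \(f(x)=\prod _{i=1}^d \frac{\pi }{2}\sin (\pi x_i)\), whose exact integral over \([0,1]^d\) is 1), compares Sobol and Monte Carlo errors for \(d=5\); the Sobol error decays at a rate close to \(N^{-1}\), the Monte Carlo error close to \(N^{-1/2}\):

```python
import numpy as np
from scipy.stats import qmc

d = 5
f = lambda x: np.prod(0.5 * np.pi * np.sin(np.pi * x), axis=1)
exact = 1.0  # each one-dimensional factor integrates to 1 on [0, 1]

rng = np.random.default_rng(0)
for N in [2 ** k for k in range(8, 15)]:
    e_qmc = abs(f(qmc.Sobol(d, scramble=True, seed=0).random(N)).mean() - exact)
    e_mc = abs(f(rng.random((N, d))).mean() - exact)
    print(f"N = {N:6d}   Sobol: {e_qmc:.2e}   Monte Carlo: {e_mc:.2e}")
```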

3.2 Wave Equation

The generalization error in the control variable u due to the PINN algorithm proposed in Sect. 2.1 is defined as

$$\begin{aligned} {\mathcal {E}}_{\text {gener}}\left( u\right) :=\Vert u- {\hat{u}} \Vert _{L^2\left( \varGamma _C\times \left( 0,T\right) \right) }, \end{aligned}$$
(13)

where \(u=u(x,t)\) is the exact control of the continuous problem (1)–(2) and \({\hat{u}}={\hat{u}}\left( x,t;\varvec{\theta }^*\right) \) is its numerical approximation given by (6).

Similarly, the generalization error for the state variable is defined by

$$\begin{aligned} {\mathcal {E}}_{\text {gener}}\left( y\right) :=\Vert y- {\hat{y}} \Vert _{C\left( 0,T; L^2(\varOmega )\right) \cap C^1\left( 0,T; H^{-1}(\varOmega )\right) }. \end{aligned}$$
(14)

In the usual machine-learning terminology, the training error of the PINNs algorithm is given by

$$\begin{aligned} {\mathcal {E}}_{\text {train}} :={}& {\mathcal {E}}_{\text {train, int}} + {\mathcal {E}}_{\text {train, boundary}} + {\mathcal {E}}_{\text {train, initialpos}} + {\mathcal {E}}_{\text {train, initialvel}} \\ &+ {\mathcal {E}}_{\text {train, finalpos}} + {\mathcal {E}}_{\text {train, finalvel}}, \end{aligned}$$
(15)

where

$$\begin{aligned} \begin{array}{ll} {\mathcal {E}}_{\text {train, int}} &= \left( {\mathcal {L}}_{\text {int}}\left( \varvec{\theta }^*;{\mathcal {T}}_{\text {int}}\right) \right) ^{1/2},\\ {\mathcal {E}}_{\text {train, boundary}} &= \left( {\mathcal {L}}_{\varGamma _D}\left( \varvec{\theta }^*;{\mathcal {T}}_{\varGamma _D}\right) \right) ^{1/2}, \\ {\mathcal {E}}_{\text {train, initialpos}} &= \left( {\mathcal {L}}_{t=0}^{\text {pos}}\left( \varvec{\theta }^*;{\mathcal {T}}_{t=0}\right) \right) ^{1/2}, \\ {\mathcal {E}}_{\text {train, initialvel}} &= \left( {\mathcal {L}}_{t=0}^{\text {vel}}\left( \varvec{\theta }^*;{\mathcal {T}}_{t=0}\right) \right) ^{1/2}, \\ {\mathcal {E}}_{\text {train, finalpos}} &= \left( {\mathcal {L}}_{t=T}^{\text {pos}}\left( \varvec{\theta }^*;{\mathcal {T}}_{t=T}\right) \right) ^{1/2}, \\ {\mathcal {E}}_{\text {train, finalvel}} &= \left( {\mathcal {L}}_{t=T}^{\text {vel}}\left( \varvec{\theta }^*;{\mathcal {T}}_{t=T}\right) \right) ^{1/2}, \end{array} \end{aligned}$$
(16)

and \(\varvec{\theta }^*\) is as in (5).

Next, we recall classical observability and energy inequalities for the wave equation:

Lemma 3.1

Assume that the domain \(\varOmega \) satisfies the geometric control condition [3], and let \(T>0\) be large enough. Given initial and final conditions \((z^0_0, z^1_0), (z^0_T, z^1_T)\in L^2\left( \varOmega \right) \times H^{-1}\left( \varOmega \right) \), there exists a control function \(v\in L^2\left( \varGamma _C\times (0,T)\right) \) such that the solution z(x,t) of the system

$$\begin{aligned} \begin{array}{ll} z_{tt}-\varDelta z=0, &\quad {\text {in}}\; Q_T, \\ z(x,0)=z^0_0(x), &\quad {\text {in}}\; \varOmega, \\ z_t(x,0)=z^1_0(x), &\quad {\text {in}}\; \varOmega, \\ z(x,t)=0, &\quad {\text {on}}\; \varGamma _D\times (0,T), \\ z(x,t)=v(x,t), &\quad {\text {on}}\; \varGamma _C\times (0,T) \end{array} \end{aligned}$$
(17)

satisfies

$$\begin{aligned} z(x,T)=z^0_T(x),\quad z_t(x,T)=z^1_T(x), \quad x\in \varOmega . \end{aligned}$$
(18)

Moreover,

$$\begin{aligned} \Vert v \Vert _{L^2\left( \varGamma _C\times (0,T)\right) }\le C_o\left( \Vert z^0_0\Vert _{L^2\left( \varOmega \right) } + \Vert z^1_0\Vert _{H^{-1}\left( \varOmega \right) } + \Vert z^0_T\Vert _{L^2\left( \varOmega \right) }+ \Vert z^1_T\Vert _{H^{-1}\left( \varOmega \right) }\right) , \end{aligned}$$
(19)

for a positive constant \(C_o=C_o(\varOmega ,T)\) which depends on \(\varOmega \) and T, but is independent of the initial and final data.

Lemma 3.2

Let \(\left( z_0^0,z_0^1\right) \in L^2(\varOmega )\times H^{-1}(\varOmega )\), \(f\in L^2\left( 0,T;L^2(\varOmega )\right) \), and \(g\in L^2\left( \partial \varOmega \times (0,T)\right) \). Consider the non-homogeneous system

$$\begin{aligned} \begin{array}{ll} z_{tt}-\varDelta z=f(x,t), &\quad {\text {in}}\; Q_T, \\ z(x,0)=z^0_0(x), &\quad {\text {in}}\; \varOmega, \\ z_t(x,0)=z^1_0(x), &\quad {\text {in}}\; \varOmega, \\ z(x,t)=g(x,t), &\quad {\text {on}}\; \partial \varOmega \times (0,T). \end{array} \end{aligned}$$
(20)

Then, there exists a positive constant \(C_e=C_e(\varOmega ,T)\) such that

$$\begin{aligned} \Vert z \Vert _{C\left( 0,T; L^2(\varOmega )\right) } + \Vert z_t \Vert _{C\left( 0,T; H^{-1}(\varOmega )\right) } \le C_e\left( \Vert z^0_0\Vert _{L^2\left( \varOmega \right) } + \Vert z^1_0\Vert _{H^{-1}\left( \varOmega \right) } + \Vert f\Vert _{L^2\left( 0,T;L^2(\varOmega )\right) } + \Vert g\Vert _{L^2\left( \partial \varOmega \times (0,T)\right) } \right) . \end{aligned}$$
(21)

We are now in a position to estimate the generalization error for our PINNs-based algorithm.

Theorem 3.1

Let \(y=y(x,t)\in C^2\left( \overline{Q_T}\right) \) be a classical solution of (1)–(2), and let \({\hat{y}}={\hat{y}}(x,t;\varvec{\theta }^*)\) be its PINN approximation obtained by the method proposed in Sect. 2.1; it is assumed that \({\hat{y}}\in C^2\left( \overline{Q_T}\right) \). Let \(u=u(x,t)\) and \({\hat{u}}={\hat{u}}\left( x,t;\varvec{\theta }^*\right) \) be the exact control of the continuous system (1)–(2) and its PINN approximation, respectively. Then, the following estimate for the generalization error in the control variable holds:

$$\begin{aligned} {\mathcal {E}}_{\text {gener}}\left( u\right) \le{}& C \left( {\mathcal {E}}_{\text {train, int}} + C_{q_{int}}^{1/2} N_{\text {int}}^{-\alpha _{int}/2} + {\mathcal {E}}_{\text {train, boundary}} + C_{q_{b}}^{1/2} N_{\text {b}}^{-\alpha _{b}/2} \right. \\ &+ {\mathcal {E}}_{\text {train, initialpos}} + C_{q_{ip}}^{1/2} N_{0}^{-\alpha _{ip}/2} + {\mathcal {E}}_{\text {train, initialvel}} + C_{q_{iv}}^{1/2} N_{0}^{-\alpha _{iv}/2} \\ &\left. + {\mathcal {E}}_{\text {train, finalpos}} + C_{q_{fp}}^{1/2} N_{T}^{-\alpha _{fp}/2} + {\mathcal {E}}_{\text {train, finalvel}} + C_{q_{fv}}^{1/2} N_{T}^{-\alpha _{fv}/2} \right) , \end{aligned}$$
(22)

where \(C=C(\varOmega , T)\); in particular, \(C=C(d)\) also depends on the spatial dimension d. The constants \(C_{q_{int}}, C_{q_{b}}, \dots \) and exponents \(\alpha _{int}, \alpha _{b}, \dots \) are those associated with the quadrature rules, as in (12).

A similar estimate, with different constants, also holds for the generalization error in the state variable, as given by (14).

Proof

Let \({\overline{y}} = y-{\hat{y}}\) and \({\overline{u}}=u-{\hat{u}}\) be the errors in the state and control variables, respectively. By linearity, \({\overline{y}}\) solves

$$\begin{aligned} \begin{array}{ll} {\overline{y}}_{tt}-\varDelta {\overline{y}} = {\hat{y}}_{tt}-\varDelta {\hat{y}}, &\quad {\text {in}}\; Q_T, \\ {\overline{y}}(x,0)=y^0(x)-{\hat{y}}(x,0), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}_t(x,0)=y^1(x)-{\hat{y}}_t(x,0), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}(x,T)={\hat{y}}(x,T), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}_t(x,T)={\hat{y}}_t(x,T), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}(x,t)={\hat{y}}(x,t), &\quad {\text {on}}\; \varGamma _D\times (0,T), \\ {\overline{y}}(x,t)=u(x,t)-{\hat{y}}(x,t), &\quad {\text {on}}\; \varGamma _C\times (0,T). \end{array} \end{aligned}$$
(23)

Again by linearity, \({\overline{y}}\) is decomposed as \({\overline{y}}={\overline{y}}^1+{\overline{y}}^2\), where \({\overline{y}}^1\) and \({\overline{y}}^2\) are, respectively, the solutions to

$$\begin{aligned} \begin{array}{ll} {\overline{y}}^1_{tt}-\varDelta {\overline{y}}^1 = 0, &\quad {\text {in}}\; Q_T, \\ {\overline{y}}^1(x,0)=y^0(x)-{\hat{y}}(x,0), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}^1_t(x,0)=y^1(x)-{\hat{y}}_t(x,0), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}^1(x,t)=0, &\quad {\text {on}}\; \varGamma _D\times (0,T), \\ {\overline{y}}^1(x,t)=u(x,t)-{\hat{y}}(x,t), &\quad {\text {on}}\; \varGamma _C\times (0,T) \end{array} \end{aligned}$$
(24)

and

$$\begin{aligned} \begin{array}{ll} {\overline{y}}^2_{tt}-\varDelta {\overline{y}}^2 = {\hat{y}}_{tt}-\varDelta {\hat{y}}, &\quad {\text {in}}\; Q_T, \\ {\overline{y}}^2(x,0)=0, &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}^2_t(x,0)=0, &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}^2(x,T)={\hat{y}}(x,T)-{\overline{y}}^1(x,T), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}^2_t(x,T)={\hat{y}}_t(x,T)-{\overline{y}}^1_t(x,T), &\quad {\text {in}}\; \varOmega, \\ {\overline{y}}^2(x,t)={\hat{y}}(x,t), &\quad {\text {on}}\; \varGamma _D\times (0,T), \\ {\overline{y}}^2(x,t)=0, &\quad {\text {on}}\; \varGamma _C\times (0,T). \end{array} \end{aligned}$$
(25)

Applying the observability inequality (19) to system (24) and the energy estimate (21) to (25), we obtain

$$\begin{aligned} \Vert u-{\hat{u}} \Vert _{L^2\left( \varGamma _C\times (0,T)\right) } \le{}& C_o\left( \Vert y^0-{\hat{y}}(0)\Vert _{L^2\left( \varOmega \right) } + \Vert y^1-{\hat{y}}_t(0)\Vert _{H^{-1}\left( \varOmega \right) } + \Vert {\overline{y}}^1(T)\Vert _{L^2\left( \varOmega \right) }+ \Vert {\overline{y}}^1_t(T)\Vert _{H^{-1}\left( \varOmega \right) } \right) \\ \le{}& C_o \left( \Vert y^0-{\hat{y}}(0)\Vert _{L^2\left( \varOmega \right) } + \Vert y^1-{\hat{y}}_t(0)\Vert _{L^2\left( \varOmega \right) } + \Vert {\hat{y}}(T)\Vert _{L^2\left( \varOmega \right) }+ \Vert {\hat{y}}_t(T)\Vert _{L^2\left( \varOmega \right) } \right. \\ &\left. + \Vert {\overline{y}}^2(T)\Vert _{L^2\left( \varOmega \right) }+ \Vert {\overline{y}}^2_t(T)\Vert _{H^{-1}\left( \varOmega \right) } \right) \\ \le{}& C_o \left( \Vert y^0-{\hat{y}}(0)\Vert _{L^2\left( \varOmega \right) } + \Vert y^1-{\hat{y}}_t(0)\Vert _{L^2\left( \varOmega \right) } + \Vert {\hat{y}}(T)\Vert _{L^2\left( \varOmega \right) }+ \Vert {\hat{y}}_t(T)\Vert _{L^2\left( \varOmega \right) } \right. \\ &\left. + C_e \left( \Vert {\hat{y}}\Vert _{L^2(\varGamma _D\times (0,T))} + \Vert {\hat{y}}_{tt}-\varDelta {\hat{y}} \Vert _{L^2 ( 0,T; L^2(\varOmega ))} \right) \right) . \end{aligned}$$
(26)

Estimate (22) then follows by applying (12). The corresponding estimate for the generalization error (14) is an immediate consequence of (21) and (22). \(\square \)

Although this dependence has not been made explicit above, it is clear that the generalization errors depend on the specific type and size of the neural network, as well as on the type and number of quadrature nodes, both of which are selected from the very beginning. Thus, denoting by \({\mathcal {H}}_m\) the hypothesis space considered for the numerical approximation, where m is the number of neurons (or free parameters) of the neural network, and by N the number of collocation points used for quadrature, we make this dependence explicit by writing

$$\begin{aligned} {\mathcal {E}}_{\text {gener}}\left( u\right) ={\mathcal {E}}^{m,N}_{\text {gener}}\left( u\right) \quad \text {and }\quad {\mathcal {E}}_{\text {gener}}\left( y\right) = {\mathcal {E}}^{m,N}_{\text {gener}}\left( y\right) . \end{aligned}$$

Next, the behavior of the generalization errors is analyzed when the size m of single-layer neural networks goes to infinity and so does the sampling size (\(N\rightarrow \infty \)).

Let us consider the hypothesis space of single-layer neural nets

$$\begin{aligned} {\mathcal {H}}_m:=\left\{ y_m:\ y_m(\varvec{x})=\sum _{i=1}^m a_i\sigma \left( \varvec{\omega }_i\cdot \varvec{x}+b_i\right) ,\ \varvec{x}\in {\mathbb {R}}^{d+1},\ \varvec{\omega }_i\in {\mathbb {R}}^{d+1},\ a_i, b_i\in {\mathbb {R}} \right\} . \end{aligned}$$

The training process (5) may be rewritten in the equivalent form

$$\begin{aligned} {\hat{y}}_{m,N}=\arg \min _{y_m\in {\mathcal {H}}_m}{\mathcal {L}}\left( y_m;{\mathcal {T}}_N\right) . \end{aligned}$$
(27)

From now on it is assumed that the optimization problem (27) has a solution. Otherwise, one can always add a regularization term of the form \(\Vert \varvec{\theta }\Vert ^2\).

We now recall the following universal approximation theorem due to Pinkus [33, Th. 4.1].

Theorem 3.2

Let \(f\in C^k({\mathbb {R}}^{d+1})\). Assume that the activation function \(\sigma \in C^k({\mathbb {R}})\) is not a polynomial. Then, for any compact set \(K\subset {\mathbb {R}}^{d+1}\) and any \(\varepsilon >0\), there exist \(m\in {\mathbb {N}}\) and \(y_m\in {\mathcal {H}}_m\) such that

$$\begin{aligned} \max _{\varvec{x}\in K}\vert D^lf(\varvec{x})-D^ly_m(\varvec{x})\vert \le \varepsilon \end{aligned}$$

for all multi-indices \(l\) with \(\vert l\vert \le k\).

Corollary 3.1

Assume that the activation function \(\sigma \in C^k({\mathbb {R}})\), \(k\ge 2\), is not a polynomial. Under the same assumptions as in Theorem 3.1, and up to subsequences (still labeled by m and N), one has

$$\begin{aligned} \lim _{N\rightarrow \infty }\lim _{m\rightarrow \infty }{\mathcal {E}}^{m,N}_{\text {gener}} \left( u\right) =\lim _{N\rightarrow \infty }\lim _{m\rightarrow \infty } {\mathcal {E}}^{m,N}_{\text {gener}}\left( y\right) =0. \end{aligned}$$
(28)

Proof

Fix \(\varepsilon >0\), and apply Theorem 3.2 with \(K=\overline{Q_T}\) and \(f=y\), the solution of the controllability problem (1)–(2). Then, there exist \(m=m(\varepsilon )\in {\mathbb {N}}\) and a corresponding \(y_m\in {\mathcal {H}}_m\) such that

$$\begin{aligned} &\Vert (y_m)_{tt}-\varDelta y_m \Vert _{L^2 ( 0,T; L^2(\varOmega ))} + \Vert y_m\Vert _{L^2(\varGamma _D\times (0,T))} + \Vert y^0-y_m(0)\Vert _{L^2\left( \varOmega \right) } \\ &\quad + \Vert y^1-(y_m)_t(0)\Vert _{L^2\left( \varOmega \right) } + \Vert y_m(T)\Vert _{L^2\left( \varOmega \right) }+ \Vert (y_m)_t(T)\Vert _{L^2\left( \varOmega \right) } \le \varepsilon /2. \end{aligned}$$
(29)

Each of the terms on the left-hand side of (29) is now expressed by means of a quadrature rule with collocation nodes \({\mathcal {T}}_N\). Then, taking into account the optimality of \({\hat{y}}_{m,N}\), as given by (27), one deduces that the sum of the training errors appearing on the right-hand side of (22) is less than or equal to \(\varepsilon /2\).

Moreover, for fixed \(m=m(\varepsilon )\), there exists N such that the sum of quadrature errors in (22) is also less than or equal to \(\varepsilon /2\). Thus, \({\mathcal {E}}^{m,N}_{\text {gener}}\left( u\right) \le \varepsilon \). The arbitrariness of \(\varepsilon \) gives the result for the generalization error in the control variable. The case of the state variable is completely analogous. \(\square \)

3.3 Extension to Other PDE Systems and Neural Network Architectures

It is clear that the arguments and conclusions of Theorem 3.1 and Corollary 3.1 extend to any linear system of PDEs for which observability as well as energy inequalities similar to those in (19) and (21) hold. Linearity of the PDE is used in an essential way in the proof of Theorem 3.1. Thus, a different argument is needed to extend this result to the case of nonlinear PDEs.

The proof of Corollary 3.1 relies on the universal approximation theorem by Pinkus for the case of single-layer neural networks. Thus, the conclusion of Corollary 3.1 also holds for other neural network architectures for which such a density result is true.

4 Numerical Experiments

In this section, we test the performance of the proposed method on three exact controllability problems. The first experiment checks the accuracy of the method on a very simple controllability problem for the wave equation for which an exact solution is explicitly known. In the second experiment, the high-dimensional situation is tested on a controllability problem for the heat equation. The last experiment considers a semilinear wave equation.

As indicated at the beginning of Sect. 3, the optimization error due to the gradient-based algorithms used for training is a key ingredient of the total error associated with the proposed PINN algorithm. This error is not accounted for in Theorem 3.1 and Corollary 3.1, whereas the numerical simulation results presented in this section do incorporate it. As a consequence, the simulation results cannot illustrate the theoretical findings of Sect. 3 with full accuracy: the gap between the theoretical error estimates and the simulation results is accounted for by the optimization error of the training process.

In all the experiments that follow, a multilayer neural network, as described in Sect. 2, with \(\tanh \) as the activation function, is used. Sobol quadrature nodes [39] are employed for training the neural network. The training process, i.e., the minimization of \({\mathcal {L}}\left( \varvec{\theta };{\mathcal {T}}_N\right) \), is carried out with the ADAM optimizer [19] with learning rate \(10^{-3}\) for the first 20000 epochs. Then, an L-BFGS optimizer [7] is employed to accelerate convergence. The required gradients are computed by automatic differentiation [2]. The descent algorithm is initialized with the Glorot uniform initializer [14]. As is well known, the results obtained from gradient-based optimizers depend on the initialization. A common practice to deal with this issue is to perform ensemble training [24]. However, the use of this and other more sophisticated techniques (residual-based adaptive refinement (RAR) [23], dropout [40], batch normalization [18], etc.) is not the purpose of this paper, which aims at illustrating the possible use of PINNs in the topic of controllability of PDEs.
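A sketch of this two-stage optimization in PyTorch follows (an illustration consistent with the settings just described; the actual experiments use DeepXDE, see Sect. 6; `loss_fn` is assumed to evaluate the loss (4) on the fixed training set):

```python
import torch

def train(model, loss_fn, n_adam=20000):
    """Stage 1: Adam with learning rate 1e-3. Stage 2: L-BFGS refinement."""
    adam = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(n_adam):
        adam.zero_grad()
        loss_fn().backward()
        adam.step()

    lbfgs = torch.optim.LBFGS(model.parameters(), max_iter=500,
                              line_search_fn="strong_wolfe")

    def closure():
        lbfgs.zero_grad()
        loss = loss_fn()
        loss.backward()
        return loss

    lbfgs.step(closure)
    return model
```

After training, the PINN control is read off as the trace (6); e.g., for boundary control at \(x=1\), one evaluates the trained network at the points \((1,t_j)\) of a time grid.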

4.1 Experiment 1: Linear Wave Equation

We consider the control system (1)–(2) in the domain \(\varOmega = \left( 0,1\right) \) for the data

$$\begin{aligned} y^0(x)=\sin \left( \pi x\right) ,\quad y^1(x)=0, \quad 0\le x\le 1, \end{aligned}$$

and for the control time \(T=2\). An explicit solution of the problem is easily obtained from d'Alembert's formula. Indeed, considering the function

$$\begin{aligned} \tilde{y^0}(x)=\left\{ \begin{array}{ll} \sin \left( \pi x\right) , & -1\le x\le 1,\\ 0, & \text {elsewhere,} \end{array} \right. \end{aligned}$$

the explicit exact state is given by

$$\begin{aligned} y(x,t)=\frac{1}{2}\left( \tilde{y^0}(x-t)+\tilde{y^0}(x+t)\right) ,\quad 0\le x\le 1, \,\, 0\le t\le 2, \end{aligned}$$
(30)

and the exact control is

$$\begin{aligned} u(t)=\left\{ \begin{array}{ll} \frac{1}{2}y^0\left( 1-t\right) , & 0\le t\le 1, \\ -\frac{1}{2}y^0\left( t-1\right) , & 1\le t\le 2. \end{array} \right. \end{aligned}$$
(31)

Remark 4.1

We notice that the control given by (31) is the one of minimal \(L^2\)-norm. This is no longer true if the initial velocity \(y^1\) is different from zero (see [16, Section 4.1] for details).
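For later comparison, the exact state (30) and control (31) are straightforward to evaluate; here is a short NumPy sketch (the function names are ours):

```python
import numpy as np

def y0_tilde(x):
    """Extension of y^0 to the real line entering d'Alembert's formula."""
    return np.where(np.abs(x) <= 1.0, np.sin(np.pi * x), 0.0)

def y_exact(x, t):
    """Exact state (30)."""
    return 0.5 * (y0_tilde(x - t) + y0_tilde(x + t))

def u_exact(t):
    """Exact minimal L^2-norm control (31)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 1.0,
                    0.5 * np.sin(np.pi * (1.0 - t)),
                    -0.5 * np.sin(np.pi * (t - 1.0)))
```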

The efficiency of the proposed PINN-based algorithm in approximating the solution of this problem is analyzed next. The generalization error in the control variable, \({\mathcal {E}}_{\text {gener}}(u):=\Vert u-{\hat{u}}\Vert _{L^2(0,T)}\), the \(L^2\)-relative error \(\Vert u-{\hat{u}}\Vert _{L^2(0,T)}/\Vert u\Vert _{L^2(0,T)}\), and the total training error \({\mathcal {E}}_{\text {train}}:={\mathcal {L}}\left( \varvec{\theta }^*;{\mathcal {T}}_N\right) \) are computed for several values of the total number N of training points and for several neural network architectures. The effect of regularization, in which a term \(\lambda _{\text {reg}}\Vert \varvec{\theta }\Vert _{2}^2\), with \(\lambda _{\text {reg}} >0\), is added to the loss function (4), is also studied.

Once the training process is finished and the optimal set of parameters \(\varvec{\theta }^*\) is obtained, the PINN control \({\hat{u}}(t;\varvec{\theta }^*) = {\hat{y}}\left( 1,t;\varvec{\theta }^*\right) \) is computed on a uniform mesh of size \(h=0.02\) on the segment \(\left\{ 1\right\} \times (0,2)\), \(0\le t\le 2\). Both the generalization error and the \(L^2\)-relative error are then approximated on the same mesh, as sketched below. The training points are split into interior and boundary points as follows: for a given positive integer \(N_0\), \(3N_0\) points are placed on the boundary and \(N_0^2\) in the interior. Thus, \(N=N_0^2+3N_0\).
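With the exact control from the snippet above, these two errors can be approximated as follows (`predict_control` is a hypothetical helper returning the trained network's trace \({\hat{y}}(1,t;\varvec{\theta }^*)\) on the time grid):

```python
import numpy as np
from scipy.integrate import trapezoid

h = 0.02
t = np.arange(0.0, 2.0 + h, h)       # uniform mesh on [0, T], T = 2

u = u_exact(t)                       # exact control (31)
u_hat = predict_control(t)           # hypothetical: NN trace at x = 1

E_gener = np.sqrt(trapezoid((u - u_hat) ** 2, t))   # ||u - u_hat||_{L^2(0,T)}
E_rel = E_gener / np.sqrt(trapezoid(u ** 2, t))     # L^2-relative error
```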

Tables 1 and 2 collect the simulation results for a multilayer neural network composed of 4 hidden layers with 50 neurons each. It is observed that both the generalization error and the \(L^2\)-relative error decrease slowly as the number of training points increases. The comparison between Tables 1 and 2 shows that regularization does not increase the level of accuracy. Table 3 displays simulation results for a single-layer architecture with the same total number of neurons as the multilayer network considered in Tables 1 and 2. It is observed that the single-layer architecture provides slightly less accurate results.

Figure 4 shows the exact control (31) together with the PINN control \({\hat{u}}\left( t;\varvec{\theta }^*\right) \), as well as the error between the exact and PINN states.

The effect of increasing the depth (number of hidden layers) and the width (number of neurons per layer) of the neural network has also been tested. We have observed that the accuracy of the solutions does not improve significantly. This is in agreement with previous studies (see, e.g., [23]) showing that a relatively small neural network is able to accurately approximate smooth solutions of PDEs.

Table 1 Experiment 1 (linear wave equation): no regularization. Number of training points N versus generalization error \({\mathcal {E}}_{\text {gener}}(u)\), \(L^2\)-relative error, and training error \({\mathcal {E}}_{\text {train}}\) for a multilayer neural network composed of 4 hidden layers and 50 neurons in each layer
Table 2 Experiment 1 (linear wave equation): regularization with \(\lambda _{\text {reg}}= 10^{-7}\). Number of training points N versus generalization error \({\mathcal {E}}_{\text {gener}}(u)\), \(L^2\)-relative error, and training error \({\mathcal {E}}_{\text {train}}\) for a multilayer neural network composed of 4 hidden layers and 50 neurons in each layer
Table 3 Experiment 1 (linear wave equation): no regularization. Number of training points N versus generalization error \({\mathcal {E}}_{\text {gener}}(u)\), \(L^2\)-relative error, and training error \({\mathcal {E}}_{\text {train}}\) for a single-layer neural network composed of 200 neurons
Fig. 4
figure 4

Experiment 1 (linear wave equation). Comparison between the exact control u(t) and the PINN (or predicted) control \({\hat{u}}(t;\varvec{\theta }^{*})\) (left), and the error between the exact state and the PINN state, i.e., \(y(x,t)-{\hat{y}}(x,t;\varvec{\theta }^{*})\) (right). Neural network composed of 4 hidden layers and 50 neurons in each layer. No regularization. Number of training points \(N=10300\)

4.2 Experiment 2: Linear Heat Equation

In this experiment, we consider the heat system (7)–(8) for \(\varOmega = \left( 0,1\right) ^d\) with \(d=1, 5, 10\), and 20.

The One-Dimensional Case. For comparison purposes, the case \(d=1\) is addressed first. The first mode of the Laplacian, \(y^0(x)=\sin \left( \pi x\right) \), \(0<x<1\), is taken as the initial condition. At \(x=0\), a homogeneous Dirichlet boundary condition is imposed. The control acts at the endpoint \(x=1\). In order to have better control of the diffusion, the Laplacian \(\varDelta \) is replaced by \(\kappa \varDelta \), with \(\kappa = 0.25\). The control time is \(T = 0.5\). This experiment was previously considered in [30, Subsection 5.1]. Figure 5 shows the predicted state (left) and control (right) obtained with the PINN algorithm described in Sect. 2.2 for a feedforward neural network composed of 5 hidden layers and 100 neurons in each layer. The number of training points is \(N=10300\). Once the training process is completed, the training error for the controllability condition \(y(x,T)=0\), \(0<x<1\), which provides an approximation of \(\Vert y\left( \cdot ,T\right) \Vert _{L^2\left( \varOmega \right) }\), is \(1.17\times 10^{-5}\). It is observed in Fig. 5 that both the PINN control and state have profiles similar to those in [30, Figures 2 and 4 (left)]. However, no oscillations near the final time appear in the PINN control. This does not contradict the results in [30], since it is well known that null controls are not unique.

Fig. 5
figure 5

Experiment 2 (linear heat equation). PINN (or predicted) state (left) and PINN control (right). Neural network composed of 5 hidden layers and 100 neurons in each layer. Number of training points \(N=10300\)

The Multi-Dimensional Case. In order to check the accuracy of the proposed method in high dimensions, we consider the following control-to-trajectory problem for \(\varOmega = (0,1)^d\) and \(T=1\):

$$\begin{aligned} \begin{array}{ll} y_{t}-\varDelta y=0, &\quad {\text {in}}\; Q_T:=\varOmega \times \left( 0, T\right), \\ y(x,0)=\frac{\Vert x\Vert ^2}{d}, &\quad {\text {in}}\; \varOmega, \\ y(x,t)=u(x,t), &\quad {\text {on}}\; \partial \varOmega \times \left( 0,T\right), \\ y(x, T) = \frac{\Vert x\Vert ^2}{d} + 2, &\quad {\text {in}}\; \varOmega . \end{array} \end{aligned}$$
(32)

This problem has an explicit solution [28], given by \(y(x,t)=\frac{\Vert x\Vert ^2}{d} + 2t\), \(x\in \varOmega \), \(0\le t\le T\); indeed, \(y_t=2\) and \(\varDelta y = \frac{1}{d}\,\varDelta \Vert x\Vert ^2 = \frac{2d}{d}=2\), so the heat equation and the initial and final conditions in (32) are satisfied. The control function is obtained as the trace of y on \(\partial \varOmega \). Table 4 displays the \(L^2\)-relative error in the state variable and the training error. It is observed that even in high dimensions the relative error in the state variable is very low. The accuracy is similar to that obtained for forward PDEs via PINNs [28].

Table 4 Experiment 2 (linear heat equation): dimension versus \(L^2\)-relative error in the state variable and training error \({\mathcal {E}}_{\text {train}}\). Multilayer neural network composed of 4 hidden layers and 50 neurons in each layer. Number of training points \(N=23000\)

4.3 Experiment 3: A Semilinear Wave Equation

Next, we consider a nonlinear situation, namely a semilinear wave equation. Positive exact controllability results for semilinear wave equations have been obtained, among others, in [31, 44, 45].

In this experiment, the following null controllability problem for a semilinear wave equation is considered:

$$\begin{aligned} \begin{array}{ll} y_{tt}- y_{xx}=4y^2, &\quad {\text {in}}\; \left( 0,1\right) \times \left( 0,2\right), \\ y(x,0)=1.5\sin \left( 3 \pi x\right) , &\quad {\text {in}}\; \left( 0,1\right), \\ y_t(x,0)=x^2, &\quad {\text {in}}\; \left( 0,1\right), \\ y(0,t)=0, &\quad {\text {on}}\; \left( 0,2\right), \\ y(1,t)=u(t), &\quad {\text {on}}\; \left( 0,2\right),\\ y(x,2)=y_t(x,2)=0, &\quad {\text {in}}\; \left( 0,1\right) . \end{array} \end{aligned}$$
(33)

This problem has been previously studied in [8, Subsection 4.2.1].
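In terms of the algorithm of Sect. 2, only the interior residual of the loss changes. With the differentiation helpers sketched after Step 3 (our illustrative names), the residual of (33) would read:

```python
# interior residual of the semilinear wave equation (33): y_tt - y_xx - 4 y^2
res = dt(dt(y, xt), xt) - laplacian(y, xt, d) - 4.0 * y ** 2
```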

The proposed PINN-based algorithm has been tested for different neural network architectures and numbers of training points. Table 5 collects the simulation results for all the contributions to the training error, as in (16). Recall that \( {\mathcal {E}}_{\text {train, int}} \) is the training error associated with the residual of the PDE, \({\mathcal {E}}_{\text {train, boundary}}\) corresponds to the boundary condition at \(x=0\), \( {\mathcal {E}}_{\text {train, initialpos}}\) and \( {\mathcal {E}}_{\text {train, initialvel}} \) are, respectively, the training errors for the initial position and velocity, and, finally, \( {\mathcal {E}}_{\text {train, finalpos}} \) and \( {\mathcal {E}}_{\text {train, finalvel}}\) are the training errors for the controllability conditions at the control time \(T=2\). It is observed in Table 5 that increasing the number of training points does not reduce the training errors significantly. This is in accordance with previous studies (see, e.g., the numerical experiments in [27]). Recall that the training error includes the optimization error due to the gradient-based descent algorithms used for the minimization of the highly nonconvex loss function (4), for which no information is available.

Table 5 Experiment 3 (semilinear wave equation): Training error versus number of training points N for a neural network composed of 5 hidden layers and 100 neurons per layer

Figure 6 displays the numerical simulation results obtained with a multilayer neural network composed of 5 hidden layers and 100 neurons in each layer. For this particular example, no explicit solution is known, so it is not possible to check the accuracy of the method. In addition, as mentioned in the preceding experiment, the control is not unique. Nonetheless, a comparison between Fig. 6 here and Figure 2 in [8] shows that the results are very similar.

Fig. 6
figure 6

Experiment 3 (semilinear wave equation). PINN (or predicted) state \({\hat{y}}(x,t;\varvec{\theta }^*)\) (left) and PINN control \({\hat{u}}(t;\varvec{\theta }^{*})\) (right). Neural network composed of 5 hidden layers and 100 neurons in each layer. Number of training points \(N=5850\)

5 Conclusions

Even though highly accurate methods are available for approximating numerically a wide range of controllability problems for PDEs, the applicability of these methods to high-dimensional problems is questionable due to the well-known curse of dimensionality phenomenon.

The present paper provides a first attempt to overcome this difficulty. It relies on the use of modern deep-learning-based methods, in particular on the so-called physics-informed neural networks. More precisely, a PINNs-based method has been proposed for the numerical approximation of controllability problems for PDEs, both linear and nonlinear. The problem is formulated as the least-squares minimization of a loss function that accounts for the residuals of the PDE and of its initial, boundary, and final conditions. The main novelties here with respect to more classical numerical methods in control of PDEs are: (i) a feedforward neural network is used to approximate both the state and the control variables, and (ii) the method is mesh-free. In addition, it is important to emphasize that, although deep-learning-based methods have found great success in many applications, no theoretical results had appeared in this field so far. In this respect, estimates for the generalization error (in both the control and state variables) in terms of training and quadrature errors have been obtained in this paper. It is also proved that the generalization error vanishes as the size of the neural network and the number of training points go to infinity. An important feature of these theoretical results is that they apply to any controllability problem for a linear PDE in any dimension, and so PINNs qualify as a promising tool to deal with high-dimensional problems.

The accuracy in our numerical experiments is similar to the one obtained by using the PINN algorithm [34] for solving forward problems for PDEs. This is not surprising since the proposed method is a PINN-based algorithm for solving PDEs where final conditions are added to the picture and the control is obtained as the trace of the solution of the PDE.

There are many interesting questions that remain open. Some of them are:

  • Since the constants that appear in our estimates of the generalization error come from energy and observability inequalities, they depend on the spatial dimension d. To what extent these estimates break the curse of dimensionality is a very interesting open problem.

  • Although it was proved that the training error converges to zero as the size of the network and the number of training points increase, to the best of the authors' knowledge, estimating the training error remains a very challenging open problem. It is clear that training errors can be estimated a posteriori. Nonetheless, a posteriori estimates of training errors are, in general, not sharp, since the training error incorporates the error due to the numerical solution of highly nonconvex optimization problems whose iterates get stuck in local minima. This issue has been observed in our numerical experiments, where increasing the size of the neural network and the number of training points produces a very slow decrease of the training error.

  • Proving estimates of the generalization error for controllability problems for nonlinear PDEs and for other types of control actions (e.g., distributed controls) is another interesting open problem.

6 Reproducibility

The implementation of the numerical experiments presented in Sect. 4 has been performed with the user-friendly Python library DeepXDE [23], which is available at https://github.com/lululxvi/deepxde. Python scripts for the three experiments can be downloaded from https://github.com/fperiago/deepcontrol.