1 Introduction

1.1 Motivation

In recent years, access to enormous quantities of data combined with rapid advances in machine learning has yielded outstanding results in computer vision, recommendation systems, medical diagnosis, and financial forecasting [1]. Nonetheless, the impact of learning algorithms reaches far beyond these fields and has already found its way into many scientific disciplines [2].

The rapidly growing interest in machine learning, both in general and within computational mechanics, is well documented in the scientific literature. This interest can be quantified by counting the number of publications treating “Artificial Intelligence”, “Machine Learning”, “Deep Learning”, and “Neural Networks”. Figure 1a shows the trend in all journals of Elsevier and Springer since 1999, while Fig. 1b depicts the trend within the computational mechanics community by considering representative journals at Elsevier and Springer. The trends before 2017 differ slightly, with a steady growth in general but only limited interest within computational mechanics. However, around 2017, both curves show a shift in trend, namely a vast increase in publications highlighting the interest and potential prospects of artificial intelligence and its subtopics for a variety of applications.

Due to the rapid growth [21] of research in the field of deep learning (see Fig. 1a), we provide an overview of the various deep learning methodologies in deterministic computational mechanics. To limit the scope of this work, we focus on deterministic approaches and problems within computational mechanics. Numerous review articles on deep learning for specific applications have already emerged (see [22, 23] for topology optimization, [24] for full waveform inversion, [25,26,27,28,29] for fluid mechanics, [30] for continuum mechanics, [31] for material mechanics, [32] for constitutive modeling, [33] for generative design, [34] for material design, and [35] for aeronautics). The aim of this work is, however, to focus on the general methods rather than on applications, since similar methods are often applied to different problems. This has the potential to bridge gaps between scientific communities by highlighting similarities between methods and thereby establishing clarity on the state of the art.

Fig. 1 Number of publications concerning artificial intelligence and some of its subtopics since 1999, showing the exponential growth of literature within the field. Illustration inspired by [40]

1.2 Taxonomy of deep learning techniques in computational mechanics

In order to discuss the deep learning methods in a structured manner, we introduce the following taxonomy:

  • simulation substitution (Sect. 2)

    • data-driven modeling (Sect. 2.1)

    • physics-informed learning (Sect. 2.2)

  • simulation enhancement (Sect. 3)

  • discretizations as neural networks (Sect. 4)

  • generative approaches (Sect. 5)

  • deep reinforcement learning (Sect. 6)

Simulation substitution replaces the entire simulation with a surrogate model, which in the context of deep learning is a deep neural network (NN). The model can be trained with supervised learning, which purely relies on labeled data and therefore is referred to as data-driven modeling. The generalization errors of these models can be reduced by physics-informed learning. Here, physics constraints are imposed on the learnable space such that only physically admissible solutions are learned.

Simulation enhancement instead only replaces components of the simulation chain, while the remaining parts are still handled by classical methods. Approaches within this category are strongly linked to their respective applications and will, therefore, be presented in the context of their specific use cases. Both data-driven and physics-informed approaches will be discussed.

Treating discretizations as neural networks is achieved by constructing a discretization from the basic building blocks of NNs, i.e., linear transformations and non-linear activation functions. Thereby, techniques within deep learning frameworks—such as automatic differentiation, gradient-based optimization, and efficient GPU-based parallelization—can be leveraged to improve classical simulation techniques.

Generative approaches deal with creating new content based on a data set. The goal is not, however, to recreate the data, but to generate statistically similar data. This is useful in diversifying the design space or enhancing a data set to train surrogate models.

Finally, in deep reinforcement learning, an agent learns how to interact with an environment in order to maximize rewards provided by the environment. In the case of deep reinforcement learning, the agent is modeled with NNs. In the context of computational mechanics, the environment is modeled by the governing physical equations. Reinforcement learning provides an alternative to gradient-based optimization, which is useful when gradient information is not available.

The proposed taxonomy is unique in that it arises from a methodological viewpoint, instead of an application-oriented [22,23,24,25,26,27,28,29,30,31,32,33,34,35] or problem-oriented [42] perspective. However, parallels can be drawn to the challenges and proposed areas of investigation in machine learning identified in [42]. Similarly, a distinction is made there between enhancement by machine learning and substitution by machine learning models. Additionally, challenges such as robustness, explainability, and the handling of complex and high-dimensional data are highlighted. The separation between physics-informed learning and data-driven modeling is also made by [42], as well as by [43]. Interestingly, older reviews [3, 4] arrived at similar categories, additionally including NNs as a means of more efficient implementations, i.e., discretizations as NNs. Only the last two proposed categories, generative approaches and deep reinforcement learning, have not been spotlighted as methodologies within reviews of computational mechanics. These are, however, well established within the machine learning community [44,45,46,47] and sufficiently distinct to be treated separately.

1.3 Deep learning

Before continuing with the topics specific to computational mechanics, NNs and the notation used throughout this work are briefly introduced. In essence, NNs are function approximators that are capable of approximating any continuous function [50]. The NN parametrized by the learnable parameters \(\varvec{\theta }\) (typically consisting of weights \({\varvec{w}}\) and biases \({\varvec{b}}\)) learns a function \({\hat{y}}=f_{NN}(x;\varvec{\theta })\), which approximates the relation \(y=f(x)\). The NN is constructed with nested linear transformations in combination with non-linear activation functions \(\sigma \). The most basic NNs, fully connected NNs, achieve this with layers of fully connected neurons (see Fig. 2), where the activation \(a_k^i\) of each neuron (the ith neuron of layer k) is obtained through a linear combination of the activations of the previous layer followed by the non-linear activation function \(\sigma \):

$$\begin{aligned} a_k^i=\sigma \left( \displaystyle \sum _{j=1}^{n} w_{kj}^i a_{k-1}^{j} + b_k^i \right) . \end{aligned}$$
(1)

If more than one layer (excluding input x and output layer \({\hat{y}}\)) is employed, the NN is considered a deep NN, and its training process is referred to as deep learning. The evaluation of the NN, i.e., the prediction, is referred to as forward propagation. The quality of prediction is determined by a cost function \(C({\hat{y}})\), which is to be minimized. Its gradients \(\nabla _{\varvec{\theta }} C=\{\nabla _{{\varvec{w}}}C, \nabla _{{\varvec{b}}}C\}\) with respect to the parameters \(\varvec{\theta }\) are obtained with automatic differentiation [51], specifically referred to as backward propagation in the context of NNs. The gradients are used within a gradient-based optimization [44, 52, 53] to update the parameters \(\varvec{\theta }\) and thereby improve the prediction \({\hat{y}}\). Supervised learning relies on labeled data \(x^{{\mathcal {M}}}, y^{{\mathcal {M}}}\) to establish a cost function C, while unsupervised learning does not rely on labeled data. The parameters defining the user-defined training algorithm and NN architecture are referred to as hyperparameters. The concept is summarized by Fig. 2, showing a fully connected multi-layer, i.e., deep, NN. More advanced NN architectures discussed throughout this work are described in Appendix A.
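As a concrete illustration of these ingredients, the following minimal sketch assembles a small fully connected NN in PyTorch and trains it with forward propagation, a mean squared error cost, backward propagation, and a gradient-based parameter update. The one-dimensional data set, architecture, and hyperparameters are illustrative assumptions.

```python
import torch

# Hypothetical labeled data set {x, y}: a 1D relation y = sin(x) with N_M = 128 samples
x_data = torch.linspace(0.0, 3.14, 128).reshape(-1, 1)
y_data = torch.sin(x_data)

# Fully connected NN f_NN(x; theta): nested linear transformations and non-linear activations
model = torch.nn.Sequential(
    torch.nn.Linear(1, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # gradient-based optimizer

for epoch in range(2000):
    optimizer.zero_grad()
    y_hat = model(x_data)                        # forward propagation
    cost = torch.mean((y_hat - y_data) ** 2)     # cost function C (mean squared error)
    cost.backward()                              # backward propagation: gradients w.r.t. theta
    optimizer.step()                             # parameter update
```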

Fig. 2 Conceptual illustration of how NNs, parametrized with weights and biases \(\varvec{\theta }=({\varvec{w}},{\varvec{b}})\), are trained, relying on the backward propagation algorithm to compute the gradients of the cost function C, and how predictions \({\hat{y}}\) are performed via forward propagation. Specifically, a fully connected deep NN is depicted

Notational Remark 1

Data sets are denoted by a superscript \({\mathcal {M}}\), i.e., \(\{ x_i^{\mathcal {M}}, y_i^{\mathcal {M}} \}_{i=1}^{N_{\mathcal {M}}}\), where \(N_{\mathcal {M}}\) is the data set size.

Notational Remark 2

Although x and y may denote vector-valued quantities, we do not use bold-faced notation for them. Instead, this is reserved for all N degrees of freedom within a problem, i.e., \({\varvec{x}} = \{x_i\}_{i=1}^{N}\), \({\varvec{y}} = \{y_i\}_{i=1}^{N}\). This can, for instance, be in the form of a domain \(\Omega \) sampled with N grid points or systems composed of N degrees of freedom. Note, however, that matrices will still be denoted with capital letters in bold face.

Notational Remark 3

A multitude of NN architectures will be discussed throughout this work, for which we introduce abbreviations and subscripts. Most prominent are fully connected NNs \(f_{FNN}\) (FC-NNs) [44, 54], convolutional NNs \(f_{CNN}\) (CNNs) [55,56,57], recurrent NNs \(f_{RNN}\) (RNNs) [58,59,60], and graph NNs \(f_{GNN}\) (GNNs) [61,62,63]. If the network architecture is independent of the method, the network is denoted as \(f_{NN}\).

2 Simulation substitution

In the field of computational mechanics, numerical procedures are developed to solve or discover partial differential equations (PDEs). A generic PDE can be written as

$$\begin{aligned} {\mathcal {N}}[u;\lambda ]=0, \qquad \text {on }\Omega \times {\mathcal {T}}, \end{aligned}$$
(2)

where a non-linear operator \({\mathcal {N}}\) acts on a solution \(u(x,t)\) of a PDE as well as the coefficients \(\lambda (x,t)\) of the PDE in the spatio-temporal domain \(\Omega \times {\mathcal {T}}\). In the forward problem, the solution \(u(x,t)\) is to be computed, while the inverse problem considers either the non-linear operator \({\mathcal {N}}\) or the coefficients \(\lambda (x,t)\) as unknowns.

A further distinction is made between methods treating the temporal dimension t as a continuum, as in space-time approaches [67] (Sects. 2.1.1 and 2.2.1), or as discrete sequential time steps, as in time-stepping procedures (Sects. 2.1.2 and 2.2.2). For simplicity, but without loss of generality, time-stepping procedures will be presented on PDEs with a first-order derivative with respect to time:

$$\begin{aligned} \frac{\partial u}{\partial t} = {\mathcal {N}}^{{\mathcal {T}}}[u;\lambda ], \qquad \text {on }\Omega \times {\mathcal {T}}. \end{aligned}$$
(3)

Here, \({\mathcal {N}}^{{\mathcal {T}}}\) denotes the corresponding non-linear operator. Another task in computational mechanics is the forward modeling and identification of systems of ordinary differential equations (ODEs). For this, we will consider systems of the following form:

$$\begin{aligned} \frac{d{\varvec{x}}(t)}{dt} = {\varvec{f}}({\varvec{x}}(t)). \end{aligned}$$
(4)

Here, \({\varvec{x}}(t)\) are the time-dependent degrees of freedom and \({\varvec{f}}\) is the right-hand side defining the system of equations. Both the forward problem of computing \({\varvec{x}}(t)\) and the inverse problem of identifying \({\varvec{f}}\) will be discussed in the following.

2.1 Data-driven modeling

Data-driven modeling relies entirely on labeled data \(x^{{\mathcal {M}}}, y^{{\mathcal {M}}}\). The NN learns the mapping between \(x^{{\mathcal {M}}}\) and \(y^{{\mathcal {M}}}\) with \({\hat{y}}_i=f_{NN}(x_i;\varvec{\theta })\). Thereby, an interpolation to yet unseen data points is established. A data-driven loss \({\mathcal {L}}_{{\mathcal {D}}}\), such as the mean squared error, can be used as the cost function C:

$$\begin{aligned} C = {\mathcal {L}}_{{\mathcal {D}}}=\frac{1}{2N_{{\mathcal {M}}}} \sum _{i=1}^{N_{{\mathcal {M}}}} ||{\hat{y}}_i - y_i^{{\mathcal {M}}} ||^2_2 \end{aligned}$$
(5)

2.1.1 Space-time approaches

To declutter the notation, but without loss of generality, the temporal dimension t is dropped in this section, as it is possible to treat it like any other spatial dimension x in the scope of these methods. The goal of the upcoming methods is to either learn a forward operator \({\hat{u}}=F[\lambda ; x]\), an inverse operator for the coefficients \({\hat{\lambda }} = I[u; x]\), or an inverse operator for the non-linear operator \(\hat{{\mathcal {N}}} = O[u; \lambda ; x]\). The methods will be explained using the forward operator, but they apply analogously to the inverse operators. Only the inputs and outputs differ.

The solution prediction \({\hat{u}}_i\) at coordinate \(x_i\), or \(\varvec{{\hat{u}}}_i\) on the entire domain \(\Omega \), is made based on a set of coefficients \(\varvec{\lambda }_i\). The cost function C is formulated analogously to Eq. (5):

$$\begin{aligned} C = {\mathcal {L}}_{{\mathcal {D}}}=\frac{1}{2N_{{\mathcal {M}}}} \sum _{i=1}^{N_{{\mathcal {M}}}} || {\hat{u}}_i - u_i^{{\mathcal {M}}} ||_2^2 \qquad \text {or} \qquad C = {\mathcal {L}}_{{\mathcal {D}}}=\frac{1}{2N_{{\mathcal {M}}}} \sum _{i=1}^{N_{{\mathcal {M}}}} || \varvec{{\hat{u}}}_i - {\varvec{u}}_i^{{\mathcal {M}}} ||_2^2. \end{aligned}$$
(6)

2.1.1.1. Fully connected neural networks

The simplest procedure is to approximate the operator F with a FC-NN \(F_{FNN}\).

$$\begin{aligned} {\hat{u}}(x) = F_{FNN}(\varvec{\lambda }; x; \varvec{\theta }) \end{aligned}$$
(7)

Example applications are flow classification [68, 69], fluid flow in turbomachinery [70], dynamic beam displacements from previous measurements [71], wall velocity predictions in turbulence [72], heat transfer [73], prediction of source terms in turbulence models [74], full waveform inversion [75,76,77], and topology optimization based on moving morphable bars [78]. The approach is, however, limited to simple problems, as an abundance of data is required. Therefore, several improvements have been proposed.

2.1.1.2. Image-to-image mapping

One downside of the application of FC-NNs to problems in computational mechanics is that they often need to learn spatial relationships with respect to x from scratch. CNNs inherently account for these spatial relationships due to their kernel-based structure. Therefore, image-to-image mappings using CNNs have been proposed, where an image, i.e., a uniform grid (see Fig. 3) of the coefficients \(\varvec{\lambda }\), is used as input.

$$\begin{aligned} \varvec{{\hat{u}}} = F_{CNN}(\varvec{\lambda };\varvec{\theta }) \end{aligned}$$
(8)

This results in a prediction of the solution \(\varvec{{\hat{u}}}\) throughout the entire image, i.e., the domain.
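A minimal sketch of such an image-to-image mapping is given below, assuming a single-channel coefficient image \(\varvec{\lambda }\) on a uniform 64 x 64 grid; the fully convolutional layout and channel counts are illustrative choices, not prescriptions from the cited works.

```python
import torch

# Hypothetical coefficient field lambda sampled on a uniform 64x64 grid (batch, channel, H, W)
lam = torch.rand(8, 1, 64, 64)

# Simple fully convolutional mapping F_CNN: coefficient image -> solution image
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, kernel_size=3, padding=1),
)

# Predicted solution field on the same grid, shape (8, 1, 64, 64);
# training proceeds with a data-driven loss as in Eq. (6)
u_hat = model(lam)
```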

Fig. 3 Representation of nodes of a Cartesian grid as pixels in an image. Adapted from [79]

Applications include pressure and velocity predictions around airfoils [80,81,82,83], stress predictions from geometries and boundary conditions [84, 85], steady flow predictions [86], detection of manufacturing features [87, 88], full waveform inversion [89,90,91,92,93,94,95,96,97,98,99,100], and topology optimization [101,102,103,104,105,106,107,108,109,110]. An important choice in the design of the learning algorithm is the encoding of the input data. In the case of geometries and boundary conditions, binary representations are the most straightforward approach. These are, however, challenging for CNNs, as discussed in [86]. Signed distance functions [86] or simulations on coarse grids provide superior alternatives. For inverse problems, a forward simulation of an initial guess of the inverse field can be used to encode the desired boundary conditions [105, 108,109,110]. Another possibility for CNNs is a decomposition of the domain. The mapping can be performed on the full domain [111], smaller subdomains [112], or even individual pixels [113]. In the latter two cases, interfaces require special treatment.

The disadvantage of CNN mappings is being constrained to uniform grids on rectangular domains. This can be circumvented by using GNNs acting on graph data, e.g., meshes, such as in [114,115,116], or point cloud-based NNs [117, 118] acting on point cloud data, such as in [119]. Just like CNNs, GNNs operate on the invariant structural elements of the data, which for GNNs are edges connecting vertices (see Appendix A.2) instead of pixels aligned on a structured grid for CNNs. In fact, GNNs can be regarded as a generalization of CNNs since they can handle a broader class of data structures, i.e., graphs (including images). This comes at the cost of less efficient implementations when compared to pure CNNs.

2.1.1.3. Model order reduction encoding

Independent of the NN architecture, learning can be aided by letting the NN operate on a lower-dimensional space that is able to capture the data. For complex problems, mappings e to low-dimensional spaces (also referred to as latent spaces or latent vectors) \({\varvec{h}}\) can be identified with model order reduction techniques. Thus, in the case of simulation substitution, a low-dimensional encoding \({\varvec{h}}^\lambda =e(\varvec{\lambda })\) of \(\varvec{\lambda }\) (sampled on all sample points \({\varvec{x}}\)) is identified. This is provided as input to a NN, which predicts the latent representation \({\varvec{h}}^u\) of the solution. The full solution field \({\varvec{u}}\) (on all sample points \({\varvec{x}}\)) is obtained in a decoding step \(d=e^{-1}\). The prediction is given as

$$\begin{aligned} \varvec{{\hat{u}}} = d(\varvec{{\hat{h}}}^u) = d\bigl (F_{NN}({\varvec{h}}^\lambda ; \varvec{\theta })\bigr ) = d\Bigl (F_{NN}\bigl (e(\varvec{\lambda });\varvec{\theta }\bigr )\Bigr ). \end{aligned}$$
(9)

The dimensional reduction can, e.g., be performed with principal component analysis [120, 121], as proposed in [122], proper orthogonal decomposition [123], or reduced manifold learning [124]. These techniques have been applied to learning aortic wall stresses [125], arterial wall stresses [126], flow velocities in viscoplastic flow [127], and the inverse problem of identifying unpressurized geometries from pressurized geometries [128]. Currently, the most impressive results in data-driven surrogate modeling are achieved with model order reduction encodings combined with NNs [129, 130], which can be combined with most other methodologies presented in this work.
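The following sketch illustrates the encode-predict-decode structure of Eq. (9), with a truncated SVD as a stand-in for the model order reduction step; the snapshot data, number of retained modes, and network size are assumptions for demonstration purposes.

```python
import torch

# Hypothetical snapshot matrices: N_M = 200 samples of lambda and u on N = 1000 grid points
lam_snap = torch.rand(200, 1000)
u_snap = torch.rand(200, 1000)

# Linear encodings via truncated SVD (a stand-in for POD/PCA), keeping r modes
r = 10
_, _, Vh_lam = torch.linalg.svd(lam_snap, full_matrices=False)
_, _, Vh_u = torch.linalg.svd(u_snap, full_matrices=False)
V_lam, V_u = Vh_lam[:r].T, Vh_u[:r].T           # basis vectors as columns, shape (1000, r)

h_lam = lam_snap @ V_lam                        # encode: latent coefficients h^lambda
h_u = u_snap @ V_u                              # encode: latent solution h^u (training targets)

# Small FC-NN F_NN mapping latent lambda -> latent u (Eq. 9); training loop as in Sect. 1.3
model = torch.nn.Sequential(torch.nn.Linear(r, 32), torch.nn.Tanh(), torch.nn.Linear(32, r))
u_hat = model(h_lam) @ V_u.T                    # decode: full-field prediction, shape (200, 1000)
```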

Another dimensionality reduction technique is the autoencoder [131], where e and d are modeled by NNs. Autoencoders are treated in detail in Appendix B.1 and enable non-linear encodings. An early investigation is presented in [132], where proper orthogonal decomposition is related to NNs. Application areas are the prediction of designs of acoustic scatterers from the reduced latent space [133], or mappings from dynamic responses of bridges to damage [134]. Furthermore, many of the image-to-image mapping techniques rely on NN architectures inspired by autoencoders, such as U-nets [135, 136].

2.1.1.4. Neural operators

The most recent trend in surrogate modeling with NNs is neural operators [137], which map between function spaces instead of functions. Neural operators rely on the extension of the universal approximation theorem [50] to non-linear operators [138]. The two most prominent neural operators are DeepONets [139] and Fourier neural operators [140].

DeepONet

In DeepONets [139], illustrated in Fig. 4, the task of predicting the operator \({\hat{u}}(\varvec{\lambda }; x)\) is split up into two sub-tasks:

  • the prediction of \(N_P\) basis functions \(\varvec{{\hat{t}}}(x)\) (TrunkNet),

  • the prediction of the corresponding \(N_P\) problem-specific coefficients \(\varvec{{\hat{b}}}(\varvec{\lambda })\) (BranchNet).

The basis is predicted by the TrunkNet with parameters \(\varvec{\theta }^T\) via an evaluation at coordinates x. The basis coefficients \(\varvec{{\hat{b}}}\) are estimated from the PDE coefficients \(\varvec{\lambda }\) using the BranchNet, parametrized by \(\varvec{\theta }^B\), and are thus specific to the problem being solved. Taking the dot product over the evaluated basis and the coefficients yields the solution prediction \({\hat{u}}(\varvec{\lambda }; x)\).

$$\begin{aligned} \varvec{{\hat{t}}}(x)&= F^T_{FNN}(x;\varvec{\theta }^T) \end{aligned}$$
(10)
$$\begin{aligned} \varvec{{\hat{b}}}(\varvec{\lambda })&= F^B_{FNN}(\varvec{\lambda }; \varvec{\theta }^B) \end{aligned}$$
(11)
$$\begin{aligned} {\hat{u}}(x)&= \varvec{{\hat{b}}}(\varvec{\lambda }) \cdot \varvec{{\hat{t}}}(x) \end{aligned}$$
(12)
Fig. 4 DeepONet: operator learning via prediction of the basis functions \(\varvec{{\hat{t}}}\) and the corresponding coefficients \(\varvec{{\hat{b}}}\) [139]
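A minimal DeepONet sketch following Eqs. (10)-(12) is given below; the number of sensor points for \(\varvec{\lambda }\), the number of basis functions, and the layer sizes are illustrative assumptions.

```python
import torch

class DeepONet(torch.nn.Module):
    """Minimal DeepONet sketch: dot product of branch (coefficients) and trunk (basis) outputs."""

    def __init__(self, n_sensors=100, n_basis=32):
        super().__init__()
        # BranchNet: PDE coefficients lambda sampled at n_sensors points -> basis coefficients b
        self.branch = torch.nn.Sequential(
            torch.nn.Linear(n_sensors, 64), torch.nn.Tanh(), torch.nn.Linear(64, n_basis))
        # TrunkNet: coordinate x -> basis function evaluations t(x)
        self.trunk = torch.nn.Sequential(
            torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, n_basis))

    def forward(self, lam, x):
        b = self.branch(lam)          # (batch, n_basis), Eq. (11)
        t = self.trunk(x)             # (batch, n_basis), Eq. (10)
        return torch.sum(b * t, dim=-1, keepdim=True)   # u_hat(lambda; x), Eq. (12)

# Usage with hypothetical inputs: 16 coefficient samples, each evaluated at one coordinate
model = DeepONet()
u_hat = model(torch.rand(16, 100), torch.rand(16, 1))
```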

Applications can be found in [141,142,143,144,145,146,147,148,149,150,151,152,153]. DeepONets have also been extended with physics-informed loss functions [154,155,156].

Fourier neural operators

Fourier neural operators [140] predict the solution \(\varvec{{\hat{u}}}\) on a uniform grid \({\varvec{x}}\) from the spatially varying coefficients \(\varvec{\lambda }=\lambda ({\varvec{x}})\). As the aim is to learn a mapping between functions sampled on the entire domain, non-local mappings can be performed at each layer [157], for example, integral kernels [158, 159], Laplace transformations [160], or Fourier transforms [140]. These transformations enhance the non-local expressivity of the NN [157], where Fourier transforms are particularly favorable due to the computational efficiency achievable through fast Fourier transforms.

The Fourier neural operator, as illustrated in Fig. 5, consists of Fourier layers, where linear transformations \({\varvec{K}}\) are performed after Fourier transforms \({\mathcal {F}}\) along the spatial dimensions x. Subsequently, an inverse Fourier transform \({\mathcal {F}}^{-1}\) is applied, which is added to the output of a linear transformation \({\varvec{W}}\) performed outside the Fourier space. Thus, the Fourier transform can be skipped by the NN. The final step is an activation function \(\sigma \). The manipulations within a Fourier layer to predict the next activation on the uniform grid \({\varvec{a}}^{(j+1)}({\varvec{x}})\) can be written as

$$\begin{aligned} {\varvec{a}}^{(j+1)}({\varvec{x}}) = \sigma \left( {\varvec{W}}{\varvec{a}}^{(j)}({\varvec{x}})+{\varvec{b}} + {\mathcal {F}}^{-1}\Bigl [{\varvec{K}}{\mathcal {F}}\bigl [{\varvec{a}}^{(j)}({\varvec{x}})\bigr ]\Bigr ] \right) , \end{aligned}$$
(13)

where \({\varvec{b}}\) is the bias. Both the linear transformations \({\varvec{K}}, {\varvec{W}}\) and the bias \({\varvec{b}}\) are learnable and thereby part of the parameters \(\varvec{\theta }\). Multiple Fourier layers can be employed, typically used in combination with an encoding network \(P_{NN}\) and a decoding network \(Q_{NN}\).
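A sketch of a single one-dimensional Fourier layer implementing Eq. (13) is shown below; the channel count, the number of retained Fourier modes, and the choice of activation are assumptions, and the encoding and decoding networks \(P_{NN}\), \(Q_{NN}\) are omitted.

```python
import torch

class FourierLayer1d(torch.nn.Module):
    """Minimal 1D Fourier layer sketch following Eq. (13); sizes are illustrative assumptions."""

    def __init__(self, channels=16, modes=8):
        super().__init__()
        self.modes = modes
        # Learnable complex weights K acting on the lowest Fourier modes (out, in, modes)
        self.K = torch.nn.Parameter(torch.randn(channels, channels, modes, dtype=torch.cfloat) * 0.02)
        # Learnable linear transformation W outside the Fourier space (bias b included here)
        self.W = torch.nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, a):                        # a: (batch, channels, n_grid)
        a_ft = torch.fft.rfft(a, dim=-1)         # Fourier transform along the spatial dimension
        out_ft = torch.zeros_like(a_ft)
        out_ft[..., :self.modes] = torch.einsum(
            "bix,oix->box", a_ft[..., :self.modes], self.K)       # K F[a] on retained modes
        spectral = torch.fft.irfft(out_ft, n=a.shape[-1], dim=-1)  # inverse Fourier transform
        return torch.relu(self.W(a) + spectral)  # sigma(W a + b + F^-1[K F[a]])

layer = FourierLayer1d()
a_next = layer(torch.rand(4, 16, 64))            # one Fourier layer applied on a uniform grid
```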

Fig. 5 Fourier neural operator: operator learning in the Fourier space [140]

Applications can be found in [161,162,163,164,165,166,167,168,169,170,171]. An extension relying on the attention mechanisms of transformers [172] is presented in [173]. Analogously to DeepONets, Fourier neural operators have been combined with physics-informed loss functions [174].

2.1.1.5. Neural network approximation power

Despite the advancements in NN architectures, NN surrogates struggle to learn solutions of general PDEs. Typically, successes have only been achieved for parametrized PDEs with relatively small parameter spaces—or in cases where accuracy, reliability, or generalization were disregarded. It has, however, been shown—both for simple architectures such as FC-NNs [175, 176] as well as for advanced architectures such as DeepONets [177]—that NNs possess an excellent theoretical approximation power which can capture solutions of various PDEs. Currently, there are two obstacles that impede the identification of sufficiently good optima within these desirable NN parameter spaces [175]:

  • training data: generalization error,

  • training algorithm: optimization error.

A lack of sufficient training data leads to poor generalization. This might be alleviated through faster data generation using, e.g., faster and specialized classical methods [178], or through improved sampling strategies, i.e., finding the minimum number of required data points, distributed in a specific manner, to train the surrogate. Additionally, current training algorithms only converge to local optima. Research into improved optimization algorithms, such as current trends in computing better initial weights [179] and thereby better local optima, attempts to reduce the optimization error. At the same time, such advances drastically reduce training times, increasing the competitiveness of NN surrogates.

2.1.2 Time-stepping procedures

For the time-stepping procedures, we will consider Eqs. (3) and (4) in the following.

2.1.2.1. Recurrent neural networks

The simplest approach to modeling time series data is to use FC-NNs to predict the solution at the next time step \(t_{i+1}\) from the current time step \(t_i\):

$$\begin{aligned} {\hat{u}}(x,t_{i+1}) = F_{FNN}\bigl (x,t_i;u(x,t_i);\varvec{\theta }\bigr ). \end{aligned}$$
(14)

However, this approach lacks the ability to capture the temporal dependencies between different time steps, as each input is treated independently and without considering more than just the previous time step. Incorporating the sequential nature of the data can be achieved directly with RNNs. RNNs maintain a hidden state which captures information from the previous time steps, to be used for the next time step prediction. By unrolling the RNN, the entire time-history can be predicted.

$$\begin{aligned} \{{\hat{u}}(x,t_2),{\hat{u}}(x,t_3),\dots ,{\hat{u}}(x,t_N)\} = F_{RNN}(x;u(x,t_1);\varvec{\theta }) \end{aligned}$$
(15)

Shortcomings of RNNs, such as their tendency to struggle with learning long-term dependencies due to the problem of vanishing or exploding gradients, have been addressed by more sophisticated architectures such as long short-term memory networks (LSTMs) [59], gated recurrent unit networks (GRUs) [180], and transformers [172] (see [181] for a recent contribution on transformers for thermal analysis in additive manufacturing). The concept of recurrent units has also been combined with other architectures, as demonstrated for CNNs [182] and GNNs [114, 115, 183,184,185,186,187].
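The following sketch illustrates the autoregressive rollout of Eq. (15) with an LSTM cell that maintains a hidden state between time steps; the state dimension, hidden size, and initial condition are placeholders.

```python
import torch

# Autoregressive rollout sketch: predict a time history from an initial state with an LSTM cell
state_dim, hidden_dim, n_steps = 3, 32, 50

cell = torch.nn.LSTMCell(state_dim, hidden_dim)        # maintains the hidden state (h, c)
readout = torch.nn.Linear(hidden_dim, state_dim)       # maps hidden state to the next state

u = torch.rand(1, state_dim)                           # placeholder initial condition u(t_1)
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)

trajectory = [u]
for _ in range(n_steps - 1):
    h, c = cell(u, (h, c))                             # update hidden state from current state
    u = readout(h)                                     # predict u(t_{i+1})
    trajectory.append(u)

trajectory = torch.cat(trajectory, dim=0)              # predicted time history, (n_steps, state_dim)
```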

Further applications of RNNs are full waveform inversion [188,189,190], high-dimensional chaotic systems [191], fluid flow [40, 192], fracture propagation [116], sensor signals in non-linear dynamic systems [193, 194], and settlement field predictions induced by tunneling [195], which was extended to damage prediction in affected structures [196, 197]. RNNs are often combined with reduced order model encodings [198], where the dynamics are predicted on the reduced latent space, as demonstrated in [199,200,201,202,203,204,205]. Further variations employ classical time-stepping schemes on the reduced latent space obtained by autoencoders [206, 207].

2.1.2.2. Dynamic mode decomposition

Another approach, formulated for system dynamics, i.e., Eq. (4), is dynamic mode decomposition (DMD) [208, 209]. The aim of DMD is to identify a linear operator \({\varvec{A}}\) that relates two successive snapshot matrices with n time steps \({\varvec{X}}=[{\varvec{x}}(t_1),{\varvec{x}}(t_2),\dots ,{\varvec{x}}(t_n)]^T, {\varvec{X}}'=[{\varvec{x}}(t_2),{\varvec{x}}(t_3),\dots ,{\varvec{x}}(t_{n+1})]^T\):

$$\begin{aligned} {\varvec{X}}'\approx {\varvec{A}} {\varvec{X}}. \end{aligned}$$
(16)

To solve this, the problem is reframed as a regression task. The operator \({\varvec{A}}\) is approximated by minimizing the Frobenius norm of the difference between \({\varvec{X}}'\) and \({\varvec{A}}{\varvec{X}}\). This minimization can be performed using the Moore-Penrose pseudoinverse \({\varvec{X}}^\dagger \) (see, e.g., [38]):

$$\begin{aligned} {\varvec{A}}=\underset{{{\varvec{A}}}}{\text {arg min} }||{\varvec{X}}'-{\varvec{A}}{\varvec{X}}||_F={\varvec{X}}'{\varvec{X}}^\dagger . \end{aligned}$$
(17)

Once the operator is identified, it can be used to propagate the dynamics forward in time, approximating the next state \({\varvec{x}}(t_{i+1})\) using the current state \({\varvec{x}}(t_i)\):

$$\begin{aligned} {\varvec{x}}(t_{i+1}) \approx {\varvec{A}} {\varvec{x}}(t_i). \end{aligned}$$
(18)
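A compact sketch of the DMD regression in Eqs. (16)-(18) is given below; for brevity, the snapshots are stored as columns (i.e., transposed with respect to Eq. (16)) and the data is random placeholder data.

```python
import numpy as np

# DMD sketch on hypothetical snapshot data: n + 1 = 51 states of a 3-dimensional system
states = np.random.rand(3, 51)           # columns are x(t_1), ..., x(t_51)
X, X_prime = states[:, :-1], states[:, 1:]

# Least-squares fit of the linear operator A (Eq. 17) via the Moore-Penrose pseudoinverse
A = X_prime @ np.linalg.pinv(X)

# Forward propagation of the dynamics (Eq. 18)
x_next = A @ states[:, -1]
```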

This framework is, however, only valid for linear dynamics. DMD can be extended to handle non-linear systems through the application of Koopman operator theory [210]. According to Koopman operator theory, it is possible to represent a non-linear system as a linear one by using an infinite-dimensional Koopman operator \({\mathcal {K}}\) that acts on a transformed state \(e({\varvec{x}}(t_i))\):

$$\begin{aligned} e({\varvec{x}}(t_{i+1})) = {\mathcal {K}} [e({\varvec{x}}(t_{i}))]. \end{aligned}$$
(19)

In theory, the Koopman operator \({\mathcal {K}}\) is an infinite-dimensional linear transformation. In practice, however, finite-dimensional approximations are employed. This approach is, for example, utilized in the extended DMD [211], where the regression from Eq. (17) is performed on a higher-dimensional state \({\varvec{h}}(t_{i}) = e({\varvec{x}}(t_{i}))\) relying on a dictionary of orthonormal basis functions \({\varvec{h}}(t_i)=\varvec{\psi }({\varvec{x}}(t_i))\). Alternatively, the dictionary can be learned using NNs, i.e., \(\varvec{{\hat{\psi }}}({\varvec{x}})=\psi _{NN}({\varvec{x}};\varvec{\theta })\), as demonstrated in [212, 213]. The NN is trained by minimizing the mismatch between the predicted state \(\varvec{{\hat{\psi }}}({\varvec{x}}(t_{i+1}))={\varvec{A}} \varvec{{\hat{\psi }}}({\varvec{x}}(t_i))\) (Eq. 18) and the true state in the dictionary space. Orthogonality is not required and therefore not enforced.

$$\begin{aligned} C = \frac{1}{2N}\sum _{i=1}^N ||\varvec{{\hat{\psi }}}({\varvec{x}}(t_{i+1})) - {\varvec{A}} \varvec{{\hat{\psi }}}({\varvec{x}}(t_i))||_2^2 \end{aligned}$$
(20)

When the dictionary is learned, the state predictions can be reconstructed using the Koopman mode decomposition, as explained in detail in [212].

Alternatively, the mapping to the augmented state can be performed with autoencoders, which at the same time allows for a direct map back to the original space [214,215,216,217]. Thus, an encoder learns a reduced latent space \(\varvec{{\hat{h}}}({\varvec{x}})=e_{NN}({\varvec{x}};\varvec{\theta }^e)\) and a decoder learns the inverse mapping \(\varvec{{\hat{x}}}({\varvec{h}})=d_{NN}({\varvec{h}};\varvec{\theta }^d)\). The networks are trained using three losses: the autoencoder reconstruction loss \({\mathcal {L}}_{{\mathcal {A}}}\), the linear dynamics loss \({\mathcal {L}}_{{\mathcal {R}}}\), and the future state prediction loss \({\mathcal {L}}_{{\mathcal {F}}}\).

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {A}}}&= \frac{1}{2 (n+1)} \sum _{i=1}^{n+1} ||{\varvec{x}}(t_i) - d_{NN}(e_{NN}({\varvec{x}}(t_i);\varvec{\theta }^e);\varvec{\theta }^d)||_2^2 \end{aligned}$$
(21)
$$\begin{aligned} {\mathcal {L}}_{{\mathcal {R}}}&= \frac{1}{2n}\sum _{i=1}^n ||e_{NN}({\varvec{x}}(t_{i+1});\varvec{\theta }^e) - {\varvec{A}} e_{NN}({\varvec{x}}(t_i);\varvec{\theta }^e)||_2^2 \end{aligned}$$
(22)
$$\begin{aligned} {\mathcal {L}}_{{\mathcal {F}}}&= \frac{1}{2n}\sum _{i=1}^n||{\varvec{x}}(t_{i+1}) - d_{NN}({\varvec{A}} e_{NN}({\varvec{x}}(t_i);\varvec{\theta }^e);\varvec{\theta }^d)||_2^2 \end{aligned}$$
(23)
$$\begin{aligned} C&= \kappa _{{\mathcal {A}}} {\mathcal {L}}_{{\mathcal {A}}} + \kappa _{{\mathcal {R}}} {\mathcal {L}}_{{\mathcal {R}}} + \kappa _{{\mathcal {F}}} {\mathcal {L}}_{{\mathcal {F}}} \end{aligned}$$
(24)

The cost function C is composed of a weighted sum of the loss terms \({\mathcal {L}}_{{\mathcal {A}}},{\mathcal {L}}_{{\mathcal {R}}},{\mathcal {L}}_{{\mathcal {F}}}\) and weighting terms \(\kappa _{{\mathcal {A}}},\kappa _{{\mathcal {R}}},\kappa _{{\mathcal {F}}}\). Furthermore, [216] allows \({\varvec{A}}\) to vary depending on the state. This is achieved by predicting the eigenvalues of \({\varvec{A}}\) with an auxiliary network and constructing the matrix from these.
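The three losses of Eqs. (21)-(23) can be assembled as in the following sketch, where the encoder, decoder, and linear operator \({\varvec{A}}\) are small placeholder networks and the weighting terms are simply set to one.

```python
import torch

# Koopman autoencoder loss sketch (Eqs. 21-24); dimensions, data, and weights are assumptions
n_dof, latent_dim = 10, 4
encoder = torch.nn.Sequential(torch.nn.Linear(n_dof, 32), torch.nn.Tanh(), torch.nn.Linear(32, latent_dim))
decoder = torch.nn.Sequential(torch.nn.Linear(latent_dim, 32), torch.nn.Tanh(), torch.nn.Linear(32, n_dof))
A = torch.nn.Linear(latent_dim, latent_dim, bias=False)   # linear operator on the latent space

x = torch.rand(100, n_dof)        # hypothetical snapshots x(t_1), ..., x(t_100) as rows
x_now, x_next = x[:-1], x[1:]

loss_rec = torch.mean((x - decoder(encoder(x))) ** 2)                  # L_A: reconstruction
loss_lin = torch.mean((encoder(x_next) - A(encoder(x_now))) ** 2)      # L_R: linear dynamics
loss_pred = torch.mean((x_next - decoder(A(encoder(x_now)))) ** 2)     # L_F: future state
cost = 1.0 * loss_rec + 1.0 * loss_lin + 1.0 * loss_pred               # weighted sum (Eq. 24)
```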

2.1.3 Active learning and transfer learning

Finally, an important machine learning technique independent of the NN architecture and applicable to both space-time and time-stepping approaches is active learning [218]. Instead of precomputing a labeled data set, data is only provided when the prediction quality of the NN is insufficient. Furthermore, the data is not chosen arbitrarily, but only in the vicinity of the failed prediction. In computational mechanics, the prediction of the NN can be assessed with an error indicator. If the prediction is insufficient, the results of a classical simulation are used to retrain the NN. Over time, the NN estimates improve in the respective domain of application. Due to the error indicator and the classical simulations, the predictions are reliable. Examples for active learning in computational mechanics can be found in [219,220,221].

Another technique, transfer learning [222, 223], aims at accelerating the NN training. Here, the NN is first trained on a similar task. Subsequently, it is applied to the task of interest—where it converges faster than an untrained NN. Applications in computational mechanics can be found in [98, 224].

2.2 Physics-informed learning

In supervised learning, as discussed in Sect. 2.1, the quality of prediction strongly depends on the amount of training data. Acquiring data in computational mechanics may be expensive. To reduce the amount of required data, constraints enforcing the physics have been proposed. Two main approaches exist  [43, 225]. The physics can be enforced by modifying the cost function through a penalty term punishing unphysical predictions, thus acting as a regularizer. Possible modifications are discussed in the upcoming section. Alternatively, the physics can be enforced by construction, i.e., by reducing the learnable space to a physically meaningful space. This approach is highly specific to its application and will therefore mainly be explored in Sect. 3. A brief coverage is provided in Sect. 2.2.3.

Both approaches can be found in overview publications, where [43] defines four overarching methodologies: (i) augmentation of training data using prior knowledge, (ii) modification of the model, i.e., enforcement by construction, (iii) enhancement of the learning algorithm with regularization terms, i.e., enforcing constraints through the cost function, and (iv) checking the final estimate and thereby discarding physical violations (using, e.g., error indicators). The two most prominent methodologies, i.e., modifying the cost function and enforcement by construction, are similarly mentioned in [225], which correspondingly refers to them as physics-informed and physics-augmented. Further variations in terminology can be found in [182, 226], who refer to physics-informed NNs for multiple solutions as physics-constrained deep learning, or in [227], which uses the term physics-enhanced NNs for NNs enforcing the physics by construction. Due to the many names within this relatively new and interconnected field, we cover the variations under the overarching term of physics-informed learning.

2.2.1 Space-time approaches

Once again and without loss of generality, the temporal dimension t is dropped to declutter the notation. However, in contrast to Sect. 2.1.1, the following methods are not equally applicable to forward and inverse problems. Thus, the prediction of the solution \({\hat{u}}\), the PDE coefficients \({\hat{\lambda }}\), and the non-linear operator \({\mathcal {N}}\) are treated separately.

2.2.1.1. Differential equation solving with neural networks

The concept of solving PDEs with NNs was first proposed in the 1990s [8,9,10], but was recently popularized by the so-called physics-informed neural networks (PINNs) [228] (see [229,230,231] for recent review articles and SciANN [232], SimNet [233], and DeepXDE [234] for libraries).

To illustrate the idea and variations of PINNs, we will consider the differential equation of a static elastic bar

$$\begin{aligned} \frac{d}{dx}\left( EA\frac{du}{dx}\right) +p=0, \qquad x\in \Omega . \end{aligned}$$
(25)

Here, the operator \({\mathcal {N}}\) is given by the left-hand side of the equation, the solution u(x) is the axial displacement, and the spatially varying coefficients \(\lambda (x)\) are given by the cross-sectional properties EA(x) and the distributed load p(x). Additionally, boundary conditions are specified, which can be in terms of Dirichlet (on \(\Gamma _D\)) or Neumann boundary conditions (on \(\Gamma _N\)):

$$\begin{aligned} u(x)&= g(x), \qquad x\in \Gamma _D, \end{aligned}$$
(26)
$$\begin{aligned} EA(x)\frac{du(x)}{dx}&= f(x), \qquad x\in \Gamma _N. \end{aligned}$$
(27)

Physics-informed neural networks

PINNs [228] approximate either the solution u(x), the coefficients \(\lambda (x)\), or both with FC-NNs.

$$\begin{aligned} {\hat{u}}(x)&= F_{FNN}(x; \varvec{\theta }^u) \end{aligned}$$
(28)
$$\begin{aligned} {\hat{\lambda }}(x)&= I_{FNN}(x; \varvec{\theta }^\lambda ) \end{aligned}$$
(29)

Instead of training the network with labeled data as in Eq. (6), the residual of the PDE is considered. The residual is evaluated at a set of \(N_{{\mathcal {N}}}\) points, called collocation points. Taking the mean squared error over the residual evaluations yields the PDE loss

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {N}}} = \frac{1}{2N_{{\mathcal {N}}}} \sum _{i=1}^{N_{{\mathcal {N}}}} ||{\mathcal {N}}[u(x_i); \lambda (x_i)]||^2_2 = \frac{1}{2N_{{\mathcal {N}}}} \sum _{i=1}^{N_{{\mathcal {N}}}} \left( \frac{d}{dx}\left( EA(x_i)\frac{du(x_i)}{dx}\right) +p(x_i) \right) ^2. \end{aligned}$$
(30)

The gradients of the possible predictions, i.e., u, EA, and p, with respect to x are obtained with automatic differentiation [51] through the NN approximation. Similarly, the boundary conditions are enforced at the \(N_{{\mathcal {B}}_D}+N_{{\mathcal {B}}_N}\) boundary points.

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {B}}} = \frac{1}{2N_{{\mathcal {B}}_D}} \sum _{i=1}^{N_{{\mathcal {B}}_D}} (u(x_i) - g)^2 + \frac{1}{2N_{{\mathcal {B}}_N}} \sum _{i=1}^{N_{{\mathcal {B}}_N}} \left( EA(x_i) \frac{du(x_i)}{dx} - f \right) ^2 \end{aligned}$$
(31)

The cost function is composed of the PDE loss \({\mathcal {L}}_{\mathcal {N}}\), boundary loss \({\mathcal {L}}_{\mathcal {B}}\), and possibly a data-driven loss \({\mathcal {L}}_{\mathcal {D}}\)

$$\begin{aligned} C= {\mathcal {L}}_{{\mathcal {N}}} + {\mathcal {L}}_{{\mathcal {B}}} + {\mathcal {L}}_{{\mathcal {D}}}. \end{aligned}$$
(32)

Both the deep least-squares method [235] and the deep Galerkin method [236] are closely related. Instead of focusing on the residuals at individual collocation points as in PINNs, these methods consider the \(L^2\)-norm of the residuals integrated over the domain \(\Omega \).
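To make the procedure concrete, the following sketch trains a PINN for the bar equation (25) under the assumptions \(EA=1\), \(p(x)=1\), \(u(0)=0\), and \(EA\,u'(1)=0\), for which the exact solution is \(u(x)=x-x^2/2\); network size, optimizer, and collocation points are arbitrary choices.

```python
import torch

# PINN sketch for the static bar (Eq. 25) with assumed data: EA = 1, p(x) = 1,
# Dirichlet BC u(0) = 0, Neumann BC EA u'(1) = 0; exact solution u = x - x^2/2.
model = torch.nn.Sequential(torch.nn.Linear(1, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.linspace(0.0, 1.0, 50).reshape(-1, 1).requires_grad_(True)   # collocation points

for epoch in range(5000):
    optimizer.zero_grad()
    u = model(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]     # du/dx
    ddu = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]  # d/dx(EA du/dx)
    loss_pde = torch.mean((ddu + 1.0) ** 2)                           # L_N, Eq. (30)
    loss_bc = model(torch.zeros(1, 1))[0, 0] ** 2 + du[-1, 0] ** 2    # L_B, Eq. (31)
    cost = loss_pde + loss_bc                                         # Eq. (32), no data term
    cost.backward()
    optimizer.step()
```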

Variational physics-informed neural networks

Computing high-order derivatives for the non-linear operator \({\mathcal {N}}\) is expensive. Therefore, variational PINNs [237, 238] consider the weak form of the PDE, which lowers the order of differentiation. In the case of the bar equation, the weak PDE loss is given by

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {V}}_i}&= \int _\Omega \frac{dw_i(x)}{dx} EA(x) \frac{du(x)}{dx}d\Omega \nonumber \\&\quad - \int _{\Gamma _N} w_i(x) EA(x) \frac{du(x)}{dx} d\Gamma _N \nonumber \\&\quad - \int _\Omega w_i(x) p(x) d\Omega =0, \forall w_i(x), \end{aligned}$$
(33)
$$\begin{aligned} {\mathcal {L}}_{\mathcal {V}}&= \frac{1}{N_{\mathcal {V}}} \sum _{i}^{N_{\mathcal {V}}} {\mathcal {L}}_{{\mathcal {V}}_i}. \end{aligned}$$
(34)

In [237], \(N_{{\mathcal {V}}}\) trigonometric and polynomial test functions \(w_i(x)\) are used. The cost function is obtained by replacing the PDE loss \({\mathcal {L}}_{\mathcal {N}}\) with the weak PDE loss \({\mathcal {L}}_{\mathcal {V}}\) in Eq. (32). Note that the Neumann boundary conditions are now not included in the boundary loss \({\mathcal {L}}_{\mathcal {B}}\), as they are already incorporated in the weak form in Eq. (33). The integrals are evaluated through numerical integration methods, such as Gaussian quadrature, Monte Carlo integration methods [239, 240], or sparse grid quadratures [241]. Severe inaccuracies can be introduced through the numerical integration of the NN output—for which remedies have been proposed in [242].

Weak adversarial networks

Instead of specifying the test functions w(x), weak adversarial networks [243] employ a second NN as test function

$$\begin{aligned} {\hat{w}}(x) = W_{FNN}(x;\varvec{\theta }^w). \end{aligned}$$
(35)

The test function is learned through a minimax optimization

$$\begin{aligned} \min _{\varvec{\theta }^u} \max _{\varvec{\theta }^w} C, \end{aligned}$$
(36)

where the test function w(x) continually challenges the solution u(x).

Deep energy method and deep Ritz method

The deep energy method [244] and the deep Ritz method [245] overcome the need for test functions by instead minimizing the potential energy \(\Pi =\Pi _i+\Pi _e\). This results in the following loss term

$$\begin{aligned} {\mathcal {L}}_{\mathcal {E}} = \Pi _i+\Pi _e=\frac{1}{2}\int _\Omega EA(x) \left( \frac{du(x)}{dx} \right) ^2 d\Omega - \int _\Gamma u(x) EA(x) \frac{du(x)}{dx} d\Gamma - \int _\Omega u(x) p(x) d\Omega . \end{aligned}$$
(37)

Note that the inverse problem generally cannot be solved using the minimization of the potential energy. Consider, for instance, the potential energy of the bar equation in Eq. (37), which is not well-posed in the inverse setting. Here, EA(x) going towards \(-\infty \) in the domain \(\Omega \) and going towards \(\infty \) at \(\Gamma _N\) minimizes the potential energy \({\mathcal {L}}_{{\mathcal {E}}}\).

Extensions

A multitude of extensions to the PINN methodology exist. For in-depth reviews, see [229,230,231].

Learning multiple solutions

Currently, PINNs are mainly employed to learn a single solution. As the training effort exceeds the solving effort of classical solvers, the viability of PINNs is questionable [246]. However, PINNs can also be employed to learn multiple solutions. This is achieved by providing the parametrization of the PDE, i.e., \(\lambda \), as an additional input to the network, as discussed in Sect. 2.1. This enables a cheap prediction stage without retraining for new solutions. One possible example for this is [247], where different geometries are captured in terms of point clouds and processed with point cloud-based NNs [117].

Boundary conditions

The enforcement of the boundary conditions through a penalty term \({\mathcal {L}}_{{\mathcal {B}}}\) in Eq. (31) leads to an unbalanced optimization, due to the competing loss terms \({\mathcal {L}}_{{\mathcal {N}}}, {\mathcal {L}}_{{\mathcal {B}}}, {\mathcal {L}}_{{\mathcal {D}}}\) in Eq. (32). One remedy is to modify the NN output \(F_{FNN}\) by multiplication with a function, such that the Dirichlet boundary conditions are satisfied a priori, i.e., \({\mathcal {L}}_{{\mathcal {B}}}=0\), as demonstrated in [37, 248].

$$\begin{aligned} {\hat{u}}(x) = G(x) + D(x) F_{FNN}(x;\varvec{\theta }^u) \end{aligned}$$
(38)

Here, G(x) is a smooth interpolation of the boundary conditions, and D(x) is a signed distance function that is zero at the boundary. For Neumann boundary conditions, [249] propose to predict u and its derivatives \(\partial u/\partial x\) with separate networks, such that the Neumann boundary conditions can be enforced strongly by modifying the derivative network. This requires an additional constraint, ensuring that the derivative predictions match the derivative of u. For complex domains, G(x) and D(x) cannot be found analytically. Therefore, [248] use NNs to learn G(x) and D(x) in a supervised manner by prescribing either the boundary values or zero at the boundary and restricting the values within the domain to be non-zero. Similarly, [250] proposed using radial basis function networks for G(x), where \(D(x)=1\) is assumed. The radial basis function networks are determined by solving a linear system of equations constructed with the boundary conditions. On uniform grids, strong enforcement can be achieved through specialized CNN kernels [204] with constant padding terms for Dirichlet boundary conditions and ghost cells for Neumann boundary conditions. Constrained backward propagation [251] has also been proposed to guarantee the enforcement of boundary conditions [252, 253].
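A minimal sketch of the strong enforcement in Eq. (38) for the bar on [0, 1] with a prescribed Dirichlet value at x = 0 is given below, where G(x) is chosen as the constant boundary value and D(x) = x vanishes at the Dirichlet boundary; both choices are illustrative assumptions.

```python
import torch

# Strong Dirichlet enforcement via Eq. (38): u_hat(x) = G(x) + D(x) * F_FNN(x)
net = torch.nn.Sequential(torch.nn.Linear(1, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
g0 = 0.0                                # assumed boundary value u(0) = g0

def u_hat(x):
    # G(x) = g0 (constant interpolation), D(x) = x (vanishes at the Dirichlet boundary x = 0)
    return g0 + x * net(x)              # satisfies u_hat(0) = g0 for any network parameters

x = torch.linspace(0.0, 1.0, 5).reshape(-1, 1)
print(u_hat(x))                         # first entry equals g0 exactly; L_B = 0 by construction
```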

Another possibility is to introduce weighting terms \(\kappa _{{\mathcal {N}}}, \kappa _{{\mathcal {B}}}, \kappa _{{\mathcal {D}}}\) for each loss term. These are either hyperparameters, or they are learned during the optimization with attention mechanisms [254,255,256]. This is achieved by performing a minimax optimization with respect to all weighting terms \(\varvec{\kappa }=\{\kappa _{{\mathcal {N}}}, \kappa _{{\mathcal {B}}}, \kappa _{{\mathcal {D}}}\}\)

$$\begin{aligned} \min _{\varvec{\theta }} \max _{\varvec{\kappa }} C. \end{aligned}$$
(39)

Expanding on this idea, each collocation point used for the loss terms can be considered an individual equality constraint [257, 258]. Therefore, a weighting term \(\kappa _{{\mathcal {N}}_i}\) is allocated for each collocation point \(x_i\), as illustrated for the PDE loss \({\mathcal {L}}_{{\mathcal {N}}}\) from Eq. (30)

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {N}}} = \frac{1}{2N_{{\mathcal {N}}}} \sum _{i=1}^{N_{\mathcal {N}}} \kappa _{{\mathcal {N}},i} || {\mathcal {N}}[u(x_i);\lambda (x_i)]||^2_2. \end{aligned}$$
(40)

This has the added advantage that greater emphasis is assigned to more important collocation points, i.e., points which lead to larger residuals. This approach is strongly related to the approaches relying on the augmented Lagrangian method [259] and to competitive PINNs [260], where an additional NN models the penalty weights \(\kappa (x)=K_{FNN}(x; \varvec{\theta }^\kappa )\). This is similar to weak adversarial networks, but is instead formulated using the strong form.

Ansatz

Another prominent topic is the question of which ansatz to choose. The type of ansatz is, for example, determined by different NN architectures (see [261] for a comparison) or combinations with classical ansatz formulations. Instead of using FC-NNs, some authors [182, 226] employ CNNs to exploit the spatial structure of the data. Irregular geometries can be handled by embedding the structure in a rectangular domain using binary encodings [262] or signed distance functions [86, 263]. Another option is a coordinate transformation onto a rectangular grid [264]. The CNN requires a full-grid discretization, meaning that the coordinates x are analytically independent of the prediction \({\hat{u}} = F_{CNN}\). Thus, the gradients of u are not obtained with automatic differentiation, but with numerical differentiation, i.e., finite differences. Alternatively, the output of the CNN can represent coefficients of an interpolation, as proposed under the name spline-PINNs [265] using Hermite splines. This again allows for automatic differentiation. This is similarly applied to irregular geometries in [266], where GNNs are used in combination with a piecewise polynomial basis. Using a classical basis has the added advantage that Dirichlet boundary conditions can be satisfied exactly. A further variation is the approximation of the coefficients of classical bases with FC-NNs. This is shown with B-splines in [267] in the sense of isogeometric analysis [268]. This was similarly done for piecewise polynomials in [269]. However, instead of simply minimizing the PDE residual from Eq. (30) directly, the finite element discretization [270, 271] is exploited. The loss \({\mathcal {L}}_{{\mathcal {F}}}\) can thus be formulated in terms of the non-linear stiffness matrix \({\varvec{K}}\), the force vector \({\varvec{F}}\), and the degrees of freedom \({\varvec{u}}^h\).

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {F}}}= ||{\varvec{K}}({\varvec{u}}^h){\varvec{u}}^h-{\varvec{F}}||_2^2 \end{aligned}$$
(41)

In the forward problem, \({\varvec{u}}^h\) is approximated by a FC-NN, whereas for the inverse problem a FC-NN predicts \({\varvec{K}}\). Similarly, [272, 273] map a NN onto a finite element space by using the NN evaluations at nodal coordinates as the corresponding basis function coefficients. This also allows a straightforward strong enforcement of Dirichlet boundary conditions, as demonstrated in [79] with CNNs. The nodes are represented as pixels (see Fig. 3).

Prior information on the solution can be incorporated through a feature layer [274]. If, for example, it is known that the solution is composed of trigonometric functions, a feature layer with trigonometric functions can be applied after the input layer. Thus, known features are given to the NN directly to aid the learning. Without known features, the task can also be modified to improve learning. Inspired by adaptivity from finite elements, refinements are progressively learned by additional layers of the NN [275] (see Fig. 6). Thus, a coarse solution \({\varvec{u}}_1\) is learned to begin with, then refined to \({\varvec{u}}_2\) by an additional layer, which again is refined to \({\varvec{u}}_3\) until the deepest refinement level is reached.

Fig. 6 Refinement expressed with NNs in terms of NN depth. Thick black lines indicate non-learnable connections and gray lines indicate learnable connections. Each added layer is composed of a projection from the coarser level and a correction obtained through the learnable connection

Domain decomposition

To improve the scalability of PINNs to more complex problems, several domain decomposition methods have been proposed. One approach is hp-variational PINNs [238], where the domain is decomposed into patches. Piecewise polynomial test functions are defined on each patch separately, while the solution is approximated by a globally acting NN. This enables a separate numerical integration of each patch, improving its accuracy.

In an alternative formulation, one NN can be used per subdomain. This was proposed as conservative PINNs [276], where conservation laws are enforced at the interface to ensure continuity. Here, the discrepancies between both solution and flux were penalized at the interface in a least squares manner. The advantages of this approach are twofold: Firstly, parallelization is possible [277] and, secondly, adaptivity can be introduced. Shallower networks can be employed for smooth solutions and deeper networks for more complex solutions. The approach was generalized for any PDE in the context of extended PINNs [278]. Here, the interface condition is formulated in terms of the difference in both the residual and the solution.

Acceleration methods

Analogously to supervised learning, as discussed in Sect. 2.1, transfer learning can be applied to PINNs [279] as, e.g., demonstrated in phase-field fracture [280] or topology optimization [281]. These are very suitable problems since crack and displacement fields evolve with mostly local changes in phase-field fracture. For topology optimization, only minor updates are expected between each optimization iteration [281].

The poor performance of PINNs in their original form can also be improved with better sampling strategies. In importance sampling [282, 283], the collocation point density is proportional to the value of the cost function. Alternatively, residual-based adaptive refinement [234] adds collocation points in the vicinity of areas with a higher cost function.

Another essential topic for NNs is normalization of the inputs, outputs, and loss terms [284, 285]. For time-dependent problems, it is possible to use time-dependent normalization [286] to ensure that the solution is always in the same range regardless of the time step.

Furthermore, the cost function can be enhanced by also including the derivative of the residual [287]. The derivative should also be minimized, as both the residual and its derivative should be zero at the correct solution. However, a general problem in the cost function formulation persists. The cost function should correspond to the norm of the error, which is not necessarily the case. This means that a reduction in the cost does not necessarily yield an improvement in the quality of the solution. The error norm can be expressed in terms of the \(H^{-1}\)-norm, which, according to [288], can efficiently be computed on rectangular domains with Fourier transforms. Thus, the \(H^{-1}\)-norm can directly be used as the cost function and minimized.

Another aspect is numerical differentiation, which is advantageous for the residual of the PDE [289], as automatic differentiation may be erroneous due to spurious oscillations between collocation points. Thus, numerical differentiation enforces regularity, which was exploited in [289] by coupling automatic differentiation and numerical differentiation to retain the advantages of automatic differentiation.

Further specialized modifications to NN architectures have been proposed. Adaptive activation functions [290] have shown acceleration in convergence. Extreme learning machines [291, 292] remove the need for iterations altogether. All layers are randomly initialized in extreme learning machines, and only the last layer is learnable. Since the last layer has no non-linear activation function, its parameters can be found with a least-squares regression. This was demonstrated for PINNs in [293]. Instead of only learning the last layer, the problem can be split into a non-linear and a linear regression problem, which are solved separately [294], such that the full expressivity of NNs is retained.
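The core idea of extreme learning machines can be sketched as follows for a plain regression task: the hidden layer is random and fixed, and only the linear output layer is fitted by least squares; the data and layer sizes are placeholders, and the extension to PDE residuals as in [293] is not shown.

```python
import torch

# Extreme learning machine sketch: random, fixed hidden layer + least-squares output layer
x = torch.linspace(0.0, 3.14, 200).reshape(-1, 1)       # hypothetical data
y = torch.sin(x)

W_hidden = torch.randn(1, 200)                           # random, non-learnable hidden weights
b_hidden = torch.randn(200)
features = torch.tanh(x @ W_hidden + b_hidden)           # random non-linear features, (200, 200)

# Linear output layer: solve the least-squares problem features @ w = y (no iterations needed)
w = torch.linalg.lstsq(features, y).solution
y_hat = features @ w                                      # prediction without iterative training
```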

Applications to forward problems

PINNs have been applied to various PDEs (see [229,230,231] for an overview). Forward problems can, for example, be found in solid mechanics [284, 295, 296], fluid mechanics [297,298,299,300,301,302,303,304], and thermomechanics [305, 306]. Currently, PINNs do not outperform classical solvers such as the finite element method [246, 307] in terms of speed for a given accuracy of engineering relevance. In the author's experience and judgement, this is especially the case for forward problems, even if the extensions mentioned above are employed. Often, the reported gains over classical forward solvers disregard the training effort and only consider evaluation times.

Incorporating large parts of the solution in the form of measurements with the data-driven loss \({\mathcal {L}}_{{\mathcal {D}}}\) improves the performance of PINNs, which can thereby become a viable method in some cases. Yet, [308] states that data-driven methods outperform PINNs. Thus, PINNs should not be regarded as a replacement for data-driven methods, but rather as a regularization technique for data-driven methods to reduce the generalization error.

Applications to inverse problems

PINNs are, however, particularly useful for inverse problems with full domain knowledge, i.e., where the solution is available throughout the entire domain. This has, for example, been shown for the identification of material properties [285, 309,310,311,312]. By contrast, for inverse problems with only partial knowledge, the applicability of PINNs is limited [313], as both the forward and the inverse solution have to be learned simultaneously. Most applications therefore limit themselves to simpler inversions such as size and shape optimization. Examples are published, e.g., in [295, 314,315,316,317,318,319]. Exceptions that deal with the identification of entire fields can be found in full waveform inversion [320], topology optimization [321], elasticity, and the heat equation [322].

2.2.1.2. Inverse problems

PINNs are capable of discovering governing equations by either learning the operator \({\mathcal {N}}\) or the coefficients \(\lambda \). The resulting operator is, however, not always interpretable, and in the case of identification of the coefficients, the underlying PDE is assumed. To discover interpretable operators, one can apply sparse regression approaches [323]. Here, potential differential operators are assumed as an input to the non-linear operator

$$\begin{aligned} \hat{{\mathcal {N}}}\left[ x, u,\frac{\partial u}{\partial x},\frac{\partial ^2 u}{\partial x^2},\dots \right] =0. \end{aligned}$$
(42)

Subsequently, a NN learns the corresponding coefficients using observed solutions inserted into Eq. (42). The differential operators are evaluated through automatic differentiation after first interpolating the solution with a NN. Sparsity is ensured with an \(L^1\)-regularization.

A more sophisticated and complete framework is AI-Feynman [324]. Sequentially, dimensional analysis, polynomial regression, and brute force search algorithms are applied to identify fundamental laws in the data. If unsuccessful, a NN interpolates the data, which can thereby be queried for symmetry and separability. The identification of symmetries leads to a reduction in variables, i.e., a reduction of the input space. In the case of separability, the problem is decomposed into two subproblems. The reduced problems or subproblems are iteratively fed through the framework until an equation is identified. AI-Feynman has been successfully applied to 100 equations from the Feynman lectures [325].

2.2.2 Time-stepping procedures

Again, Eqs. (3) and (4) are considered for the time-stepping procedures.

2.2.2.1. Physics-informed neural networks

In the spirit of domain decomposition, parareal PINNs [326] split the temporal domain into subdomains \([t_i, t_{i+1}]\). A rough estimate of the solution u is provided by a conjugate gradient solver applied to a simplified form of the PDE, starting from \(t_0\). PINNs are then independently applied in each subdomain to correct the estimate. Subsequently, the conjugate gradient solver is applied again, starting from \(t_1\). This process is repeated until all time steps have been traversed. A closely related approach can be found in [327], where a PINN is retrained on successive time segments, while a data-driven loss term ensures that previously learned time segments remain fulfilled.

Another approach is given by discrete-time PINNs [228], which consider the temporal dimension in a discrete manner. The differential equation from Eq. (3) is discretized with a Runge-Kutta method with q stages [328]:

$$\begin{aligned} u^{n+c_i}&= u^n+\Delta t \sum _{j=1}^q a_{ij} {\mathcal {N}}^{{\mathcal {T}}}[u^{n+c_j}], \qquad i=1,\dots ,q, \end{aligned}$$
(43)
$$\begin{aligned} u^{n+1}&= u^n+\Delta t \sum _{j=1}^q b_j {\mathcal {N}}^{{\mathcal {T}}}[u^{n+c_j}], \end{aligned}$$
(44)

where

$$\begin{aligned} u^{n+c_j}(x)=u(t^n+c_j\Delta t, x), \qquad j=1,\dots ,q. \end{aligned}$$
(45)

A NN \(F_{NN}\) predicts all stages \(i=1,\dots ,q\) from an input x:

$$\begin{aligned} \varvec{{\hat{u}}} = [{\hat{u}}^{n+c_1}(x),\dots ,{\hat{u}}^{n+c_q}(x),{\hat{u}}^{n+1}(x)] = F_{NN}(x;\varvec{\theta }). \end{aligned}$$
(46)

The cost is then constructed by rearranging Eqs. (43) and (44).

$$\begin{aligned} {\hat{u}}^n&= {\hat{u}}_i^n = {\hat{u}}^{n+c_i} - \Delta t \sum _{j=1}^q a_{ij} {\mathcal {N}}^{{\mathcal {T}}}[{\hat{u}}^{n+c_j}], \qquad i=1,\dots ,q, \end{aligned}$$
(47)
$$\begin{aligned} {\hat{u}}^n&= {\hat{u}}^n_{q+1} = {\hat{u}}^{n+1} - \Delta t \sum _{j=1}^q b_j {\mathcal {N}}^{{\mathcal {T}}}[{\hat{u}}^{n+c_j}]. \end{aligned}$$
(48)

The \(q+1\) predictions \({\hat{u}}_i^n, {\hat{u}}^n_{q+1}\) of \({\hat{u}}^n\) have to match the initial condition \(u^{{\mathcal {M}}^n}\); the mean squared error between them is used as the loss function to learn all stages \(\varvec{{\hat{u}}}\). The approach has been applied to fluid mechanics [329, 330].

2.2.2.2. Inverse problems

As for inverse problems in the space-time approaches (Paragraph 2.2.1.2), the non-linear operator \({\mathcal {N}}\) can be learned. For temporal problems, this corresponds to the right-hand side of Eq. (3) for PDEs and to Eq. (4) for systems of ODEs. The predicted right-hand side can then be used to predict time series with a classical time-stepping scheme, as proposed in [331]. More sophisticated methods relying on similar principles are presented in the following. Specifically, we will discuss PDE-Net for discovering PDEs, SINDy for discovering systems of ODEs in an interpretable sense, and an approach relying on multistep methods for systems of ODEs. The multistep approach leads to a non-interpretable, but more expressive approximation of the right-hand side.

PDE-Net

PDE-Net [332, 333] is designed to learn both the system dynamics u(xt) and the underlying differential equation it follows. Given a problem of the form of Eq. (3), the right-hand side can be approximated as a function of coordinates and gradients of the solution.

$$\begin{aligned} \hat{{\mathcal {N}}}^{{\mathcal {T}}}\left[ x,u,\frac{\partial u}{\partial x},\frac{\partial ^2 u}{\partial x^2},\dots \right] \end{aligned}$$
(49)

The operator \(\hat{{\mathcal {N}}}^{{\mathcal {T}}}\) is approximated by NNs. The first step involves estimating spatial derivatives using learnable convolutional filters. The filters are designed to adjust their order of approximation based on the fit to the underlying measurements \(u^{{\mathcal {M}}}\), while the type of gradient is predefinedFootnote 18. Thus, the NN learns how to best approximate spatial derivatives specific to the underlying data. Subsequently, the inputs of \(\hat{{\mathcal {N}}}^{{\mathcal {T}}}\) are combined with point-wise CNNs [334] in [332] or a symbolic network in [333]. Both yield an interpretable operator from which the analytical expression can be extracted. In order to construct a loss function, Eqs. (3) and (49) are discretized using the forward Euler method:

$$\begin{aligned} u(x, t_{n+1}) = u(x, t_{n}) + \Delta t \hat{{\mathcal {N}}}^{{\mathcal {T}}}\left[ x, u, \frac{\partial u}{\partial x}, \frac{\partial ^2 u}{\partial x^2}, \dots \right] . \end{aligned}$$
(50)

This temporal discretization is applied iteratively, and the discrepancy between the derived function and the measured data \(u^{{\mathcal {M}}}(x, t_{n})\) serves as the loss function.

SINDy

Sparse identification of non-linear dynamic systems (SINDy) [335] deals with the discovery of dynamic systems of the form of Eq. (4). The task is posed as a sparse regression problem. Snapshot matrices of the state \({\varvec{X}}=[{\varvec{x}}(t_1),{\varvec{x}}(t_2),\dots ,{\varvec{x}}(t_n)]\) and its time derivative \(\dot{{\varvec{X}}}=[\dot{{\varvec{x}}}(t_1),\dot{{\varvec{x}}}(t_2),\dots ,\dot{{\varvec{x}}}(t_n)]\) are related to one another via candidate functions \(\varvec{\Theta }({\varvec{X}})\) evaluated at \({\varvec{X}}\) using unknown coefficients \(\varvec{\Xi }\):

$$\begin{aligned} \dot{{\varvec{X}}}=\varvec{\Theta }({\varvec{X}})\varvec{\Xi }. \end{aligned}$$
(51)

The coefficients \(\varvec{\Xi }\) are determined through sparse regression, such as sequential thresholded least squares or LASSO regression. By including partial derivatives, SINDy has been extended to the discovery of PDEs [336, 337].
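
The following sketch illustrates the basic SINDy workflow with sequential thresholded least squares; the dynamical system, the polynomial candidate library, and the threshold are arbitrary choices for illustration.

```python
import numpy as np

# Minimal SINDy sketch with sequential thresholded least squares (STLSQ).
# Toy data from a damped linear oscillator: dx/dt = -0.1 x + 2 y, dy/dt = -2 x - 0.1 y.
dt, n = 0.01, 5000
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])
X = np.zeros((n, 2))
X[0] = [2.0, 0.0]
for i in range(n - 1):                       # explicit Euler, only to create snapshots
    X[i + 1] = X[i] + dt * X[i] @ A.T
Xdot = np.gradient(X, dt, axis=0)            # time derivatives from the snapshots

def library(X):
    # candidate functions Theta(X): polynomials up to second order
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])

Theta = library(X)
Xi, *_ = np.linalg.lstsq(Theta, Xdot, rcond=None)
for _ in range(10):                          # STLSQ: zero small coefficients, refit the rest
    small = np.abs(Xi) < 0.05
    Xi[small] = 0.0
    for k in range(Xdot.shape[1]):
        big = ~small[:, k]
        Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], Xdot[:, k], rcond=None)

print(Xi)   # ideally only the coefficients of the linear terms remain
```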

The expressivity of SINDy can further be increased by a coordinate transformation into a latent space in which the system dynamics have a simpler representation. This can be achieved with an autoencoder (consisting of an encoder \(e_{NN}(x;\varvec{\theta }^e)\) and a decoder \(d_{NN}(h;\varvec{\theta }^d)\)), as proposed in [338], where the dynamics are learned on the reduced latent space h using SINDy. A simultaneous optimization of the NN parameters \(\varvec{\theta }^e, \varvec{\theta }^d\) and the SINDy parameters \(\varvec{\Xi }\) is conducted with gradient descent. The cost is defined in terms of the autoencoder reconstruction loss \({\mathcal {L}}_{{\mathcal {A}}}\) and the residual of Eq. (51) in both the reduced latent space \({\mathcal {L}}_{{\mathcal {R}}}\) and the original space \({\mathcal {L}}_{{\mathcal {F}}}\)Footnote 19. An \(L^1\)-regularization of \(\varvec{\Xi }\) promotes sparsity.

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {A}}}&= \frac{1}{2n} \sum _{i=1}^{n} ||{\varvec{x}}(t_i) - d_{NN}\big (e_{NN}({\varvec{x}}(t_i);\varvec{\theta }^e);\varvec{\theta }^d\big )||_2^2 \end{aligned}$$
(52)
$$\begin{aligned} {\mathcal {L}}_{{\mathcal {R}}}&= \frac{1}{2n}\sum _{i=1}^n ||\underbrace{\Big (\nabla _x e_{NN}\big ({\varvec{x}}(t_i); \varvec{\theta }^e\big )\Big )\cdot \dot{{\varvec{x}}}(t_i)}_{\dot{{\varvec{h}}}} \nonumber \\&\quad - \varvec{\Theta }\Big (e_{NN}\big ({\varvec{x}}(t_i);\varvec{\theta }^e\big )\Big )\varvec{\Xi }||_2^2 \end{aligned}$$
(53)
$$\begin{aligned} {\mathcal {L}}_{{\mathcal {F}}}&= \frac{1}{2n}\sum _{i=1}^n||\dot{{\varvec{x}}}(t_i) - \nabla _h d_{NN}\big (\underbrace{e_{NN}({\varvec{x}}(t_i);\varvec{\theta }^e)}_{{\varvec{h}}};\varvec{\theta }^d\big )\nonumber \\&\quad \cdot \underbrace{\varvec{\Theta }\Big (e_{NN}({\varvec{x}}(t_i);\varvec{\theta }^e)\Big )\varvec{\Xi }}_{\dot{{\varvec{h}}}}||_2^2 \end{aligned}$$
(54)
$$\begin{aligned} C&= \kappa _{{\mathcal {A}}} {\mathcal {L}}_{{\mathcal {A}}} + \kappa _{{\mathcal {R}}} {\mathcal {L}}_{{\mathcal {R}}} + \kappa _{{\mathcal {F}}} {\mathcal {L}}_{{\mathcal {F}}} \end{aligned}$$
(55)

As in Eq. (24), a weighted cost function with weights \(\kappa _{{\mathcal {A}}},\kappa _{{\mathcal {R}}},\kappa _{{\mathcal {F}}}\) is employed. The reduced latent space can be exploited for forward simulations of the identified system: by solving the system with classical time-stepping schemes in the reduced latent space, the solution in the full space is obtained through the decoder, as outlined in [339]. Thus, a reduced order model of a previously unknown system is identified. The downside is that the model is no longer interpretable in the full space.

Multistep methods

Another approach [340] to learning the system dynamics from Eq. (4) is to approximate the right-hand side directly with a NN \(\varvec{{\hat{f}}}({\varvec{x}}_i)=O_{NN}({\varvec{x}}_i;\varvec{\theta })\), \({\varvec{x}}_i={\varvec{x}}(t_i)\). A residual can be formulated by considering linear multistep methods [328], which in general take the form:

$$\begin{aligned} \sum _{m=0}^M [\alpha _m {\varvec{x}}_{n-m} + \Delta t \beta _m {\varvec{f}}({\varvec{x}}_{n-m})]=0, \end{aligned}$$
(56)

where \(M\) and \(\alpha _m, \beta _m, \; m=0,\dots ,M\) are parameters specific to a multistep scheme. The scheme can be reformulated as a cost function:

$$\begin{aligned} C&= \frac{1}{N-M+1} \sum _{n=M}^N ||\varvec{{\hat{y}}}_n||^2_2 \end{aligned}$$
(57)
$$\begin{aligned} \varvec{{\hat{y}}}_n&= \sum _{m=0}^M [\alpha _m {\varvec{x}}_{n-m}+\Delta t \beta _m \varvec{{\hat{f}}}({\varvec{x}}_{n-m})] \end{aligned}$$
(58)

The idea of the method is strongly linked to the discrete-time PINN presented in Paragraph 2.2.2.1, where a reformulation of the Runge-Kutta method yields the cost function needed to learn the forward solution.
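
A minimal sketch of this idea is given below, assuming the trapezoidal rule (\(M=1\), \(\alpha _0=1\), \(\alpha _1=-1\), \(\beta _0=\beta _1=-1/2\)) and a toy trajectory of a harmonic oscillator; the network size and training settings are placeholders.

```python
import torch

# Learn the right-hand side f of dx/dt = f(x) from a trajectory by minimizing
# the residual of a linear multistep scheme (trapezoidal rule in the notation of Eq. (56)).
torch.manual_seed(0)

dt = 0.01
t = torch.arange(0, 10, dt)
x = torch.stack([torch.cos(t), -torch.sin(t)], dim=1)    # trajectory of a harmonic oscillator

f_nn = torch.nn.Sequential(                               # NN approximation of the right-hand side
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(f_nn.parameters(), lr=1e-3)

alpha = (1.0, -1.0)
beta = (-0.5, -0.5)
for epoch in range(2000):
    f = f_nn(x)
    # residual y_n = alpha_0 x_n + alpha_1 x_{n-1} + dt * (beta_0 f_n + beta_1 f_{n-1})
    y = (alpha[0] * x[1:] + alpha[1] * x[:-1]
         + dt * (beta[0] * f[1:] + beta[1] * f[:-1]))
    loss = (y ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```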

2.2.3 Enforcement of physics by construction

Up to this point, this review only considered the case where physics are enforced indirectly through penalty terms of the PDE residual. The only exception, and the first example of enforcing physics by construction, was the strong enforcement of boundary conditions [37, 204, 248] by modifying the outputs of the NN, which leads to a fulfillment of the boundary conditions independent of the NN parameters. For PDEs, this can be achieved by manipulating the output such that the solution automatically obeys fundamental physical laws. Examples are, e.g., given in [341], where stream functions are predicted and subsequently differentiated to ensure conservation of mass, and in the incorporation of symmetries [342] or invariances [343] by using integrity bases [344]. Dynamical systems have been treated by learning the Lagrangian or Hamiltonian with Lagrangian NNs [345,346,347] and Hamiltonian NNs [348], respectively. The quantities of interest are obtained through the differentiable NN and compared to labeled data. Indirectly learning the quantities of interest through the Lagrangian or Hamiltonian guarantees the conservation of energy. Enforcing the physics by construction is also referred to as physics-constrained learning, as the learnable space is constrained. Note, however, that constraining the learnable space also challenges the learning algorithm, thus potentially making convergence more difficult. Therefore, [225] relaxes the requirement of fulfilling the physical laws by introducing a secondary unconstrained network, acting additively on the solution, whose influence is scaled by a hyperparameter. More examples of physics enforcement by construction are provided in the context of simulation enhancement in Sect. 3.2.
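
As a simple illustration of enforcing physics by construction, the following sketch modifies the NN output such that Dirichlet boundary conditions on a one-dimensional domain are fulfilled for any network parameters; the boundary values, the distance function, and the network architecture are arbitrary choices.

```python
import torch

# Strong enforcement of Dirichlet boundary conditions u(0) = g0, u(1) = g1 on [0, 1]
# by construction: the ansatz holds for any network parameters (values are placeholders).
g0, g1 = 0.0, 1.0
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def u_hat(x):
    # boundary-compliant ansatz: interpolation of the boundary values plus a
    # correction term that vanishes at x = 0 and x = 1
    return g0 * (1 - x) + g1 * x + x * (1 - x) * net(x)

print(u_hat(torch.zeros(1, 1)), u_hat(torch.ones(1, 1)))   # exactly g0 and g1
```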

3 Simulation enhancement

The category of simulation enhancement deals with any deep learning technique that interacts directly with and, thus, improves a component of a classical simulation. This is the most diverse category and will therefore be subdivided into the individual steps of a classical simulation pipeline:

  • pre-processing

  • physical modeling

  • numerical methods

  • post-processing

Both data-driven and physics-informed approaches will be discussed in the following.

3.1 Pre-processing

The discussed pre-processing methods are trained in a supervised manner relying on the techniques presented in Sect. 2.1 and on labeled data.

3.1.1 Data preparation

Data preparation includes tasks such as geometry extraction. For instance, cracks detected in images by means of segmentation [349,350,351] can subsequently be used in simulations to assess their impact. CNNs have also been used to prepare voxel data obtained from computed tomography scans, see [352], where scanning artifacts are removed. Similarly, NNs can be employed to enhance measurement data. This was, for example, demonstrated in [353], where the NN acts as a denoiser for magnetic signals in the scope of non-destructive testing. Likewise, low-frequency extrapolation for full waveform inversion has been performed with NNs [354,355,356].

3.1.2 Initialization

Instead of preparing the data, the simulation can be accelerated by an initialization. This can, for example, be achieved through initial guesses by NNs, providing a better starting point for classical iterative solvers [357]Footnote 20. A tighter integration is achieved by using a pre-trained [279] NN ansatz whose parameters are subsequently tweaked by the classical solver, as demonstrated for full waveform inversion in [224].

3.1.3 Meshing

Finally, many simulation techniques rely on meshes. NNs can support mesh generation indirectly by predicting mesh density functions [358,359,360,361,362], incorporating either expert knowledge of where small elements are needed or relying on error estimations; a classical mesh generator is then employed. However, NNs (specifically let-it-grow NNs [363]) have also been proposed as mesh generators themselves [364, 365].

3.2 Physical modeling

Physical models that capture physical phenomena accurately are a core component of mechanics. Deep learning offers three main approaches for physical models. Firstly, a NN is used as the physical model directly (model substitution). Secondly, an underlying model may be assumed where a NN determines its coefficients (identification of model parameters). Lastly, the entire model can be identified by a NN (model identification). In the first approach, the NN is integrated within the simulation pipeline, while the latter two rely on incorporation of the identified models in a classical sense.

For illustration purposes, the approaches are mostly explained on the example of constitutive models. Here, the task is to relate the strain \(\varepsilon \) to a stress \(\sigma \), i.e., find a function \(\sigma =f(\varepsilon )\). This can, for example, be used within a finite element framework to determine the element stiffness, as elaborated in [366].

3.2.1 Model substitution

In model substitution, a NN \(f_{NN}\) replaces the model, yielding the prediction \({\hat{\sigma }}=f_{NN}(\varepsilon ;\varvec{\theta })\). The quality of the model is assessed with a data-driven cost function (Eq. 5) using labeled data \(\sigma ^{{\mathcal {M}}},\varepsilon ^{{\mathcal {M}}}\). The approach is applied to a variety of problems, where the key difference lies in the definition of input and output quantities. The same deep learning techniques from data-driven simulation substitution (Sect. 2.1) can be employed.

Applications include predictions of stress from strain [366, 367], flow stresses from temperatures, strain rates and strains [368, 369], yield functions [370], crack opening responses from stresses [371], contact stiffness from penetration and contact pressure [372], point of contact from position of neighboring nodes of finite elements [373], or control points of NURBS surfaces [374]. Source terms of simplified equations or coarser discretizations have also been learned for turbulence [74, 375, 376] and the wave equation [377]. Here, the reference—a high-fidelity model—is to be captured in the best possible way by the source term.

Variations also predict the quantity of interest indirectly. For example, strain energy densities \(\psi \) are predicted by NNs from deformation tensors F and subsequently differentiated using automatic differentiation to obtain stresses [378, 379]. The approach can also be extended to incorporate uncertainty quantification [380]. By extending the input space with microstructural information, an in-built homogenization is added to the constitutive model [381,382,383]. Thus, the macroscale simulation considers the microstructure at the integration points in the sense of \(\hbox {FE}^2\) [384, 385], but without an additional finite element computation. Incorporation of microstructures requires a large amount of realistic training data, which can be obtained through generative approaches as discussed in Sect. 5. Active learning can reduce the required number of simulations on these geometries [221].
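
A minimal sketch of this indirect prediction is given below: a NN maps the (flattened) deformation tensor to a strain energy density, and the stress follows from automatic differentiation; the architecture, the batch of deformation states, and the labeled stresses are placeholders.

```python
import torch

# A NN predicts the strain energy density psi from the deformation tensor F;
# the stress follows by automatic differentiation, P = d(psi)/dF.
psi_nn = torch.nn.Sequential(
    torch.nn.Linear(9, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1)
)

F = torch.eye(3).reshape(1, 9) + 0.01 * torch.randn(8, 9)   # batch of deformation tensors
F.requires_grad_(True)

psi = psi_nn(F)                                              # strain energy density per sample
P = torch.autograd.grad(psi.sum(), F, create_graph=True)[0]  # stress = d(psi)/dF, shape (8, 9)

# supervised loss against labeled stresses (placeholder data); create_graph=True
# allows backpropagating this loss to the network parameters
P_data = torch.zeros_like(P)
loss = ((P - P_data) ** 2).mean()
loss.backward()
```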

A specialized NN architecture is employed by [386], where a NN first estimates invariants I of the deformation tensor F and thereupon predicts the strain energy density, thus mimicking the classical constitutive modeling approach. Another network extension is the use of RNNs to learn history-dependent models. This was shown in [381, 382, 387, 388] for the prediction of the stress increment from the stress-strain history, the strain energy from the strain energy history [389], and crack patterns based on prior cracks and crystalline orientations [390, 391].

The learned models do not, however, necessarily obey fundamental physical laws. Attempts to incorporate physics as constraints using penalty terms have been made in [392,393,394]. Still, physical consistency is not guaranteed. Instead, NN architectures can be chosen such that they satisfy physical requirements by construction. In constitutive modeling, objectivity can be enforced by using only deformation invariants as input [395], and polyconvexity can be enforced through the architecture, such as input-convex NNs [396,397,398,399] or neural ordinary differential equations [395, 400]. It was demonstrated that ensuring fundamental physical properties, such as objectivity through invariants combined with polyconvexity, delivers a much better behavior for unseen data, especially if the model is used in extrapolation.

Input-convex NNs [401] enforce convexity with specialized activation functions, such as log-sum-exponential or softplus functions, in combination with constraints that keep the NN weights positive, while neural ordinary differential equations [402] (discussed in Sect. 4) approximate the strain energy density derivatives and ensure non-negative values. Alternatively, a mapping from the NN to a convex function can be defined [403], ensuring a convex function for any NN output. Related are also thermodynamics-based NNs [404, 405], e.g., applied to complex microstructures in [406], which by construction obey fundamental thermodynamic laws. Training of these methods can be performed in a supervised manner, relying on stress-strain data, or unsupervised. In the unsupervised setting, the constitutive model is incorporated in a finite element solver, yielding a displacement field for a specific boundary value problem. The computed field, together with measurement data, yields a residual that is referred to as the modified constitutive relation error (mCRE) [407,408,409], which is minimized to improve the constitutive relation [410, 411]. Instead of formulating the mismatch in terms of displacements, [412, 413] formulate it in terms of boundary forces. For an in-depth overview of constitutive model substitution in deep learning, see [32].
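
The following sketch indicates how such an input-convex NN can be constructed, using softplus activations and clamping the pass-through weights to non-negative values; the layer sizes, the clamping strategy, and the use of the invariants as input are illustrative assumptions.

```python
import torch
import torch.nn.functional as Fnn

# Input-convex NN sketch: softplus activations (convex, non-decreasing) and
# non-negative weights on the z-path keep the output convex in the input.
class ICNN(torch.nn.Module):
    def __init__(self, n_in=3, n_hidden=32):
        super().__init__()
        self.Wx0 = torch.nn.Linear(n_in, n_hidden)
        self.Wz1 = torch.nn.Linear(n_hidden, n_hidden, bias=False)   # weights kept >= 0
        self.Wx1 = torch.nn.Linear(n_in, n_hidden)
        self.Wz2 = torch.nn.Linear(n_hidden, 1, bias=False)          # weights kept >= 0
        self.Wx2 = torch.nn.Linear(n_in, 1)

    def forward(self, x):
        z = Fnn.softplus(self.Wx0(x))
        z = Fnn.softplus(Fnn.linear(z, self.Wz1.weight.clamp(min=0)) + self.Wx1(x))
        return Fnn.linear(z, self.Wz2.weight.clamp(min=0)) + self.Wx2(x)

psi = ICNN()                    # e.g., a convex strain energy density in the invariants
x = torch.rand(5, 3)
print(psi(x).shape)             # (5, 1)
```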

3.2.2 Identification of model parameters

Identification of model parameters is achieved by assuming an underlying model and training a NN to predict its parameters for a given input. In the constitutive model example, one might assume a linear elastic model expressed in terms of a constitutive tensor c, such that \(\sigma =c\varepsilon \). The constitutive tensor can be predicted from the material distribution, expressed in terms of a heterogeneous elasticity modulus \({\varvec{E}}\) throughout the domain:

$$\begin{aligned} {\hat{c}}=f_{NN}({\varvec{E}};\varvec{\theta }). \end{aligned}$$
(59)

Typical applications are homogenization, where effective properties are predicted from the geometry and material distribution. Examples are CNN-based homogenizations on computed tomography scans [414, 415], predictions of in-vivo constitutive parameters of aortic walls from their geometry [416], predictions of elastoplastic properties [417] from instrumented indentation results relying on a multi-fidelity approach [418], prediction of stress intensity factors from the geometry in microfabricated microcantilevers [419], estimation of effective bone properties from the boundary conditions and applied stresses within a finite element, and the incorporation of meso-scale information by training a NN on representative volume elements [420].

3.2.3 Model identification

NN models as a replacement of classical approaches are not interpretable, while only identifying the parameters of known models restricts the model capacity. This gap can be bridged by the identification of models in terms of parsimonious mathematical expressions.

The typical procedure is to pose the problem in terms of candidate functions and to identify the most relevant terms. The methodology was inspired by SINDy [335] and introduced in the framework for efficient unsupervised constitutive law identification and discovery (EUCLID) [421]. The approach is unsupervised, as the stress-strain data is only indirectly available through the displacement field and corresponding reaction forces. The \(N_I\) invariants \(I_i\) of the deformation tensor F are inserted into a candidate library \(Q(\{I_i\}_{i=1}^{N_I})\) containing the candidate functions. Together with the corresponding weights \(\varvec{\theta }\), the strain energy density \(\psi \) is determined:

$$\begin{aligned} \psi (\{I_i\}_{i=1}^{N_I}) = Q^T(\{I_i\}_{i=1}^{N_I}) \varvec{\theta }. \end{aligned}$$
(60)

By differentiating the strain energy density \(\psi \) with automatic differentiation, the stresses \(\varvec{\sigma }\) are determined. The problem is then cast into the weak form, with which the linear momentum balance is enforced. The weak form is minimized with respect to \(\varvec{\theta }\) using a fixed-point iteration scheme (inspired by [422]), where an \(L^p\)-regularization promotes sparsity in \(\varvec{\theta }\). Despite being relatively new, the approach has already been applied to plasticity [423], viscoelasticity [424], and combinations thereof [425], and has been extended to incorporate uncertainties through a Bayesian model [426]. Furthermore, the approach has been extended with an ensemble of input-convex NNs [413], yielding a more accurate, but less interpretable model.

A similar effort was recently carried out by [427, 428], where NNs are designed to retain interpretability. This is achieved through sparse connections in combination with specialized activation functions representing candidate functions, such that they are able to capture classical forms of constitutive terms. Through the sparse connections in the network and the specialized activation functions, the NN’s weights become physical parameters, yielding an interpretable model. This is best understood by consulting Fig. 7, where the strain energy density is expressed as

$$\begin{aligned} {\hat{\psi }}&= \theta ^1_0 e^{\theta ^0_0 I_1} + \theta ^1_1 \ln (\theta ^0_1 I_1) + \theta ^1_2 e^{\theta ^0_2 I_1^2} + \theta ^1_3 \ln (\theta ^0_3 I_1^2) \nonumber \\&\quad + \theta ^1_4 e^{\theta ^0_4 I_2} + \theta ^1_5 \ln (\theta ^0_5 I_2) + \theta ^1_6 e^{\theta ^0_6 I_2^2} + \theta ^1_7 \ln (\theta ^0_7 I_2^2). \end{aligned}$$
(61)

Differentiating the predicted strain energy density \({\hat{\psi }}\) with respect to the invariants \(I_i\) yields the constitutive model, relating stress and strain.

Fig. 7

Automated model discovery through a sparsely connected NN with specialized activation functions acting as candidate functions. The thick black connections are not learnable, while the gray ones represent linearly weighted connections. Figure adapted and simplified from [427]

3.3 Numerical methods

This subsection describes efforts in which NNs are used to replace or enhance classical numerical schemes to solve PDEs.

3.3.1 Algorithm enhancement

Classical algorithms can be enhanced by NNs by learning corrections to commonly arising numerical errors or by estimating tunable parameters within the algorithm. Corrections have, for example, been used for numerical quadrature [429] in the context of finite elements. Therein, NNs are used to predict adjustments to quadrature weights and positions from the nodal positions to improve the accuracy for distorted elements. Similarly, NNs have been applied as corrections of strain-displacement matrices for distorted elements [430]. NNs have also been employed to provide improved gradient estimates. Specifically, [431] modify the gradient computation to match a fine-scale simulation on a coarse grid:

$$\begin{aligned} \frac{\partial ^n u}{\partial x^n}\approx \sum _i \alpha _i^{(n)}u_i. \end{aligned}$$
(62)

The coefficients \(\alpha _i\) are predicted by NNs from the current coarse solution. Special constraints are imposed on \(\alpha _i\) to guarantee accurate derivatives. Another application is specialized strain mappings for damage mechanics embedded within individual finite elements learned by PINNs [432]. It has even been suggested to partially replace solvers; for example, [433] replace either the fluid or the structural solver by a surrogate model for fluid-structure interaction problems.

Learning tunable parameters was demonstrated for the estimation of the largest possible time step using a RNN acting on the latent vector of an autoencoder [434]. Also, optimal test functions for finite elements were learned to improve stability [435]. Another approach to learning numerical parameters for simulations is presented in [436], where hyperparameters connected to a similarity-based topology optimization are learned; specifically, an energy scaling factor is predicted from a dissimilarity metric based on a previous topology optimization. These approaches have in common that they spare the user from performing multiple simulations to tune the numerical parameters.

3.3.2 Multiscale methods

Multiscale methods have been proposed to efficiently integrate and resolve systems acting on multiple scales. One approach is the learned constitutive models from Sect. 3.2 that incorporate the microstructure. This is essentially achieved through a homogenization at the mesoscale used within a macroscale simulation.

A related approach is element substructuring [437, 438], where superelements mimic the behavior of a conglomerate of classic basic finite elements. In [439], the superelements are enhanced by NNs, which draw on the boundary displacements to predict the displacements and stresses within the element as well as the reaction forces at the boundary. Through assembly of the reaction forces in the global finite element system, an equilibrium is reached with a Newton-Raphson solver. Similarly, the approach in [440] learns the internal forces from the coarse degrees of freedom of the superelements. These approaches are particularly valuable, as they can seamlessly incorporate history-dependent behavior using RNNs.

Finally, multiscale analysis can also be performed by first solving a coarse global model with a subsequent local analysis, which is referred to as a zooming method. In [441], a NN learns the global model and thereby predicts the boundary conditions for the local model. In a similar sense, DeepONets have been applied for the local analysis [442], whereas the global analysis is performed with a finite element solver. Both are conducted in an alternating fashion until convergence is reached.

3.3.3 Optimization

Optimization is a fundamental task within computational mechanics and is therefore addressed separately. It is not only used to find optimal structures, but also to solve inverse problems. Generally, the task can be formulated as minimizing a cost function C with respect to parameters \(\lambda \). In computational mechanics, \(\lambda \) is typically fed to a forward simulation \(u=F(\lambda )\), yielding a solution u inserted into the cost function C. If the gradients \(\nabla _\lambda C\) are available, gradient-based optimization is the state-of-the-art [443], where the gradients are used to update \(\lambda \). In order to access the gradients, the forward simulation F has to be differentiable. This requirement is, for example, utilized within the branch of deep learning called differentiable physics [36]. Incorporating gradient information from the numerical solver into the NN improves learning, feedback, and generalization. An overview and introduction to differentiable physics is provided in [36], with applications in [215, 402, 431, 444,445,446]Footnote 21.

The iterative gradient-based optimization procedure is illustrated in Fig. 8. For an in-depth treatment of NNs in optimization, see the recent review [22].

Fig. 8

Gradient-based optimization

Inserting a learned forward operator F, such as those discussed in Sect. 2.1, into an optimization problem provides two advantages [447,448,449,450,451]. Firstly, a faster forward operator results in faster optimization iterations. Secondly, the gradient computation is simplified, as automatic differentiation through the forward operator F is straightforward in contrast to the adjoint state method [452, 453]. Note, however, that for time-stepping procedures, the computational cost might be greater for automatic differentiation, as shown in [313]. Applications include full waveform inversion [313], topology optimization [454,455,456], and control problems [70, 72, 444].

Similarly, an operator replacing the sensitivity computation can be learned [456,457,458,459]. This can be achieved in a supervised manner with precomputed sensitivities of the cost C [456, 458], or by aiming to maximize the improvement of the cost function after the gradient update [457, 459]. In [457, 459], an evolutionary algorithm was employed for the general case that the sensitivities are not readily available. Training can adaptively be reintroduced during the optimization phase if the cost C does not decrease [456], improving the NN for the specific problem it is handling. Taking this idea to the extreme, the NN is trained on the initial gradient updates of a specific optimization. Later, solely the NN delivers the sensitivities [460], with supervised updates every n updates to improve accuracy, where n is a hyperparameter. The ideas of learning a forward operator and a sensitivity operator are combined in [455], where it is pointed out that the sensitivity from automatic differentiation through the learned forward operator can be inaccurate, despite an accurate forward operatorFootnote 22. Therefore, an additional loss term is added to the cost function, enforcing the correctness of the sensitivity through labels obtained with the adjoint state method. Alternatively, the sensitivity computation performed on a coarse grid can be corrected, as proposed in [461] and related to the multiscale techniques discussed in Sect. 3.3.2. Here, the adjoint field used for the sensitivity computation is reduced by both a proper orthogonal decomposition and a coarser discretization. Subsequently, a super-resolution NN [462] corrects the coarse estimate. Similarly, [456, 463] map the forward solution on a coarse grid to the design variable sensitivity on a fine grid. A related application is a correction term within a fixed-point iterator, as outlined in [464].

Related to the sensitivity predictions are approaches that directly predict an updated state. The goal is to decrease the total number of iterations. In practice, a combination of predictions and classical gradient-based updates is performed [111,112,113, 465]. The main variations between the methods in the literature are the inputs and how far the forecasting is performed. In [111], the update is obtained from the current state and gradient, while [113] predicts the final state from the history of initial updates. The history is also considered in [112], but the prediction is performed on subpatches which are then stitched together.

Another option for introducing NNs into the optimization loop is to use them as an ansatz for \(\lambda \), see, e.g., [313, 444, 466,467,468,469,470,471,472,473,474]. In the context of inverse problems [313, 444, 466,467,468,469,470], the NN acts as a regularizer on a spatially varying inverse quantity \(\lambda (x)=I_{NN}(x;\varvec{\theta })\), providing both smoother and sharper solutions. For topology optimization with a NN parametrization of the density function [471,472,473,474], no regularizing effect was observed; it was, however, possible to obtain a greater design diversity through different initializations of the NN. Extensions using specialized NN architectures for implicit representations [475,476,477,478,479,480] have been presented in the context of topology optimization in [481]. Furthermore, [313, 468, 472] showed how to conduct the gradient computation without automatic differentiation through the solver F. The gradient computation is split up via the chain rule:

$$\begin{aligned} \nabla _{\varvec{\theta }}C=\nabla _{\lambda } C \cdot \nabla _{\varvec{\theta }} \lambda . \end{aligned}$$
(63)

The first gradient \(\nabla _{\lambda } C\) is computed with the adjoint state method, such that the solver can be treated as a black box. The second gradient \(\nabla _{\varvec{\theta }} \lambda \) is obtained through automatic differentiation. An additional advantage of the NN ansatz is that, if applied to multiple solutions with a problem-specific input, the NN is trained. Thus, after sufficient inversions, the NN can be used as a predictor, as presented in [482]. The training can also be performed in combination with labeled data, yielding a semi-supervised approach, as demonstrated in [224, 483].
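
A minimal sketch of this split gradient computation of Eq. (63) is given below: the solver and its adjoint are represented by a placeholder black-box function returning \(C\) and \(\nabla _\lambda C\), which is then backpropagated through the NN ansatz via automatic differentiation; the network, data, and cost are illustrative assumptions.

```python
import torch

# NN ansatz lambda(x) = I_NN(x; theta); dC/dlambda comes from an external (adjoint)
# computation and is propagated to the NN parameters by autograd (Eq. (63)).
torch.manual_seed(0)
nn_lambda = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
optimizer = torch.optim.Adam(nn_lambda.parameters(), lr=1e-3)

x = torch.linspace(0, 1, 100).reshape(-1, 1)

def forward_and_adjoint(lam_np):
    # placeholder for a classical solver F(lambda) and its adjoint:
    # returns the cost C and the gradient dC/dlambda as plain arrays
    target = 2.0
    return ((lam_np - target) ** 2).sum(), 2.0 * (lam_np - target)

for it in range(100):
    lam = nn_lambda(x)                                    # NN ansatz of lambda(x)
    C, dC_dlam = forward_and_adjoint(lam.detach().numpy())
    optimizer.zero_grad()
    # chain rule: inject the externally computed dC/dlambda as the upstream gradient
    lam.backward(torch.as_tensor(dC_dlam, dtype=lam.dtype))
    optimizer.step()
```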

3.4 Post-processing

Post-processing concerns the modification and interpretation of the computed solution. One motivation is to reduce the numerical error of the computed solution. This can, for example, be achieved with super-resolution techniques relying on specialized CNN architectures from computer vision [484, 485]. Coarse-to-fine mappings can be obtained in a supervised manner using matching coarse and fine simulations as labeled data, as presented for turbulent flows [462, 486] and topology optimization [487,488,489]. The mapping is typically performed from coarse to fine solution fields, but mappings from a posteriori errors have been proposed as well [490]. Further specialized extensions to the cost function have been suggested in the context of de-homogenization [491].

The methods can analogously be applied to temporal data, where the solution is refined at each time step, as, e.g., presented with RNNs as correctors of reduced order models [492]. However, coarse discretizations in dynamical models lead to an error accumulation that increases with the number of time steps. Thus, a simple coarse-to-fine post-processing at each time step is not sufficient. To this end, [445, 446] apply a correction at each time step before the coarse solver predicts the next time step. As the correction is propagated through the solver, the sensitivities of the solver must be computed to perform the backward propagation. Therefore, a differentiable solver (i.e., differentiable physics) has to be employed. This significantly outperforms the purely supervised approach, where the entire coarse trajectory is computed without corrections in between. The number of steps unrolled through the solver during training is a hyperparameter; more steps increase the accuracy but come with a higher computational effort. This concept is referred to as solver-in-the-loop.
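
The following sketch outlines the solver-in-the-loop idea with a differentiable toy diffusion step as the coarse solver; the reference trajectory, the corrector architecture, and the number of unrolled steps are placeholders.

```python
import torch

# Solver-in-the-loop sketch: a differentiable coarse solver step is followed by a
# learned correction at every time step; the loss over the unrolled trajectory is
# backpropagated through the solver steps.
def coarse_step(u, dt=0.01):
    # differentiable toy solver: explicit diffusion update on a periodic 1D grid
    return u + dt * (torch.roll(u, 1, -1) - 2 * u + torch.roll(u, -1, -1))

corrector = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh(), torch.nn.Linear(64, 64))
optimizer = torch.optim.Adam(corrector.parameters(), lr=1e-3)

u0 = torch.sin(torch.linspace(0, 6.283, 64)).unsqueeze(0)   # initial state, batch of 1
u_ref = [u0]                                                # placeholder "fine" reference trajectory
for _ in range(10):
    u_fine = u_ref[-1]
    for _ in range(4):                                      # 4 fine sub-steps per coarse step
        u_fine = coarse_step(u_fine, dt=0.0025)
    u_ref.append(u_fine.detach())

n_steps = 10                                 # hyperparameter: steps unrolled through the solver
u, loss = u0, 0.0
for n in range(n_steps):
    u = coarse_step(u)                       # coarse prediction of the next time step
    u = u + corrector(u)                     # learned correction before the next solver step
    loss = loss + ((u - u_ref[n + 1]) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```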

Further variations perform the coarse-to-fine mapping in a patch-based manner, where the interfaces require a special treatment [493]. Another approach uses a NN to map the coarse solution to the closest fine solution stored in a database [494]. The mapping is performed on patches of the domain.

Other post-processing tasks include feature extraction. After a topology optimization, NNs have been used to extract basic shapes to be used in a subsequent shape optimization [495, 496]. Another aspect that can be ensured through post-processing is manufacturability.

Lastly, adaptive mesh refinement falls under the category of post-processing as well. Closely related to the meshing approaches discussed in Sect. 3.1.3, NNs have been proposed as error indicators [361, 497] that are trained in a supervised manner. The error indicators can subsequently be employed to adapt the mesh based on the error.

4 Discretizations as neural networks

NNs are composed of linear transformations and non-linear functions, which are also basic building blocks of most PDE discretizations. Thus, the motivation to construct NNs from discretizations of PDEs is twofold. Firstly, deep learning techniques can thereby be exploited within classical discretization frameworks. Secondly, novel NN architectures arise, which are more tailored towards many physical problems in computational mechanics but potentially also find use cases outside of that field.

4.1 Finite element method

One method is finite element NNs [14, 498] (see [499,500,501,502,503,504] for applications), for which we consider the system of equations from a finite element discretization with the stiffness matrix \(K_{ij}\), degrees of freedom \(u_j\), and the body load \(b_i\):

$$\begin{aligned} \sum _{j=1}^N K_{ij} u_j-b_i=0, i=1,2,\dots ,N. \end{aligned}$$
(64)

Assuming constant material properties within an element and uniform elements, a pre-integration of the local stiffness matrix \(k_{ij}^e=\alpha ^e w_{ij}^e\) can be performed, as, e.g., shown in [505]. The goal is to pull the material coefficients out of the integration, leading to the following assembly of the global stiffness matrix:

$$\begin{aligned} K_{ij}=\sum _{e=1}^M \alpha ^e W_{ij}^e \text { with } W_{ij}^e={\left\{ \begin{array}{ll} w_{ij}^e \text { if } i, j\in e\\ 0 \text { else} \end{array}\right. }. \end{aligned}$$
(65)

Inserting the assembly into the system of equations from Eq. (64) yields

$$\begin{aligned} \sum _{j=1}^N\left( \sum _{e=1}^M \alpha ^e W_{ij}^e \right) u_j-b_i=0, i=1,2,\dots ,N. \end{aligned}$$
(66)

The nested summation has a structure similar to that of a FC-NN, \(a_i^{(l)}=\sigma (z_i^{(l)})\) with \(z_i^{(l)}=\sum _{j=1}^{N^{(l)}} W_{ij}^{(l-1)} a_j^{(l-1)}+b_i^{(l)}\), but without activation \(\sigma \) and bias b (see Fig. 9):

$$\begin{aligned} a_i^{(2)}=\sum _{j=1}^{N^{(2)}} W_{ij}^{(1)} a_j^{(1)}=\sum _{j=1}^{N^{(2)}} W_{ij}^{(1)}\left( \sum _{k=1}^{N^{(1)}} W_{jk}^{(0)} a_k^{(0)}\right) . \end{aligned}$$
(67)

Thus, the stiffness matrix \(K_{ij}\) is the hidden layer. In a forward problem, \(W_{ij}^e\) are non-learnable weights, while \(u_j\) contains a mixture of learnable weights and non-learnable weights coming from the imposed Dirichlet boundary conditions. A loss can be formulated in terms of the body load mismatch, \(\frac{1}{2}\sum _{i=1}^N ({\hat{b}}_i - b_i)^2\). In the inverse setting, \(\alpha ^e\) becomes learnable instead of \(u_j\), which is then fixed. For partial domain knowledge in the inverse case, \(u_j\) becomes partially learnable.
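
A minimal sketch of the inverse setting of Eq. (66) for a one-dimensional bar is given below; the element matrices, the full-field displacement data, and the measured loads are placeholders.

```python
import torch

# Finite element NN sketch (inverse setting of Eq. (66)) for a 1D bar with M uniform
# linear elements: W^e are fixed assembly matrices, the material coefficients alpha^e
# are learnable, and the loss is the body-load mismatch.
M = 10                                            # elements, N = M + 1 nodes
N = M + 1
w = torch.tensor([[1.0, -1.0], [-1.0, 1.0]])      # pre-integrated local matrix (unit length)

W = torch.zeros(M, N, N)                          # element-wise scatter of the local matrix
for e in range(M):
    W[e, e:e + 2, e:e + 2] = w

alpha = torch.ones(M, requires_grad=True)         # learnable material coefficients
u = torch.linspace(0.0, 1.0, N)                   # known displacements (full-field data)
b_meas = torch.zeros(N)                           # measured body loads (placeholder)

optimizer = torch.optim.Adam([alpha], lr=1e-2)
for it in range(200):
    K = (alpha.reshape(M, 1, 1) * W).sum(dim=0)   # hidden layer: assembled stiffness matrix
    b_hat = K @ u                                  # output layer: predicted nodal loads
    loss = 0.5 * ((b_hat - b_meas) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```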

Fig. 9

Finite element NNs, prediction of forces \(b_i\) from material coefficients \(\alpha ^e\) via assembly of global stiffness matrix \(K_{ij}\), and evaluations of equations with the displacements \(u_j\) [498]

A different approach are the hierarchical deep-learning NNs (HiDeNNs) [506] with extensions in [507,508,509,510,511,512]. Here, shape functions are treated as NNs constructed from basic building blocks. Consider, for example, the one-dimensional linear shape functions

$$\begin{aligned} N_1(x)&=\frac{x-x_2^e}{x_1^e-x_2^e} \end{aligned}$$
(68)
$$\begin{aligned} N_2(x)&=\frac{x-x_1^e}{x_2^e-x_1^e}, \end{aligned}$$
(69)

which can be represented as a NN, as shown in Fig. 10, where the weights depend on the nodal positions \(x_1^e, x_2^e\). The interpolated displacement field \(u^e(x)\), which is valid in the element domain \(\Omega ^e\), is obtained by multiplication with the nodal displacements \(u_1^e, u_2^e\), treated as shared NN weights.

$$\begin{aligned} u^e(x)=N_1^e(x)u_1^e+N_2^e(x)u_2^e \end{aligned}$$
(70)

They are shared, as the nodal displacements \(u_1^e, u_2^e\) are also used for the neighboring elements \(u^{e-1}, u^{e+1}\). Finally, the displacement over the entire domain u is obtained by superposition of all elemental displacement fields \(u^e\), which are first multiplied by a step function defined as 1 inside the corresponding element domain \(\Omega ^e\) and 0 outside.

A forward problem is solved with a minimization of the variational loss function, as presented in Sect. 3.2 with the nodal values \(u^e_i\) as learnable weights. According to [506], this is equivalent to iterative solution procedures employed for large systems of equations in finite elements. The additional advantage is a seamless integration of r-refinement [513,514,515] (also referred to as adaptive mesh refinement), i.e., the shift of nodal positions to optimal positions by making the nodal positions \(x_i^e\) learnable. Special care has to be taken to avoid element inversion, which is handled by an additional loss term. Inverse problems can similarly be solved by using learnable input parameters, as presented for topology optimization [512].

The method has been combined with reduced order modeling techniques [508]. Furthermore, the shape functions have been extended with convolutions [510, 511]. Specifically, a weighting field W(x), i.e., a kernel (e.g., radial basis functions) with a learnable dilation parameterFootnote 23, is introduced to enhance the finite element space \(u^e(x)\) through convolutions, yielding \(u^c(x)\) and thereby increasing the space's expressivity and continuity:

$$\begin{aligned} u^c(x)=u^e (x) * W(x). \end{aligned}$$
(71)

This introduces a smoothing effect over the elements and can be implemented efficiently using NNs, thereby obtaining a more favorable data structure that exploits the full parallelization capabilities of GPUs [511]. The enhanced space has been incorporated in the HiDeNN framework. While an independent confirmation is still missing, the authors report a speedup of several orders of magnitude compared to traditional finite element solvers [512]Footnote 24.

Fig. 10

HiDeNN with one-dimensional linear elements [506]

Lastly, another approach related to finite elements was presented as FEA-Net [516, 517]. Here, the multiplication of the global stiffness matrix \({\varvec{K}}\) with the solution vector \({\varvec{u}}\), including the assembly of the global stiffness matrix, is replaced by a convolution. In other words, the computation of the internal force vector \({\varvec{K}}\cdot {\varvec{u}}\) is used to compute the residual \({\varvec{r}}\) given the external force vector \({\varvec{f}}\):

$$\begin{aligned} {\varvec{r}}={\varvec{f}} -{\varvec{K}}\cdot {\varvec{u}} \end{aligned}$$
(72)

Assuming a uniform mesh with homogeneous material properties, the mesh is defined by the segment illustrated in Fig. 11. The degree of freedom \(u_j\) only interacts with the stiffness contributions \(K_i^1, K_i^2, K_{i+1}^1, K_{i+1}^2\) of its neighboring elements i and \(i+1\). Therefore, the force component \(f_j\) acting on node j can be expressed by a convolution:

$$\begin{aligned} f_j = [K_i^1, K_i^2+K_{i+1}^1, K_{i+1}^2] * [U_{j-1}, U_{j}, U_{j+1}] \end{aligned}$$
(73)

This can analogously be applied to all degrees of freedom with the same convolution filter \({\varvec{W}} = [K^1, K^1 + K^2, K^2]\), assuming the same stiffness contributions for each element.

$$\begin{aligned} {\varvec{K}}\cdot {\varvec{u}} = {\varvec{W}} * {\varvec{U}} \end{aligned}$$
(74)

The convolution can then be exploited in iterative schemes which minimize the residual \({\varvec{r}}\) from Eq. (72). This saves the effort of constructing and storing the global stiffness matrix. By constructing the filter \({\varvec{W}}\) as a function of the material properties of the adjacent elements, heterogeneities can be taken into account [517]. If the same iterative solver is employed, FEA-Net is able to outperform classical finite elements for non-linear problems on uniform grids.
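
The following sketch illustrates the matrix-free product \({\varvec{K}}\cdot {\varvec{u}}={\varvec{W}}*{\varvec{U}}\) for a uniform one-dimensional mesh; the stiffness values correspond to a linear bar element of unit stiffness (an assumption for illustration), and the boundary treatment is ignored for brevity.

```python
import torch
import torch.nn.functional as Fnn

# Matrix-free K*u as a convolution (Eq. (74)) on a uniform 1D mesh with homogeneous
# material. For a linear bar element of unit stiffness, the assembled interior row
# of K is [-1, 2, -1], which serves as the convolution filter.
k = 1.0
filt = torch.tensor([[[-k, 2.0 * k, -k]]])       # convolution filter W, shape (1, 1, 3)

u = torch.rand(1, 1, 100)                        # nodal solution vector as a 1D signal
Ku = Fnn.conv1d(u, filt, padding=1)              # K*u without assembling or storing K
f = torch.zeros_like(Ku)                         # external force vector (placeholder)
r = f - Ku                                       # residual of Eq. (72), driving iterative solvers
```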

Fig. 11

Segment of one-dimensional finite element mesh with degrees of freedom (left). Local element definition with stiffness contributions (right)

Fig. 12

Analog RNN

4.2 Finite difference method

Similar ideas have been proposed for finite differences [518] and are, for example, employed in [313], where convolutional kernels serve as an implementation of finite difference stencils, exploiting efficient NN libraries with GPU capabilities. Here, the learnable parameters can be the finite difference stencil for inverse problems or the output for forward problems. This has, for example, been presented in the context of full waveform inversion, which is modeled as a RNN [519, 520]. The stencils are written as convolutional filters and repeatedly applied to the current state and the corresponding inputs, namely the wave field, the material distribution, and the source. The problem can then be regarded as a RNN. However, performing automatic differentiation through all time steps of full waveform inversion to obtain the sensitivities with respect to the material distribution \(\gamma \) is computationally expensive, both regarding memory and wall-clock time. A remedy is to combine automatic differentiation with the adjoint state method, as in [313, 468, 472] and discussed in Sect. 3.3.3.

Taking this idea one step further, the discretized wave equation can be regarded as an analog RNN [521] where the weights are the material distribution. Here, a binary material is learned in a trainable region between source and probing location. The input x(t) is encoded as a signal and emitted as source, which is measured at the probing locations \(y_i(t)\) as output. By integrating the outputs, a classification of the input can be performed.

4.3 Material discretizations

Deep material networks [522, 523] construct a NN from a material distribution. The output is constructed from basic building blocks inspired by analytical homogenization techniques. Given two materials defined in terms of their compliance tensors \(c_1\), \(c_2\), and volume fractions \(f_1, f_2\), an analytical effective compliance tensor \({\bar{c}}\) is computed. The effective tensor is subsequently rotated with a rotation tensor R, defined in terms of the three rotation angles \(\alpha , \beta , \gamma \), yielding a rotated effective tensor \({\bar{c}}_r\). Thus, the building block takes as input two compliance tensors \(c_1,c_2\) and outputs a rotated effective compliance tensor \({\bar{c}}_r\), where \(f_1, f_2, \alpha , \beta , \gamma \) are the learnable parameters (see Fig. 13). By connecting these building blocks, a large network can be created. The network is applied to homogenization tasks of representative volume elements (RVEs) [522, 523], where the material of the phases is varied during evaluation.

Fig. 13

A single building block of the deep material network [522]

4.4 Neural differential equations

In a more general setting, neural ordinary differential equations [402] consider the forward Euler discretization of ordinary differential equations. Specifically, RNNs are viewed as Euler discretizations of continuous transformations [524,525,526]. Consider the iterative update rule of the hidden states \(y_{t+1}=y(t+\Delta t)\) of a RNN.

$$\begin{aligned} y_{t+1}=y_t+f(y_t;\varvec{\theta }) \end{aligned}$$
(75)

Here, f is the evaluation of one recurrent unit of the RNN. In the limit \(\Delta t\rightarrow 0\), the dynamics of the hidden units \(y(t)\) can be parametrized by an ordinary differential equation:

$$\begin{aligned} \frac{dy(t)}{dt}=f(y(t),t;\varvec{\theta }) \end{aligned}$$
(76)

The input to the network is the initial condition y(0), and the output is the solution y(T) at time T, obtained by solving Eq. (76) with a differential equation solver. The sensitivity computation for the weight update is performed with the adjoint state method [453, 527], as backpropagating through each time step of the solver leads to a high memory cost; this also makes it possible to treat the solver as a black box. Similar extensions to PDEs [525] have been proposed by considering recurrent CNNs with residual connections, where the CNNs act as spatial gradients.
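
A minimal sketch of a neural ODE is given below, using an explicit forward Euler integrator for clarity; in practice, a library ODE solver with adjoint-based gradients would be used, and the dynamics network, initial conditions, and loss are placeholders.

```python
import torch

# Neural ODE sketch: the hidden-state dynamics dy/dt = f(y; theta) are given by a NN,
# and the output y(T) is obtained by integrating from the initial condition y(0).
f = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def odeint_euler(f, y0, T=1.0, n_steps=100):
    dt = T / n_steps
    y = y0
    for _ in range(n_steps):
        y = y + dt * f(y)        # forward Euler step of dy/dt = f(y; theta)
    return y

y0 = torch.rand(8, 2)            # batch of initial conditions y(0)
yT = odeint_euler(f, y0)         # network output y(T)
loss = (yT ** 2).mean()          # placeholder loss; gradients flow through all Euler steps
loss.backward()
```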

Similarly, [528] establish a connection between deep residual RNNs and iterative solvers. Residual connections in NNs allow information to bypass NN layers. Consider the estimation of the next state of a PDE with a classical solver \(u_{t+1}=u(t+\Delta t)=F[u(t)]\). The residual \(r_{t+1}=r(t+\Delta t)\) is determined in terms of the ground truth \(u_{t+1}^{{\mathcal {M}}}\):

$$\begin{aligned} r_{t+1} = u_{t+1}^{{\mathcal {M}}} - u_{t+1}. \end{aligned}$$
(77)

An iterative correction scheme is formulated with a NN, where the iterations are indicated with the superscript (k):

$$\begin{aligned} u_{t+1}^{(k+1)}&=u_{t+1}^{(k)} + f_{NN}(r_{t+1}^{(k+1)};\varvec{\theta }) \end{aligned}$$
(78)
$$\begin{aligned} r_{t+1}^{(k+1)}&= u_{t+1}^{{\mathcal {M}}} - u_{t+1}^{(k)} \end{aligned}$$
(79)

Note that the residual connection, i.e., \(u_{t+1}^{(k)}\) being directly used in the prediction of \(u_{t+1}^{(k+1)}\), allows information to bypass the recurrent unit \(f_{NN}\). A related approach can be found in [529], where an autoencoder iteratively acts on a solution until convergence; in the first iteration, a random initial solution is used as input.

5 Generative approaches

Generative approaches (see [33] for an in-depth review in the field of design and [530] for a hands-on textbook) aim to model the underlying probability distribution of a data set to generate new data that resembles the training data. Three main methodologies exist:

  • autoencoders,

  • generative adversarial networks (GANs),

  • diffusion models,

and are described in detail in Appendix B. Currently, there are two prominent areas of application in computational mechanics. One area of focus is microstructure generation (Sect. 5.1.1), which aims to produce a sufficient quantity of realistic training data for surrogate models, as described in Sect. 2.1. The second key application area is generative design (Sect. 5.1.2), which relies on algorithms to efficiently explore the design space within the constraints established by the designer.

5.1 Applications

5.1.1 Data generation

The most straightforward application of variational autoencoders and GANs in computational mechanics is the generation of new data based on existing examples. This has been demonstrated in [531,532,533,534,535] for microstructures, in [93] for velocity models used in full waveform inversion, and in [536] for optimized structures using GANs. Variational autoencoders have also been used to model the crossover operation in evolutionary algorithms to create new designs from parent designs [537]. Applications of diffusion models for microstructure generation can be found in [538,539,540].

Microstructures pose a unique challenge due to their inherent three-dimensional nature, while often only two-dimensional reference images are available. This has led to the development of specialized architectures that are capable of creating three-dimensional structures from representative two-dimensional slices [541,542,543]. The approach typically involves treating three-dimensional voxel data as a sequence of two-dimensional slices of pixels. Sequences of images are predicted from individual slices, ultimately forming a three-dimensional microstructure. In [544], a RNN is applied to a two-dimensional reference image, yielding an additional dimension and consequently creating a three-dimensional structure. The RNN is applied to the latent vector inside an encoder-decoder architecture, such that the inputs and outputs of the RNN have a relatively small size. Similarly, [545, 546] apply a transformer [172] to the latent vector. An alternative formulation using variational autoencoder GANs is presented in [547] to reconstruct three-dimensional voxel models of porous media from two-dimensional images.

The generated data sets can subsequently be leveraged to train surrogate models, as demonstrated in [536, 548,549,550], where CNNs were used to verify the physical properties of designs, and in the study by [551] on the homogenization of microstructures with CNNs. Similarly, [93, 552] generate realistic material distributions, such as velocity distributions, to train an inverse operator for full waveform inversion.

5.1.2 Generative design and design optimization

Within generative design, the generator can also be considered as a reparametrization of the design space that reduces the number of design variables. With autoencoders, the latent vector serves as the design parameter [553, 554], which is then optimizedFootnote 25. Similarly, [556] find that point cloud autoencoders [117, 557, 558] are advantageous as geometric dimensionality reduction tools (potentially combined with performance features) for efficiently exploring the design space. In the context of GANs, the optimization task is aimed at the random input \(\varvec{\xi }\) provided to the generator. This approach is demonstrated in various studies, such as ship hull design parameterized by NURBS surfaces [559], airfoil shapes expressed with Bézier curves [560, 561], structural optimization [562], and full waveform inversion [563]. For optimization, variational autoencoder GANs are particularly important, as the GAN ensures high quality designs, while the autoencoder ensures well-behaving gradients. This was shown for microstructure optimization in [564].

An important requirement for generative design is design diversity. Achieving this involves ensuring that the entire design space is spanned by the generated data. For this, the cost function can be extended, as presented in [565], using determinantal point processes [566] or in [559] with a space-filling term [567].

Other strategies are specifically focused on promoting design diversity. This involves identifying novel designs via a novelty score [568]. The novelty within these designs is segmented and used to modify the GAN using methods outlined in [569]. An alternative approach proposed by [570] quantifies creativity and maximizes it. This is achieved by performing a classification into pre-determined categories by the discriminator. If the classification is unsuccessful, the design must lie outside the categories and is therefore deemed creative. Thus, the generator seeks to minimize the classification accuracy.

However, some applications necessitate a resemblance to prior designs due to factors such as aesthetics [571] or manufacturability [572]. In [571], a pixel-wise \(L^1\)-distance to previous designs is included in the lossFootnote 26. A complete workflow with generative design enforcing resemblance to previous designs and surrogate model training for the quantification of mechanical properties is described in [573]. Another option is the use of style transfer techniques [555], which in [574] are incorporated into a conventional topology optimization scheme [575] as a constraint in the loss. These tools serve the purpose of incorporating vague constraints based on previous designs into topology optimization.

GANs can also be applied to inverse problems, as presented in [576] for full waveform inversion. The generator predicts the material distribution, which is used in a differentiable simulation providing the forward solution in the form of a seismogram. The discriminator attempts to distinguish between the seismograms indirectly produced by the generator and the measured seismograms. The underlying material distribution is determined through gradient descent.
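A hedged sketch of this adversarial inversion idea follows. The names generator, discriminator, forward_solver, and measured_seismograms are placeholders for a trained generator, a discriminator, a differentiable wave solver, and recorded data; the snippet only illustrates how gradients flow through the differentiable simulation, not the specific implementation of [576].

```python
# Illustrative sketch: one adversarial training step for full waveform inversion.
import torch
import torch.nn.functional as F

def adversarial_fwi_step(generator, discriminator, forward_solver,
                         measured_seismograms, opt_g, opt_d, latent_dim=16):
    z = torch.randn(measured_seismograms.shape[0], latent_dim)

    # discriminator update: measured seismograms vs. simulated ones
    opt_d.zero_grad()
    synthetic = forward_solver(generator(z)).detach()     # no generator gradients here
    real_logits = discriminator(measured_seismograms)
    fake_logits = discriminator(synthetic)
    loss_d = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    loss_d.backward()
    opt_d.step()

    # generator update: gradients flow through the differentiable solver
    opt_g.zero_grad()
    logits = discriminator(forward_solver(generator(z)))
    loss_g = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    loss_g.backward()
    opt_g.step()
```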

5.1.3 Conditional generation

As stated earlier, GANs can take specific inputs to dictate the output’s nature. The key difference to data-driven surrogate models from Sect. 2.1 is that GANs provide a tool to generate multiple outputs given the same conditional input. They are thus applicable to problems with multiple solutions, such as design optimization or data generation.

Examples of conditional generation are rendered cars from car sketches [577], hierarchical shape generation [578], where the child shape considers its parent shape, and topology optimization with predictions of optimal structures from initial fields, e.g., strain energy, of the unoptimized structure [579, 580]. Physical properties can also be used as input. The properties are computed by a differentiable solver after generation and are incorporated in the loss. This was, e.g., presented in [581] for airplane shapes and in [582] for inverse homogenization. For full waveform inversion, [583] trains a conditional GAN with seismograms as input to predict the corresponding velocity distributions. A similar effort is made by [584] with CycleGANs [585] to circumvent the need for paired data. Here, one generator generates a seismogram \({\hat{y}}=G_y(x)\) and another a corresponding velocity distribution \({\hat{x}}=G_x(y)\). The predictions are judged by two separate discriminators. Additionally, a cycle-consistency loss ensures that a prediction of a prediction, i.e., \(G_y({\hat{x}})\) or \(G_x({\hat{y}})\), matches the initial input x or y. In other words, the learned transformations preserve the essential features and structures of the original seismograms and velocity distributions when they are mapped from one domain to the other and back again.
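The cycle-consistency loss described above can be sketched as follows, with G_x and G_y as placeholders for the two generators; the choice of an \(L^1\)-norm is an assumption, as other norms are equally possible.

```python
# Illustrative sketch: cycle-consistency loss between seismogram and velocity domains.
import torch.nn.functional as F

def cycle_consistency_loss(G_x, G_y, seismogram, velocity):
    # seismogram -> velocity -> seismogram should reproduce the input seismogram
    loss_seis = F.l1_loss(G_y(G_x(seismogram)), seismogram)
    # velocity -> seismogram -> velocity should reproduce the input velocity
    loss_vel = F.l1_loss(G_x(G_y(velocity)), velocity)
    return loss_seis + loss_vel
```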

Lastly, coarse-to-fine mappings, as previously discussed in Sect. 3.4, can also be learned by GANs. This was, for example, demonstrated in topology optimization, where a conditional GAN refines coarse designs obtained from classical optimizations [579, 586] or CNN predictions [102]. For temporal problems, such as fluid flows, the temporal coherence between time steps poses an additional challenge. Temporal coherence can be ensured by a second discriminator, which receives three consecutive frames from either the generator or the real data and decides whether they are real or generated. The method is referred to as tempoGAN [587].
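As a minimal illustration of the temporal-coherence idea, the second discriminator can be fed triplets of consecutive frames, e.g., stacked along the channel dimension; temporal_discriminator is a placeholder, and the stacking choice is an assumption rather than the exact tempoGAN architecture.

```python
# Illustrative sketch: temporal discriminator judging three consecutive frames at once.
import torch

def temporal_logits(temporal_discriminator, frame_t0, frame_t1, frame_t2):
    triplet = torch.cat([frame_t0, frame_t1, frame_t2], dim=1)  # stack frames as channels
    return temporal_discriminator(triplet)  # real/generated decision on the whole triplet
```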

5.1.4 Anomaly detection

A final application of generative models is anomaly detection, see [588] for a review. This is particularly valuable for non-destructive testing, where flawed specimens can be identified in terms of anomalies. The approach relies on generative models that attempt to reconstruct the geometry. At first, the generative model is trained on structures without flaws. During evaluation, the structures to be tested are fed through the NN. In the case of an autoencoder, as in [589], they are fed through the encoder and decoder. For a GAN, as discussed, e.g., in [590,591,592], the input of the generator is optimized to fit the output as well as possible. The mismatch in reconstruction then provides a spatially dependent measure of where an anomaly, i.e., a defect, is located.
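A minimal sketch of reconstruction-based anomaly detection with an autoencoder is given below. The model is assumed to be trained on flaw-free specimens only; autoencoder is a placeholder, and the threshold is an illustrative assumption.

```python
# Illustrative sketch: spatial anomaly map from the reconstruction error of an
# autoencoder trained exclusively on flaw-free structures.
import torch

def anomaly_map(autoencoder, image, threshold=0.1):
    autoencoder.eval()
    with torch.no_grad():
        reconstruction = autoencoder(image)   # a flaw-free model cannot reproduce defects
    error = (image - reconstruction).abs()    # spatially resolved reconstruction mismatch
    return error, error > threshold           # error map and binary defect mask
```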

Another approach is to use the discriminator directly, as presented in [593]. If a flawed specimen is given to the discriminator, it will be categorized as fake, as it was not part of the undamaged structures during training. The discriminator can also be used to check if the domain of application of a surrogate model is valid. Trained on the same training data as the surrogate model, the discriminator estimates the dissimilarity between the data to be tested and the training data. For large discrepancies, the discriminator detects that the surrogate model becomes invalid.Footnote 27

6 Deep reinforcement learning

In reinforcement learning, an agent interacts with an environment through a sequence of actions \(a_t\), as illustrated in Fig. 14. Upon executing an action \(a_t\), the agent receives an updated state \(s_{t+1}\) and a reward \(r_{t+1}\) from the environment. The agent’s objective is to maximize the cumulative reward \(R_{\Sigma }\). The environment can be treated as a black box, which is an advantage in computational mechanics when differentiable physics is not available (as, for example, in crash simulations [594]). Reinforcement learning has achieved impressive results, such as human-level performance in games like Atari [20], Go [595], and StarCraft II [596]. Furthermore, reinforcement learning has successfully been demonstrated in robotics [597]; an example is learning complex maneuvers for autonomous helicopter flight [598,599,600].
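The interaction loop can be sketched as follows, here written against the Gymnasium-style reset/step interface; env and policy are placeholders for an environment (e.g., wrapping a PDE solver) and an agent.

```python
# Illustrative sketch: one episode of agent-environment interaction.
def run_episode(env, policy):
    state, _ = env.reset()
    cumulative_reward, done = 0.0, False
    while not done:
        action = policy(state)                                  # agent selects an action a_t
        state, reward, terminated, truncated, _ = env.step(action)
        cumulative_reward += reward                             # accumulate the reward
        done = terminated or truncated
    return cumulative_reward
```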

A comprehensive review of reinforcement learning exceeds the scope of this work, since it represents a major branch of machine learning. An introduction is, e.g., given in [25, 38], and an in-depth textbook is [45]. However, at the intersection of these domains lies deep reinforcement learning, which employs NNs to model the agent’s actions. In Appendix C, we present the main concepts of deep reinforcement learning and delve into two prominent methodologies: deep policy networks (Appendix C.1) and deep Q-learning (Appendix C.2) in view of applications in computational mechanics.

Fig. 14: Reinforcement learning in which an agent interacts with an environment with actions \(a_t\), states \(s_t\), and rewards \(r_t\). Figure adapted from [45]

6.1 Applications

Deep reinforcement learning is mainly used for inverse problems (see [25] for a review within fluid mechanics), where the PDE solver is treated as a black box and assumed not to be differentiable.

The most prominent applications are control problems. One example is discovering swimming strategies for fish, with the goal of efficiently minimizing the distance to a leader fish [601, 602]. The environment is given by the Navier-Stokes equations. Another example is balancing rigid bodies with fluid jets while using as little force as possible [603]. Similarly, [604] control jets in order to reduce the drag around a cylinder. Reducing the drag around a cylinder is also achieved by controlling small rotating cylinders in the wake of the flow [605]. A more complex example is controlling unmanned aerial vehicles [606]. The control schemes are learned by interacting with simulations and are subsequently applied in experiments.

Further applications in connection with inverse problems are learning filters to perturb flows in order to match target flows [607]. Also, constitutive laws can be identified. The individual arithmetic manipulations within a constitutive law can be represented as graphs. An agent constructs the graph in order to best match simulation and measurement [608], which yields an interpretable law.

Topology optimization has also been tackled by reinforcement learning. Specifically, the ability to predict only binary states (material or no material), instead of intermediate states as in solid isotropic material with penalization [609, 610], is desirable. This has been shown for binary truss structures, modeled with graphs, with the objective of minimizing the total structural volume under stress constraints. In [611], an agent removes trusses from existing structures, while trusses are added in [612]. Similarly, [613] removes finite elements in solid structures to modify the topology. By contrast, [614] pursues design diversity. Here, a NN surrogate model predicts near-optimal structures from reference designs. The agent then learns to generate reference designs as input, such that the corresponding optimal structures are as diverse as possible.

Also, high-dimensional PDEs have been solved by recasting them as stochastic control problems, which are then tackled with reinforcement learning [615, 616].

Finally, adaptive mesh refinement algorithms have been learned by reinforcement learning [617]. An agent decides whether an element is to be refined based on the current state, i.e., the mesh and solution. The reward is defined in terms of the error reduction, which is computed with a ground truth solution. The trained agent can then perform adaptive mesh refinement on previously unseen simulations.

6.1.1 Extensions

Each interaction with the environment requires solving the differential equation, which, due to the many interactions, makes reinforcement learning expensive. The learning can, however, be accelerated through some basic modifications. It can be perfectly parallelized by using multiple environments simultaneously [618] or multiple agents within the same environment [619]. Another idea is to construct a surrogate model of the environment and thereby exploit model-based approaches [620,621,622,623]. The general procedure, sketched in the code example after the list below, consists of three steps:

  • model learning: learn surrogate of environment,

  • behavior learning: learn policy or value function,

  • environment interaction: apply learned policy and collect data.

Most approaches construct the surrogate with data-driven modeling (Sect. 2.1), but physics-informed approaches have been proposed as well [620, 622] (Sect. 3.2).
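A minimal sketch of this alternation is given below; fit_surrogate, improve_policy, and collect_transitions are placeholders, and the loop structure is an illustrative assumption rather than a specific algorithm from the cited works.

```python
# Illustrative sketch: alternation of model learning, behavior learning, and
# environment interaction in model-based reinforcement learning.
def model_based_rl(env, policy, fit_surrogate, improve_policy,
                   collect_transitions, iterations=10):
    data = collect_transitions(env, policy)           # initial environment interaction
    for _ in range(iterations):
        surrogate = fit_surrogate(data)               # model learning
        policy = improve_policy(policy, surrogate)    # behavior learning on the surrogate
        data += collect_transitions(env, policy)      # environment interaction
    return policy
```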

7 Conclusion and outlook

In order to structure the state-of-the-art, an overview of the most prominent deep learning methods employed in computational mechanics was presented. Five main categories were identified: simulation substitution, simulation enhancement, discretizations as NNs, generative approaches, and deep reinforcement learning.

Despite the variety and abundance of the literature, few approaches are competitive in comparison to classical methods. This manifests itself in the literature’s lack of comparisons between NN-based and classical methods. We have found little evidence that NN-based methods truly outperform classical methods in computational mechanics. However, with only a few exceptions, current research is still in its early stages, focusing on showcasing possibilities rather than on accuracy and efficiency. Future research must, nevertheless, shift its focus towards more in-depth investigations into the performance of the developed methods, including thorough and meaningful comparisons to performant classical methods dedicated to the task under investigation. This is in agreement with the recent review article on deep learning in topology optimization [22], where critical and fair assessments are requested. This includes the determination of generalization capabilities, greater transparency by including, e.g., worst-case performances to illustrate reliability, and computation times that do not disregard the training time.

In line with this, and to the best of our knowledge, we provide a final overview outlining the potentials and limitations of the discussed methods.

  • Simulation substitution has potential for surrogate modeling of parameterized models that need to be evaluated many times. However, this is currently only realizable for small parameter spaces due to the amount of data required, and it is unlikely to replace established methods, as also stated in [42]. Complex problems can still be tackled by NN surrogates if they are first reduced to a low-dimensional space through model order reduction techniques. Physics-informed learning further reduces the amount of required data and improves the generalization capabilities. However, enforcing physics through penalty terms increases the computational effort, while the solutions still do not necessarily satisfy the corresponding physical laws. Enforcing physical laws by construction instead guarantees that they are obeyed and is therefore preferable to adding constraints through penalty terms.

  • Simulation enhancement is currently one of the most promising areas of investigation. It is particularly beneficial for tasks where classical methods show difficulties. An excellent example is the formulation of constitutive laws, which are inherently phenomenological and thereby well suited to be identified from data using tools such as deep learning. In addition, simulation enhancement makes it possible to draw on insights gained from classical methods developed since the inception of computational mechanics. Furthermore, it is currently more realistic to learn smaller components of the simulation chain with NNs rather than the entire model. These components should ideally be expensive and have limited requirements regarding accuracy and reliability. Lastly, it is also easier to assess whether a method enhanced by deep learning outperforms the classical method, as direct and fair comparisons are readily possible.

  • An interesting research direction is to employ discretizations as NNs, as this offers the potential to discover NNs tailored to computational mechanics tasks, just as CNNs are tailored to computer vision or RNNs and transformers to natural language processing. In computational mechanics, their main benefit seems to stem from being able to exploit the computational advantages of tools and hardware created for the wider deep learning community, such as NN libraries programmed for GPUs, which enable an efficient yet effortless massive parallelization. In our assessment, none of the methods encountered in this review were shown to consistently outperform classical approaches using a comparable amount of computational resources.

  • Generative approaches have been shown to be highly versatile in computational mechanics applications, since the accuracy of a specific instance under investigation is less of a concern here. They have been used to generate statistically equivalent data to train other machine learning models, to incorporate vague constraints based on data within optimization frameworks, and to detect anomalies.

  • Deep reinforcement learning has already shown encouraging results, for example in controlling unmanned vehicles in complex physics environments. It is mainly applicable to problems where efficient differentiable physics solvers are unavailable, which is why it is popular in control problems for turbulence. In the presence of differentiable solvers, gradient-based methods are, however, still the state-of-the-art [443] and thus preferred.