1 Introduction

In scientific computing, differential equations (DEs) provide an efficient description of a wide range of engineering problems (Boussange et al. 2023; Mallikarjunaiah 2023). Differential equations relate an unknown function to its first- or higher-order derivatives and are classified as ordinary differential equations (ODEs) or partial differential equations (PDEs), depending on whether ordinary or partial derivatives appear (Taylor et al. 2023; Farlow 2006). Unlike algebraic techniques, this approach establishes a relationship between the unknown function and its derivatives, and a solution must satisfy that relationship. In practice, DEs are very important for describing complex problems that occur daily in nature (Namaki et al. 2023; Soldatenko and Yusupov 2017; Rodkina and Kelly 2011). Since PDEs are the type of DEs addressed in this paper, we focus on them. Some simplified PDEs admit solutions in terms of common operators (Melchers et al. 2023; Farlow 2006). Popular techniques for solving PDEs include the Finite element method (FEM) (Zienkiewicz and Taylor 2000), the Finite volume method (FVM) (Versteeg and Malalasekera 2011), Particle-based methods (Oñate and Owen 2014), and the Finite cell method (FCM) (Kollmannsberger 2019).

For higher-order PDEs, the finite element method (FEM) can conveniently provide an accurate solution, but it demands extensive computational resources (Innerberger and Praetorius 2023), and its multi-iteration solution process limits practicality. At present, PDE solving is chiefly used in advanced applications such as aircraft design via fluid dynamics and weather forecasting. To improve PDE-solving capability, spline functions have been used (Kolman et al. 2017; Qin et al. 2019). Along similar lines, FEM essentially discretizes the domain and approximates the PDE solution numerically. Hence, introducing Fourier or Laplace transforms, which express PDEs more efficiently, offers better feasibility (Mugler and Scott 1988; El-Ajou 2021). Large-data models have robust abilities for fitting multi-variate, high-dimensional functions and have been developed for light-weight and rapid solving of PDEs. Before proceeding further, we give a brief introduction to FEM and discuss the relevance and importance of neural network techniques in this regard. Next, we enumerate the large-data models used for PDE solving; these models are discussed in detail later in the paper.

1.1 Finite element method (FEM)

Any PDE for a continuous field u on a certain domain can be written in the form of Eq. 1, and its numerical approximation can be obtained by different techniques, including the Finite element method (FEM) (Zienkiewicz and Taylor 2000). FEM is discussed here with emphasis on the Galerkin-based FEM.

$$L(u) = 0 \quad \text{on} \quad \Omega$$
(1)
$$u = y_d \quad \text{on} \quad T_D$$
(2)
$$\frac{\partial u}{\partial x} = g \quad \text{on} \quad T_N$$
(3)

Let the PDE be given by Eq. 1, where \(L(\cdot)\) is an arbitrary operator acting on the continuous field u, and let it be defined on the domain \(\Omega \subset \mathbb{R}^n\), the set of all possible inputs of the PDE, along with boundary conditions (BCs) given by Eqs. 2 and 3. Let \(y_d\) and g be the Dirichlet and Neumann BCs, respectively. The Dirichlet BC prescribes the numerical value that the variable u assumes at the domain boundary when solving the PDE. The Neumann BC prescribes the value of the derivative of u at the domain boundary, as opposed to the value of u itself as in the Dirichlet BC. The finite element formulation of Eq. 1 on a discrete domain having m elements and n nodes, including the BCs, yields the system given by Eq. 4.

$$\begin{aligned} \underbrace{ \begin{pmatrix} k_{1,1} & k_{1,2} & \cdots & k_{1,n} \\ k_{2,1} & k_{2,2} & \cdots & k_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ k_{n,1} & k_{n,2} & \cdots & k_{n,n} \end{pmatrix} }_{K(u^h)} \underbrace{ \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} }_{u^h} = \underbrace{ \begin{pmatrix} F_1 \\ F_2 \\ \vdots \\ F_n \end{pmatrix} }_{F} \end{aligned}$$
(4)

In Eq. 4, \(K(u^h)\) is the left-hand side matrix, known as the stiffness matrix; in general it depends on \(u^h\) and is therefore non-linear. The stiffness matrix defines the system of equations to be solved for obtaining the approximate solution to the PDE. \(u^h\) is the discretized solution field, and \(F \in \mathbb{R}^n\) is the right-hand side vector of applied forces, where \(F_i\) is the force at the \(i^{th}\) node. The system can be written in residual form as:

$$\begin{aligned} r(u^h) = K(u^h)u^h - F \end{aligned}$$
(5)

To obtain the solution \(u^h\), the Newton-Raphson method can be used by linearizing \(r(u^h)\) about the current iterate using its tangent. This technique requires solving a linear system of equations at every iteration, and the iterations proceed until the residual norm \(||r||\) falls below the tolerance. For a linear operator, convergence is achieved in a single iteration. When the numbers of elements and nodes are large, the most computationally expensive FEM step is solving this linear system. For applications where computational efficiency is critical, such as real-time models and digital twins, this step needs to be avoided as far as possible. Techniques like model-order reduction build a surrogate that significantly reduces the computational cost, while large-data based techniques like deep networks can do away with this cost completely. Large-data models like Convolutional neural networks (CNNs) have some notable merits for solving PDEs (Willard et al. 2022).
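To make the iteration concrete, the following is a minimal NumPy sketch of a Newton-Raphson-type loop for Eq. 5. The stiffness-assembly function `assemble_K` is a hypothetical placeholder standing in for a real FEM assembly, and the tangent is approximated by \(K(u)\) itself (a Picard-type linearization) rather than the consistent Jacobian.

```python
import numpy as np

def newton_raphson(assemble_K, F, u0, tol=1e-8, max_iter=50):
    """Solve r(u) = K(u) u - F = 0 iteratively (cf. Eq. 5).

    assemble_K : callable returning the (possibly solution-dependent)
                 stiffness matrix K(u) at the current iterate u.
    F          : right-hand-side force vector.
    u0         : initial guess for the discretized field u^h.
    """
    u = u0.copy()
    for it in range(max_iter):
        K = assemble_K(u)               # re-assemble stiffness at current iterate
        r = K @ u - F                   # residual of Eq. 5
        if np.linalg.norm(r) < tol:     # converged: residual norm within tolerance
            return u, it
        # Approximate tangent: K(u) itself (Picard linearization); a full
        # Newton step would use the consistent Jacobian of r instead.
        du = np.linalg.solve(K, -r)     # one linear solve per iteration
        u += du
    return u, max_iter

# Linear example: K is constant, so convergence occurs after a single iteration.
K_const = np.array([[4.0, -1.0], [-1.0, 3.0]])
F = np.array([1.0, 2.0])
u, iters = newton_raphson(lambda u: K_const, F, np.zeros(2))
```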

1.2 Large-data models for solving PDEs

With time, the popular large-data models used in deep learning (Hassaballah and Awad 2020; Minaee et al. 2023; Xu et al. 2023; Xiang et al. 2023; Hafiz et al. 2022; Hafiz and Hassaballah 2021), such as CNNs (Lecun et al. 1998; Krizhevsky et al. 2012; Hafiz et al. 2021), Recurrent neural networks (RNNs) (Ren et al. 2022), Long short term memory (LSTM) neural networks (José et al. 2021), Generative adversarial networks (GANs) (Gao and Ng 2022; Yang 2019), and the attention-based Transformers (Cao 2021), have also been applied to solving PDEs. Deep learning is an area wherein neural networks with a large number of layers are used for classification, regression, etc. Introduced by Lecun et al. (1998), Convolutional neural networks (CNNs) rose to popularity with AlexNet (Krizhevsky et al. 2012), a CNN that gave outstanding performance on the ImageNet classification challenge (Vakalopoulou et al. 2023; Deng et al. 2009). At that time, obtaining a high classification accuracy on the ImageNet dataset was considered a tough computer vision task. Since then, CNNs and deep learning have shattered many records in applications like computer vision (Hassaballah and Awad 2020; Hafiz and Bhat 2020; Hafiz et al. 2020, 2023), speech recognition (Jean et al. 2022; Bhangale and Kothandaraman 2022), financial market forecasting (Zhao and Yang 2023; Ashtiani and Raahemi 2023), and intelligent chatbots like the popular ChatGPT (Gordijn and ten Have 2023). Given the prowess of CNNs, it was only a matter of time before they were applied to tasks like solving PDEs, where they demonstrated promising results. This success of CNN-based PDE solving was due to their unique strengths, like implementation simplicity for supervised learning and consistency (Smets et al. 2023; Alt et al. 2023; Jiang et al. 2023).

CNNs have both strengths and weaknesses for solving PDEs (Michoski et al. 2020; Peng et al. 2023; Choi et al. 2023). The strengths of CNNs in this regard are:

  1. Significant ease of implementing PDEs.

  2. Convenience of using large data.

  3. Consistent solutions over the full space of parameters.

As for (1), highly complicated PDE systems with a very large number of parameters and high dimensionality can be implemented in Python using TensorFlow or PyTorch in a few hundred lines of code within a couple of days (Yiqi and Ng 2023; Quan and Huynh 2023). TensorFlow and PyTorch are Python deep learning libraries offering a rich set of functions that encapsulate state-of-the-art CNNs and the data-processing routines they require. This ease of implementation makes CNNs a convenient and accurate option for solving PDEs, and it is much easier than using many legacy PDE solvers (Kiyani et al. 2022). For (2), using data in supervised learning for CNNs is simple, and so is the empirical integration of such data into the PDE problem (Jiagang et al. 2022; Fang et al. 2023; Fuhg et al. 2023). As for (3), another important advantage is that exploring the parameter space requires only basic augmentation of the solution domain (Ren et al. 2022), i.e., the space-time parameter space denoting the possible values of the active PDE variable and the time parameter, with additional parameters like \((x, t, p_1, \cdots, p_n)\), followed by optimization of the CNN to solve the PDE as a function of these inputs. Augmenting the input space adds far less computational complexity than sequentially solving the PDE in the space-time values (x, t) at one parameter point \((p_1, \cdots, p_n)\) after another (Boussif et al. 2022; Tanyu et al. 2023); a sketch of such input-space augmentation is given below. It must be noted that space-time refers to the (x, t) values, where x may be a PDE input variable like displacement, and t refers to the input time variable.
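As an illustration of point (3), the following PyTorch sketch (with hypothetical layer sizes and parameter ranges) shows how the network input can simply be widened from (x, t) to \((x, t, p_1, \cdots, p_n)\), so that a single trained model covers the whole parameter space.

```python
import torch
import torch.nn as nn

class AugmentedPDENet(nn.Module):
    """Fully connected network taking the augmented input (x, t, p1, ..., pn)."""
    def __init__(self, n_params: int, width: int = 64):
        super().__init__()
        in_dim = 2 + n_params          # (x, t) plus the extra PDE parameters
        self.net = nn.Sequential(
            nn.Linear(in_dim, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),       # scalar solution estimate u(x, t, p)
        )

    def forward(self, x, t, params):
        return self.net(torch.cat([x, t, params], dim=-1))

# Hypothetical usage: two extra parameters, a batch of 128 collocation points.
model = AugmentedPDENet(n_params=2)
x = torch.rand(128, 1)
t = torch.rand(128, 1)
params = torch.rand(128, 2)            # sampled over the parameter ranges of interest
u_hat = model(x, t, params)            # one forward pass covers (x, t, p1, p2)
```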

On the other hand, some of the contemporary weaknesses of CNN-based PDE solvers are:

  1. Absence of a guarantee for theoretical convergence of residuals for non-convex PDE minimization.

  2. Overall slower run-time for each forward-solve.

  3. Weak grounding of theoretical methods in analysis of PDEs.

As for (1), there is the challenge of optimization convergence in non-convex domains, wherein the solution may become trapped in local minima (Shaban et al. 2023; Mowlavi and Nabi 2023). For a convex function, any local minimum is also the global minimum; a non-convex function, however, can have multiple local minima, i.e., solutions where the function reaches a low value that may not be the globally lowest one, so other, potentially better, solutions may exist. As for (2), the issue is more subtle than it appears and depends strongly on aspects of the CNN architecture used, such as hyperparameter optimization, the ultimate simulation goal, etc. (Tang et al. 2023b; Grohs et al. 2023). With respect to (3), it merely indicates that CNNs have only recently been seriously used for solving PDEs and hence remain largely untouched theoretically (Chen et al. 2023). In spite of the above weaknesses, and inspired by the strengths of deep learning, large-data models have been used for solving PDEs (Hou et al. 2023). Examples of CNNs (Lagaris et al. 1998) used for solving PDEs are Physics-informed neural networks (PINNs) (Raissi et al. 2019; Baydin et al. 2018), DeepONet (Lu et al. 2021b), etc. Examples of other variants of large-data neural networks used are RNNs like PhyCRNet (Ren et al. 2022), LSTMs (José et al. 2021), GANs (Gao and Ng 2022), etc. Also, recently developed large-data models like Transformers (Cao 2021) and Deep reinforcement learning neural networks (DRLNNs) (Han et al. 2018) have been used for solving PDEs. Through this literature survey, it is hoped that readers will gain insight into the use of state-of-the-art large-data models and be encouraged to engage in research in this interesting field. It is also hoped that this discussion will lay future inroads into the merger of high-level mathematical modeling and large-data model based simulation and prediction.

The main contributions of this paper are summarized as follows.

  • A comprehensive survey paper is presented in the domain of solving PDEs, to help researchers review, summarize, solve challenges, and plan for future.

  • An overview of current trends and related techniques for solving PDEs using large-data models is given.

  • The major issues and future scope of using large-data models are also discussed.

The rest of the paper is organized as follows. Section 2 discusses the works related to using large-data models. Section 3 presents the current trends in the area. Section 4 provides the issues and future directions. Finally, the conclusion is given in Sect. 5.

2 Related work

Solving differential equations with neural networks has been pursued for some time (Huang et al. 2022). Traditionally, shallow neural networks were used; these are the older generation of neural networks, which have only a few layers and approximate a small number of parameters. One of the first works can be traced to the year 1990 (Hyuk Lee and In Seok 1990). In that work, first- and higher-order DEs were discretized by finite approximations, an energy function was formulated for the resulting algebraic equations, and this energy function was subsequently minimized using Hopfield networks. Since then there has been a lot of research on solving DEs using various models like neural networks, CNNs, and recently other large-data models (Boussange et al. 2023). As reported by the Dimensions online database (Hook et al. 2018), the total number of publications to date is 337 for the phrase search: 'PDE solving using neural networks OR CNNs OR deep learning OR RNN OR LSTM OR GANs OR Transformers OR DRL'. The year-wise breakup of the number of publications obtained from the Dimensions online database for the same phrase is shown in Fig. 1. Out of the search results, the important and ground-breaking works with notable impact and citations were considered for the current work. Also included were works with novelty and a significant contribution to the field of PDE solving.

Fig. 1
figure 1

Total number of publications year-wise to date for the phrase search: 'PDE solving using neural networks OR CNNs OR deep learning OR RNN OR LSTM OR GANs OR Transformers OR DRL' on the Dimensions online database (Hook et al. 2018)

2.1 Shallow neural networks for PDE solving

The mathematical modeling of physical problems can be efficiently incorporated into neural networks. The works of Meade Jr and Fernandez (1994a, b) proved to be important for solving PDEs. In the work of Gobovic and Zaghloul (1993), a technique using local connections of neurons was proposed for solving PDEs of heat flow processes. Energy functions were formulated for PDEs with constant parameters and minimized using Very large scale integrated (VLSI) Complementary metal oxide semiconductor (CMOS) circuits; the CMOS circuit design was implemented as a neural network with each neuron representing a CMOS cell. Later works (Gobovic and Zaghloul 1994; Yentis and Zaghloul 1994, 1996) used local neural networks for obtaining solutions of PDEs through parallelization and Neural integrated circuits (NICs). Shallow neural networks initially paved the way for more work in the area of PDE solving. However, their weak approximation capability, owing to the small number of hidden layers, led to the use of CNNs, which were better at capturing the numerous dependencies in PDEs. Hence CNNs performed better than their shallow counterparts in PDE solving.

2.2 Deep neural networks for PDE solving

Deep learning using CNNs is a popular technique for computer vision tasks (Hassaballah and Awad 2020; Girshick et al. 2014; Girshick 2015; Ren et al. 2017; Sajid et al. 2021; Shelhamer et al. 2017; Chen et al. 2018; Zhao et al. 2017; Hafiz and Bhat 2020; Vinyals et al. 2015). Deep learning has also been used for tasks like Natural Language Processing (NLP) (Amanat et al. 2022). CNNs pre-trained on large datasets like ImageNet (Deng et al. 2009) are used after fine-tuning for two notable reasons (Jing and Tian 2021). First, the feature maps learned by CNNs from the large datasets help them to generalize better and faster. Second, pre-trained CNNs are adept at avoiding over-fitting during fine-tuning for smaller down-stream applications.

The accuracy of CNNs depends on their architecture (Hafiz et al. 2022; Hafiz and Hassaballah 2021) and the training technique (Hafiz et al. 2021). Many CNNs have been developed with huge numbers of parameters. For training these parameters, huge datasets are required. Some popular CNNs include AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), ResNet (He et al. 2016), and DenseNet (Huang et al. 2017). Popular CNN training datasets for computer vision include ImageNet (Deng et al. 2009) and OpenImage (Kuznetsova 2020). CNNs have achieved state-of-the-art classification performance for many computer vision tasks (Girshick et al. 2014; Shelhamer et al. 2017; Vinyals et al. 2015; Hassaballah and Hosny 2019; Ledig et al. 2017; Tran et al. 2015; Hafiz et al. 2020, 2023).

2.2.1 CNNs for solving non-linear equations

As per the work of Lagaris et al. (1998), differential equations can be broken down into sub-components using the Dirichlet and Neumann expressions. Using neural networks, Lagaris et al. (1998) solved non-linear equations to an accuracy of about seven decimal digits. Since Lagaris et al. (1998) is among the first works to use such networks for solving PDEs, it is worth explaining briefly. Consider a general differential equation, given by Eq. 6, for which a solution is needed:

$$\begin{aligned} F(x,y(x,t),\nabla y (x,t),{\nabla }^2 y(x,t)) = 0, \hspace{5pt} \bar{x} \in B \end{aligned}$$
(6)

Here \({x} = (x_1, x_2, ..., x_n) \in \mathbb {R}^n\), with boundary conditions prescribed on an arbitrary boundary S, \(B \subset \mathbb {R}^n\) is the defining domain, and y(x, t) is the solution sought. Note that we do not specify the boundary conditions here because we are defining a general equation.

To obtain a solution to Eq. 6, the domain B first has to be discretized into a set of points \(\hat{B}\), and the arbitrary boundary S has to be discretized into a set of points \(\hat{S}\). The DE may then be expressed as a system constrained by the (generally defined) boundary conditions, as per Eq. 7:

$$\begin{aligned} F(x_i,y(x_i,t), \nabla y (x_i,t), {\nabla }^2 y (x_i,t)) = 0 \hspace{5pt} \forall \hspace{4pt} {x_i} \in \hat{B} \end{aligned}$$
(7)

Here y(x, t) is the solution. It can be constructed from two components, as given in Eq. 8:

$$\begin{aligned} y(x,t) = A({x}) + f({x}, N ({x},{p})) \end{aligned}$$
(8)

Here A(x) contains no adjustable parameters and satisfies the boundary conditions, p is the set of adjustable parameters, and N(x, p) is the neural network whose parameters are adjusted during the minimization.
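A minimal sketch of this trial-solution idea for a 1D two-point boundary value problem is given below; the boundary values y(0) = a and y(1) = b are assumed for illustration. A(x) alone satisfies the boundary conditions, while the network term is multiplied by x(1 − x) so that it vanishes on the boundary, mirroring the decomposition in Eq. 8.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

a, b = 0.0, 1.0                       # assumed Dirichlet values y(0) = a, y(1) = b

def trial_solution(x):
    """y(x) = A(x) + f(x, N(x, p)) in the sense of Eq. 8."""
    A = a + (b - a) * x               # satisfies the boundary conditions exactly
    f = x * (1.0 - x) * net(x)        # vanishes at x = 0 and x = 1
    return A + f

x = torch.linspace(0.0, 1.0, 11).unsqueeze(-1)
y = trial_solution(x)                 # boundary rows equal a and b by construction
```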

Although the initial CNNs used for solving PDEs gave promising results, they had certain issues. These included a lack of interpretability, weak adaptation of their structure to problems, and average performance (Ruthotto and Haber 2020; Uriarte et al. 2023).

2.2.2 Physics-informed neural networks (PINNs) for PDE solving

Taking inspiration from the works of Lagaris et al. (1998) and Hornik (1991), wherein neural networks were used for universal approximation, a new genre of CNNs was introduced in which physical constraints in the form of PDEs were added to the loss function, hence the name Physics-informed neural networks (PINNs) (Raissi et al. 2019; Baydin et al. 2018). More specifically, the technique applies the laws of physics, expressed by PDEs, as CNN loss functions, which in turn are optimized to find solutions (Maziar and George 2018). PINNs do not need discretization of the domain and are quite practical, as heavy computation is avoided. PINNs use minimization techniques for non-linear parametric PDEs of the form given by Eq. 9:

$$\begin{aligned} y_t(x,t) + N[y;\lambda ] = 0, \hspace{5pt} x \in \Omega , \hspace{5pt} t \in [0,T] \end{aligned}$$
(9)

Here y(x, t) is the hidden (latent) solution, \(y_t\) denotes its time derivative, and \(N[\cdot ; \lambda ]\) is a non-linear operator parameterized by \(\lambda\). \(\Omega\) is a subset of \({\mathbb {R}^D}\), where D is the number of spatial dimensions.

Using PINNs, Raissi et al. extensively studied complex dynamic processes such as flow past a cylinder and aneurysm flows (Raissi et al. 2019, 2020). Figure 2 shows the schematic of the PINN. Although the PINN is a PDE-dedicated CNN model, which led to better performance, it suffered from the issues of other CNNs, like lack of interpretability, the need for large data, and long training times (Mowlavi and Nabi 2023; Meng et al. 2023). In spite of these, it was partly successful as an expert system for solving PDEs, owing to its PDE-dedicated framework (Tang et al. 2023a; Jia et al. 2022).
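The following is a minimal PyTorch-style sketch, not the authors' implementation, of how a residual of the form of Eq. 9 is turned into a training loss. The 1D viscous Burgers' equation \(y_t + y y_x - \nu y_{xx} = 0\) and the viscosity value are assumed here purely for illustration; initial- and boundary-condition terms are omitted.

```python
import math
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
nu = 0.01 / math.pi                    # assumed viscosity for the example PDE

def pde_residual(x, t):
    """Residual y_t + y*y_x - nu*y_xx of the assumed Burgers' equation."""
    x.requires_grad_(True)
    t.requires_grad_(True)
    y = net(torch.cat([x, t], dim=-1))
    grad = lambda out, var: torch.autograd.grad(
        out, var, grad_outputs=torch.ones_like(out), create_graph=True)[0]
    y_t = grad(y, t)
    y_x = grad(y, x)
    y_xx = grad(y_x, x)
    return y_t + y * y_x - nu * y_xx

x = torch.rand(256, 1) * 2.0 - 1.0     # collocation points in space
t = torch.rand(256, 1)                 # collocation points in time
loss_pde = (pde_residual(x, t) ** 2).mean()
# In a full PINN, initial- and boundary-condition terms are added to this loss
# and the sum is minimized with a standard optimizer (e.g. Adam or L-BFGS).
```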

Fig. 2
figure 2

Schematic of the Physics-informed neural network (PINN) used for solving PDEs. Huang et al. (2022)

2.2.3 DeepONet CNN for PDE solving

Due to improved computing capability and the availability of high-performance computing (HPC) systems, CNN implementation and training became convenient, and techniques that were previously difficult to implement were realized. Lu et al. considered Chen et al.'s non-linear operator approximation results (Chen and Chen 1995a, b) as a theoretical basis for justifying the use of neural networks for operator learning, and came up with a robust model called DeepONet (Lu et al. 2021b). As different network types have their own advantages, e.g. in computer vision and in sequential analysis respectively, the authors allowed various networks to be used within the DeepONet model to diversify and target various net assemblies. Also, Lu et al. exploited a bifurcated (branch-trunk) parallel structure in their proposed model. Figure 3 shows the schematic of the DeepONet model. FEA-Net (Yao et al. 2019), the Hierarchical deep-learning neural network (Lei et al. 2021), etc. are other examples of networks used for PDE solving.

Fig. 3
figure 3

Schematic of the DeepONet CNN model for approximation of the nonlinear operator G(u)(v) where u(x) and v are the variables of the PDE to be solved (Huang et al. 2022)

2.2.4 Recurrent neural networks (RNNs) for PDE solving

Ren et al. (2022) proposed a PINN called the physics-informed convolutional recurrent network (PhyCRNet) for solving PDEs. They used a convolutional encoder-decoder long short-term memory (LSTM) network for low-dimensional feature extraction and for learning the temporal evolution. PDE residuals are then used to estimate the PDE solution. The residual is the value obtained when the current estimate is substituted into the discretized PDE; if the PDE solution depends on time, the residuals have to converge at every time step. Ren et al. (2022) used the PDE residual R(x, y, t; \(\theta\)) given by Eq. 10:

$$\begin{aligned} R(x,y,t;\theta ) := u^\theta _t + F[u^\theta , \nabla _{x} u^{\theta } , \nabla _x^2 u^\theta , ...; \lambda ] \end{aligned}$$
(10)

Here the PDE solution takes values in \(\mathbb {R}^n\), on the temporal domain \(t \in\) [0,T] and the physical domain \(\Omega\). \(u^\theta\) is the network approximation of the solution and \(u^\theta_t\) is its first-order time derivative. \(\nabla _x\) is the gradient with respect to x, and F is the non-linear function with parameter \(\lambda\). F plays an important role in Eq. 10 because the residual is obtained by substituting the current approximation and its derivatives into the PDE.

The loss function L is the sum of squares of the residuals of the PDEs. For a 2D PDE system, L is given by Eq. 11:

$$\begin{aligned} L = \sum _{i=1}^{n}\sum _{j=1}^{m}\sum _{k=1}^{T} || R(x_i, y_j, t_k; \theta )||_2^2 \end{aligned}$$
(11)

Here n and m are the height and the width of the spatial domain, respectively, T is the total number of time-steps, and \(||\cdot ||_2\) is the \(l_2\) norm.
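A minimal sketch of the loss in Eq. 11 is given below: given a tensor of residual values evaluated on an n × m spatial grid over T time-steps (how those residuals are computed depends on the discretization used), the loss is simply the sum of their squared values, which for scalar residuals equals the triple sum of squared \(l_2\) norms.

```python
import torch

def residual_loss(R):
    """Sum of squared residuals over the n x m x T space-time grid (Eq. 11).

    R : tensor of shape (n, m, T) holding R(x_i, y_j, t_k; theta),
        assumed to have been computed from the discretized PDE.
    """
    return (R ** 2).sum()

R = torch.randn(32, 32, 10, requires_grad=True)   # placeholder residual field
loss = residual_loss(R)
loss.backward()                                    # gradients flow back to theta
```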

The loss function was thus based on PDE residuals, with the initial and boundary conditions hard-encoded inside the network. The model was enhanced by autoregressive as well as residual connections, which model the flow of time. Ren et al. solved three PDEs with their model, viz. the 2D Burgers' equation, the \(\lambda -\omega\) reaction-diffusion equation, and the FitzHugh-Nagumo equation, and promising results were obtained. Figure 4 gives an overview of the PhyCRNet architecture.

Fig. 4
figure 4

Schematic of the PhyCRNet model for solving PDEs. Each RNN cell consists of an encoder unit, a Long short term memory (LSTM) unit and a decoder unit. C and h are the cell and hidden states of the Convolutional long short term memory (ConvLSTM) units respectively. BC Encoding refers to the Boundary condition (BC) encoding wherein the BCs are enforced on the output variables. \({u_0}\) is the Initial condition (IC) and \({u_i}\) refers to the state variable for time-step \({t \in [1,T]}\). Ren et al. (2022)

RNNs are usually used for the prediction of time-series problems. Their application to PDE solving, in forms like PhyCRNet, marked a shift of technique and gave promising results. However, the performance of RNNs for solving PDEs is affected by issues like high complexity, narrow scope, and long training times.

2.3 Long short term memory (LSTM) neural networks for PDE solving

In their work, Ferrandis et al. (José et al. 2021) proposed the prediction of naval vessel motion, governed by PDEs, using Long short term memory (LSTM) neural networks. The input to their LSTM model was the stochastic wave elevation for a particular sea state, and the output comprised the vessel motions, viz. pitch, heave, and roll. Promising prediction results were obtained for the vessel motion under arbitrary wave elevations. They trained their LSTM neural networks using offline simulation and extended the prediction to online mode. Their objective function for minimization during training was the Mean squared error (MSE) given by Eq. 12:

$$\begin{aligned} \text {MSE} = \frac{1}{n}\hspace{3pt}{\sum _{i=1}^{n}{(X_i - \hat{X_i})^2}} \end{aligned}$$
(12)

Here \(X_i\) is the observed value, \(\hat{X_i}\) is the predicted value, and n is the number of samples.

Their approach was motivated by the universal approximation theorem for functionals, and they claimed that their work was the first to implement such a model for real engineering processes. A schematic of the general LSTM neural network model is shown in Fig. 5.
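As a hedged illustration (not the authors' architecture), the sketch below shows a PyTorch LSTM that maps a wave-elevation sequence to three motion channels (pitch, heave, roll) and is trained with the MSE loss of Eq. 12; all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class MotionLSTM(nn.Module):
    """Sequence-to-sequence LSTM: wave elevation in, (pitch, heave, roll) out."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)

    def forward(self, wave):              # wave: (batch, time, 1)
        h, _ = self.lstm(wave)
        return self.head(h)               # (batch, time, 3)

model = MotionLSTM()
wave = torch.randn(8, 200, 1)             # hypothetical wave-elevation sequences
target = torch.randn(8, 200, 3)           # hypothetical observed motions
loss = nn.MSELoss()(model(wave), target)  # the MSE of Eq. 12
loss.backward()
```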

Fig. 5
figure 5

Schematic of the general Long short term memory (LSTM) neural network illustrating the unfolding of the feedback loop which makes it suitable for sequential data processing. In the unfolded model, \(x_i\) is the input to the \(i^{th}\) cell, \(c_i\) is the cell-state of the \(i^{th}\) cell, and \(h_i\) is the hidden-state of the \(i^{th}\) cell. The subscripts \(t-1\), t and \(t+1\) represent three successive time-steps. Song et al. (2020)

As is evident from the above, LSTM NNs are complex in nature. In spite of this, their recurrent nature makes them suitable for regression problems like time-series prediction tasks. However, LSTM NNs are used less in PDE solving, and as of now few works on PDE solving using LSTM NNs can be found, owing to the complexity of modeling PDEs with LSTM NNs. Nevertheless, some unique applications of LSTM NN based PDE solving have been proposed and promising results are being obtained.

2.4 Generative adversarial networks (GANs) for PDE solving

Generative adversarial networks (GANs) (Goodfellow et al. 2014; Gui et al. 2021; Yang 2019; Gao and Ng 2022) are machine learning algorithms that use deep learning for the generation of new data. A GAN is made up of two NNs, i.e., a generator and a discriminator, which are trained together: the generator produces artificial data which imitates the real data, while the discriminator separates the generated data from the real data. GANs have become popular through tasks like image morphing (Gui et al. 2021). In their work, Gao and Ng (2022) proposed a physics-informed Wasserstein Generative Adversarial Network (WGAN) for solving PDEs. These GANs use the Wasserstein distance as the training criterion, which is well suited to PDE solving and leads to convenient convergence; its usage has paved the way for using GANs for PDE solving. They stated that GANs can be formulated in the general form given by Eq. 13:

$$\begin{aligned} \min _{g_\theta \in G} \; \max _{f_\alpha \in F} \; \mathbb {E}_{z \sim \pi } \, f_\alpha (g_\theta (z)) \; - \; \mathbb {E}_{x \sim \nu } \, f_\alpha (x) \end{aligned}$$
(13)

In Eq. 13, G and F are the classes of the generator and the discriminator, respectively. \(\pi\) is the source (latent) distribution and \(\nu\) is the data distribution to be approximated. \(g_\theta\) and \(f_\alpha\) are the generator and discriminator functions, with parameters \(\theta\) and \(\alpha\), respectively; z is the latent input to the generator and x is a sample drawn from \(\nu\). \(\mathbb {E}\) denotes the expectation, and the minimax objective is optimized to train the model used for solving the PDE.
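A minimal sketch of the minimax objective in Eq. 13 follows; it shows the generic WGAN training step, not the physics-informed construction of Gao and Ng, and all network sizes are hypothetical. The critic \(f_\alpha\) ascends the expectation difference while the generator \(g_\theta\) descends it.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))   # generator g_theta
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # critic f_alpha
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def wgan_step(real_x):
    z = torch.randn(real_x.size(0), 8)          # latent sample z ~ pi
    # Critic update: ascend E[f(g(z))] - E[f(x)] from Eq. 13 (minimize its negative).
    d_loss = -(D(G(z).detach()).mean() - D(real_x).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # A Lipschitz constraint on the critic (weight clipping or a gradient
    # penalty) is required in practice and is omitted here for brevity.
    # Generator update: descend E[f(g(z))] with respect to theta.
    g_loss = D(G(z)).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

real_batch = torch.randn(64, 2)   # placeholder samples standing in for data from nu
wgan_step(real_batch)
```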

Gao and Ng (2022) showed that the generalization error to be minimized for obtaining the PDE solution converges to the approximation error of the GAN model in the large-data limit, and they obtained promising results. A schematic of the WGAN used in their work is shown in Fig. 6.

Fig. 6
figure 6

Schematic of the GAN. The GAN is trained till the discriminator which compares the real data x and the generated data G(z) cannot distinguish between the two (Wang et al. 2017)

There are many types of GANs in the literature, and they have been applied to diverse engineering problems. The most popular application of GANs is image morphing, though they are not limited to it. Using GANs for solving PDEs is thus a unique application, because of the complexity of adapting GANs to specific framework problems like PDE solving. However, as seen with WGANs, progress in this area is being made and promising results have been obtained.

2.5 Transformers for PDE solving

Inspired by the initial work on Transformers (Vaswani et al. 2017), Shuhao Cao applied self-attention based Transformers to data-driven learning for PDEs (Cao 2021). He used a Hilbert-space approximation of operators and showed that soft-max normalization is not needed for the scaled dot product in attention mechanisms. He introduced a novel normalization layer which mimics the Petrov-Galerkin projection for scalable propagation through the attention-based layers in Transformers. The Galerkin attention operator uses the best approximation of f in the Hilbert-space norm \(||\cdot ||_H\), as given by Eq. 14:

$$\begin{aligned} ||f-g_\theta (y)||_H \le c^{-1}\hspace{3pt} min_{q \in \mathbb {Q}_h} \hspace{3pt}max_{v \in \mathbb {V}_h} \hspace{3pt} \frac{|b(v,f_h - q)|}{||v||_H} + ||f-f_h ||_H \end{aligned}$$
(14)

Here H is the Hilbert space, and \(f \in H\). \((\mathbb {Q},\mathbb {V})\) refer to the Query and Value subspaces used in the attention maps respectively. \(g_\theta (\cdot )\) is a learnable map of the Galerkin attention operator. \(f_h \in \mathbb {Q}_h\) is the best approximation of f in \(|| \cdot ||_H\), \(b(\cdot , \cdot ):V \times Q \rightarrow \mathbb {R}\) is the continuous bilinear form and c is the boundary condition limit. \(y \in \mathbb {R}^{n \times d}\) is the current latent representation.
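The sketch below is a simplified reading of the softmax-free (Galerkin-style) attention idea, with instance-wise layer normalization of K and V standing in for the learnable normalization of the original: the product is taken as Q(K^T V)/n rather than softmax(QK^T)V, so the cost scales linearly with the sequence length.

```python
import torch
import torch.nn as nn

class GalerkinStyleAttention(nn.Module):
    """Softmax-free attention: z = Q (norm(K)^T norm(V)) / n."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.norm_k = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, y):                      # y: (batch, n, d_model) latent repr.
        Q, K, V = self.q(y), self.k(y), self.v(y)
        n = y.size(1)
        # (K^T V) is d x d, so no n x n attention matrix is ever formed.
        z = Q @ (self.norm_k(K).transpose(1, 2) @ self.norm_v(V)) / n
        return z

attn = GalerkinStyleAttention(d_model=96)
y = torch.randn(4, 1024, 96)                   # hypothetical grid of 1024 points
z = attn(y)                                    # attention output, same shape as y
```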

This novel technique helped the Transformer model to obtain good accuracy in learning operators on un-normalized data. The three PDE problems used for experimentation were the Burgers' equation, the Darcy flow interface problem, and a coefficient identification problem. Cao called the improved Transformer the Galerkin Transformer; it demonstrated lower training cost and better performance than conventional counterparts. A schematic of the Galerkin attention mechanism is shown in Fig. 7.

Fig. 7
figure 7

Schematic of the Galerkin attention module used in the novel Transformer which is used to solve PDEs. Here (QKV) are the (Query, Key and Value) matrices of the input to the attention module. \(K^T\) denotes the transpose of the Key matrix K. z is the attention-value matrix which is the output of the attention module. Cao (2021)

Transformers were initially developed for Natural language processing (NLP) and were later adapted to computer vision in the form of Vision transformers (ViTs). Their recent application to PDE solving is an encouraging step, because the huge number of trainable parameters in Transformers can effectively absorb the large number of dependencies in PDEs. However, Transformers have their own issues, e.g. the need for very large amounts of training data, long training times, and performance that in some settings does not yet rival that of CNNs. Now, as more training data for semi-supervised, un-supervised, and formula-driven supervised learning become available, these issues are being addressed.

2.6 Deep reinforcement learning neural networks (DRLNNs) for solving PDEs

Deep reinforcement learning neural networks (DRLNNs) (Hafiz 2023; Hafiz et al. 2023) are deep networks that use Reinforcement learning (RL) (Hafiz et al. 2021), and they have also been used for solving PDEs (Han et al. 2018). Han et al. (2018) proposed a deep learning based technique capable of solving high-dimensional PDEs. They reformulated the PDEs using backward stochastic differential equations (BSDEs), and the solution gradient was approximated by deep neural networks using RL: the BSDE played the role of model-based RL and the solution gradient played the role of the policy function (Han et al. 2018). For the PDE given by Eq. 15, their model showed promising results, as shown in Table 1. The PDE used for experimentation was a high-dimensional (d = 100) example from Gobet and Turkedjiev (2017), given by Eq. 15 as:

$$\begin{aligned} \frac{\partial y(x,t)}{\partial t} + \frac{1}{2}{\Delta y(x,t)} + \text {min} \{1, \left( y(x,t) - y^{*}(x,t)\right) ^2\} = 0 \end{aligned}$$
(15)

where (t, x) are the temporal and spatial variables, and the oscillating exact solution \(y^{*}(x,t)\) is given by Eq. 16 as:

$$\begin{aligned} y^{*}(x,t) = k + \text {sin}\left( \lambda \sum _{i=1}^{d}{x_i}\right) \text {exp}\left( \frac{\lambda ^2 D(t-T)}{2}\right) \end{aligned}$$
(16)

Here D is the dimensionality of the system (D = d = 100 above) and T is the terminal time. In the above equations, k = 1.6, \(\lambda\) = 0.1, and T = 1.
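For reference, a small sketch evaluating the oscillating exact solution of Eq. 16 at the stated constant values (k = 1.6, λ = 0.1, T = 1, D = d = 100) is given below; the evaluation point at the origin is a hypothetical choice, used only to illustrate how a learned BSDE solution could be compared against the exact one.

```python
import numpy as np

k, lam, T, d = 1.6, 0.1, 1.0, 100      # constants from Eq. 16

def y_star(x, t):
    """Oscillating exact solution y*(x, t) of Eq. 16 for a batch of points x."""
    x = np.atleast_2d(x)               # shape (batch, d)
    return k + np.sin(lam * x.sum(axis=1)) * np.exp(lam**2 * d * (t - T) / 2.0)

x0 = np.zeros((1, d))                  # hypothetical evaluation point (the origin)
print(y_star(x0, t=0.0))               # reference value at (x, t) = (0, 0)
```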

Table 1 Experimental results obtained after solving the PDE given in Eq. 15 using the BSDE technique with varying numbers of hidden layers in the deep RL network (Han et al. 2018)

It is observed from Table 1 that the standard deviation (SD) is quite low and that both the mean error (%) and the SD decrease as the number of layers in the deep network increases. This is testimony to the fact that large-data models like DRLNNs are quite capable of solving complex high-dimensional PDEs. RL was initially limited to basic algorithms; it has since developed into a substantial field of research with numerous techniques for many applications. With the development of DRLNNs, the fields of deep learning and RL were merged, although the merger is not perfect due to their respective natures. There is also the potential issue of RL systems going 'rogue' through greedy reward maximization and harming their frameworks and environments. Nevertheless, DRLNNs offer solutions to many important modern-day problems and are generally well controlled (Raissi 2024; Siegel et al. 2023).

A graphical abstract of the above models for PDE solving is given in Fig. 8 in the form of a timeline. It can be observed from the timeline that the majority of large-data models for solving PDEs are neural networks. To summarize the main large-data models discussed in this work, the PDEs they solve, and the pros and cons of these models, we highlight the same in Table 2.

Fig. 8
figure 8

Timeline for using large-data models

Table 2 A summary of the notable large-data models used, the PDEs they solve, and their pros and cons

3 Current trends

There has been significant research on analyzing the generalization errors of techniques like PINNs (Mishra and Molinaro 2022; Penwarden et al. 2023) and DeepONet (Kovachki et al. 2021a; Lanthaler et al. 2022), both of which perform very well; Lu et al. (2022a) compared the efficiency of the two. Much current research builds incrementally on existing methods like PINNs. The model architecture for neural network training for solving PDEs consists of the input, the neural-net approximation, and the network loss function (Huang et al. 2022). A three-pronged approach has been used for improving performance, as discussed below.

  • Loss function: Improving the loss function is very useful for obtaining superior performance. Jagtap et al. (2020) improved PINNs with cPINNs and later XPINNs (Jagtap and Karniadakis 2021), using multiple domains and additional constraints. Kutz et al. and Patel et al. did the same by using parsimony (Nathan Kutz and Brunton 2022) and spatio-temporal schemes (Patel et al. 2022), respectively. Other studies have treated the initial and boundary conditions differently; for instance, Li et al. (2022) fused partial integration and level-set techniques.

  • Model used: Recurrent neural networks (RNNs), Generative adversarial networks (GANs), Long short term memory (LSTM) neural networks, and even attention-based Transformers have recently been used and have given better performance in solving PDEs (Ren et al. 2022; José et al. 2021; Cao 2021; Gao and Ng 2022; Meng et al. 2022). U-FNO with additional U-Fourier layers (Wen et al. 2022) and the attention-based FNO model (Peng et al. 2022) have been developed. Research has also been done on the optimization of activation functions (Wang et al. 2021; Yifan and Zaki 2021; Venturi and Casey 2023).

  • Input: Kovacs et al. (2022) used parameters that determine the eigenvalue-equation coefficients as additional inputs. Hadorn (Patrik 2022) improved the DeepONet model by allowing the base function to shift and scale the input values. Lye et al. (2021) improved the accuracy of the ML technique by using a multi-level approach.

The state-of-the-art technologies have led to much better capability. Lu et al. (2021a) crafted DeepXDE using PINNs and DeepONet. In industry, NVIDIA, the popular Graphics Processing Unit (GPU) manufacturer, has combined PINNs, DeepONet, and enhanced neural-operator techniques in a toolbox for building digital twins. Also, advances in quantum-based technology are favoring numerical optimization (Swan et al. 2021). In addition, more research (Wang et al. 2022) is being done on PDE solving via the introduction of Gaussian processes (Chen et al. 2021) and the use of hybrid Finite element method-neural network models (Mitusch et al. 2021). Li et al. have improved their earlier Fourier neural operator work and have come up with a model for operator learning (Kovachki et al. 2021b). A notable aspect of the newly developed model is its discretization invariance, which allows it to be used for a wider range of applications. As mentioned earlier, as in the case of DeepONet (Lu et al. 2021b), there is an effort to use state-of-the-art deep learning models in this area. Although there are numerous new techniques, they share a common thread: the boundary between theoretical process mechanisms and experimental data is being dissolved, and the fusion of these two aspects has improved profoundly (Nelsen and Stuart 2021; Kadeethum et al. 2021; Gupta and Jaiman 2022; Gin et al. 2021; Bao et al. 2020; Jin et al. 2022; Lu et al. 2022b). Using alternative techniques for developing large-data models, like Transformers (Cao 2021), is also an interesting pointer to the potential still to be unlocked in PDE solving. The diversity of large-data models available today offers rich choices for solving PDEs and has many potential applications (Antony et al. 2023; Li et al. 2023; Shen et al. 2023).

4 Issues and future scope

The mixing of scientific computing with deep learning is very likely to continue due to advances in technology and research (Huang et al. 2022). However, this trend is reaching a flash point: abundant computational resources are warranting new research directions. Also, the gaps between theoretical models and experimental data pose a problem, and their elimination is difficult by conventional means. Further, although CNNs have robust pattern recognition capabilities, their interpretability and inner workings are not extensively researched. New techniques like partial integration and numerical optimization have tried to relate the earlier knowledge of PDEs to the new big-data information from models like CNNs. Large-data models are often referred to as being 'data-hungry' due to their need for extensive training; as such, there needs to be research on ways to augment training data without the need for huge amounts of 'natural' data. This is an open research area. Again, interpretability and complexity raise issues, and addressing them remains an open problem.

Techniques can be coarsely classified into iterative numerical techniques and machine learning based techniques (Psaros et al. 2023). Numerical analysis directly depicts the DE mechanism, whereas ML techniques use probabilistic expressions for data characterization. However, for handling specific PDE-solving problems, approximation can also be used; an example is the parameter approximation used in CNNs for PDEs. Also, special networks like DeepONet may be used for the functional description of large-data physical models. These techniques are promising for solving PDEs, and using different large-data models allows their respective strengths to be exploited. In addition to using neural networks and bifurcated structures, different models may be used (Jagtap et al. 2022; Sirignano and Spiliopoulos 2018; Gupta et al. 2021). The reverse path, applying dynamical-system patterns for model enhancement, can also be used and can lead to substantial benefits for the interpretability of model learning. As mentioned above, one interesting area is the generation of training data for 'data-hungry' models without manual collection; an example is mathematical formula-based generation of training data, e.g., by Formula-driven supervised learning (FDSL) (Hafiz et al. 2023).

5 Conclusion

In this review paper, an overview of solving partial differential equations (PDEs) using large-data models was given. An introduction to the area was presented along with its publication trends. This was followed by a discussion of various techniques used for solving PDEs with large-data models. The large-data models discussed included Convolutional neural networks (CNNs), Recurrent neural networks (RNNs), Long short term memory (LSTM) neural networks, Generative adversarial networks (GANs), attention-based Transformers, and Deep reinforcement learning neural networks (DRLNNs). The pros and cons of these techniques were discussed, and a trend timeline was given. The major issues and future scope in the area were then discussed. Finally, we hope this literature survey becomes a quick guide for researchers and motivates them to consider using large-data models to solve significant problems in PDE based mathematical modeling.