1 Introduction

Over the past decade, machine learning (ML) methods have played a prominent role in scientific and engineering applications, learning to perform tasks that previously could only be done by humans. Famous applications include self-driving cars [1], natural language processing [2], and image recognition [3]. Many of these applications utilize deep learning (DL), a subset of ML that has attracted attention for its ability to learn mappings between input and output features. Accurate predictions by DL can be achieved by sufficiently training artificial neural networks (NNs).

In the field of computational mechanics, various types of boundary value problems (BVPs) described by partial differential equations (PDEs) are commonly solved using numerical methods such as the finite element method (FEM), the finite difference method, and mesh-free methods. Drawbacks of these numerical schemes include the computational cost, the complexity of mesh generation, and, more importantly, the fact that each simulation has to be performed almost from scratch for every scenario [4]. DL has been utilized as a promising alternative to avoid these problems. Two main streams exist in the application of DL to computational mechanics problems. One is supervised learning based on available labeled data. For example, Liang et al. demonstrated the potential of a DL model to act as a surrogate of FE analysis in estimating the stress distribution of cardiovascular vessels, allowing fast and accurate predictions of stress distributions in biomedical applications [5]. DL has also been applied to the prediction of adolescent idiopathic scoliosis [6] and pediatric spinal deformities [7], in which X-ray data is used as clinical input and the results of FEM calculations are employed as mechanistic input. Graph neural networks have been employed for the prediction of material concentration in neurite networks [8], and isogeometric analysis has been combined with convolutional neural networks (CNNs) for the prediction of neuron growth [9]. In addition, Li et al. proposed an encoder-decoder-based CNN for reaction-diffusion systems as a fast and accurate surrogate tool to FEM [10]. In material modeling, Hsu et al. presented a DL-based predictive model for crack propagation in crystalline materials using image data processed from visualized results of molecular dynamics (MD) simulations [11]. Furthermore, Fernandez et al. proposed a DL model of the constitutive behavior of grain boundaries, taking traction-separation effects into account, based on data obtained from MD simulations [12]. Studies have also been conducted on learning differential operators for PDEs from data [13, 14]. Many other DL models for computational mechanics problems have been developed within the scope of supervised learning; see [15,16,17,18,19,20] for examples.

The other common approach is unsupervised learning based on the governing equations of BVPs, requiring no labeled data for training. Originally proposed by Raissi et al., NNs trained with physics-based loss functions derived from BVPs are called physics-informed neural networks (PINNs) [21]. The key idea is to incorporate the governing PDEs directly into the loss functions of NNs with the power of automatic differentiation. Upon successful training, PINNs can accurately predict physical behaviors within the domain of a problem. The training can be seen as a minimization problem in which the residual of the PDEs serves as the objective function. Over the last five years, many researchers have tested the capability of PINNs to predict the behavior of physical systems. Jin et al. developed a PINN framework for the incompressible Navier–Stokes equations and verified its capability of obtaining approximate solutions to ill-posed problems with noisy boundary conditions and to inverse problems in the context of flow simulation [22]. Mao et al. modeled high-speed flow based on the Euler equations using PINNs [23]. Mahmoudabadbozchelou et al. presented non-Newtonian PINNs for solving coupled PDEs for fluids while considering the constitutive relationships [24]. Many other fluid-oriented applications of PINNs have also been investigated in recent years, such as [25,26,27,28]. For heat conduction problems, Zobeiry et al. applied the PINN architecture to the heat transfer equation with convective boundary conditions [29]. Cai et al. modeled heat convection with unknown boundary conditions and the two-phase Stefan problem [30]. Zhao et al. developed a combined framework of PINNs and CNNs for predicting temperature fields from heat source information [31]. Furthermore, Guo et al. worked on the prediction of three-dimensional transient heat conduction in functionally graded materials using the deep collocation method in space and the Runge–Kutta scheme for time integration, showing the applicability of PINN approaches to spatiotemporal, three-dimensional, complex-geometry cases [32]. Readers can also refer to [33,34,35,36,37] for other PINN examples on heat transfer problems. For solid mechanics problems, Samaniego et al. developed a variational energy-based physics-informed loss function for the classical linear elasticity problem and the phase-field model of fracture [38]. Abueidda et al. used the collocation method to solve solid mechanics problems with various types of material models, including hyperelasticity with large deformation [39]. Haghighat et al. demonstrated the applicability of PINNs to the von Mises plasticity model in their PINN framework [40]. Rezaei et al. proposed a PINN solver for solid problems with heterogeneous elasticity [41]. Harandi et al. solved the thermomechanically coupled system of equations in heterogeneous domains [42]. Bai et al. developed a modified loss function using the least squares weighted residual method for two- and three-dimensional solid mechanics, which predicts the displacement and stress fields well [43]. Other investigations of PINNs for solid mechanics can be found in [44, 45]. In addition, the idea of PINNs has been combined with isogeometric analysis for predicting material transport in neurons [46].

While previous works on PINNs have provided many discoveries and insights, it is vital to address some drawbacks to enhance applications. For example, a review paper pointed out that PINNs can fail to learn complex physics such as turbulent or chaotic phenomena [47]. Wang et al. provided a theoretical analysis of the convergence rates of the loss terms in PINNs, revealing why training PINNs may fail in some problem setups [48]; they proposed a neural tangent kernel-based loss-balancing method that reduces the effects of convergence rate discrepancies. Furthermore, PINNs need to be retrained when one wants to consider different boundary conditions or problem domains, although transfer learning can be utilized in this context [41, 49, 50]. As a newer class of DL models that avoids the latter problem, operator learning has been investigated in recent years as a surrogate for PDE solvers [51,52,53,54]. The idea is to learn an operator that maps between infinite-dimensional Banach spaces. Examples are Fourier neural operators (FNO) [55, 56], deep Green networks [57, 58], and deep operator networks (DeepONets) [59,60,61]. Operator learning can be done in both supervised and unsupervised manners. In the latter case, Wang et al. introduced a physics-informed DeepONet in which physics-informed loss functions from PDEs are used to train the neural operators [62]. Koric et al. compared the performance of data-driven and physics-informed DeepONets for the heat conduction problem with parametric source terms [63]. Li et al. introduced a physics-informed version of FNO that works in a hybrid manner to leverage known physics in FNO [55].

Another open problem in physics-informed deep learning is the failure to predict time-dependent evolutionary processes. Wang et al. argued that the causality of physics must be respected when training PINNs on time-continuous problems [64]; this is the case when the temporal dimension is treated directly as an additional dimension of the spatial domain. Mattey et al. developed a PINN model that enforces backward compatibility over the temporal domain in the loss function to overcome this limitation in the Allen-Cahn and Cahn-Hilliard equations [65]. Xu et al. utilized transfer learning for DeepONet to train the networks with better stability than the original DeepONet for dynamic systems [49]. Li et al. presented a phase-field DeepONet, which aims to predict the dynamic behavior of phase fields in the Allen-Cahn and Cahn-Hilliard equations using the concept of gradient flows [66]. In the latter framework, the trained networks act as an explicit time-stepper that predicts the evolution of the phase field at the next step based on the current phase field. Furthermore, an emerging approach for spatiotemporal predictions is the utilization of numerical discretizations or convolutions to discretize derivatives and learn a discrete mapping on a discretized domain. This direction can also enhance training efficiency by avoiding time-consuming automatic differentiation, especially when higher-order derivatives need to be computed. For static problems, Fuhg et al. proposed a deep convolutional Ritz method as a surrogate for numerical solvers, in which convolutions are exploited to take central differences and the network takes the energy form as a physics-informed loss [67]. Gao et al. utilized a CNN architecture to deal with the discretized domain and extended it to irregular domains through coordinate transformation [68]. Rezaei et al. devised a framework named finite operator learning (FOL), based on FEM, for parametrically solving PDEs, with a demonstration on a steady heat equation with heterogeneity [69]. Other works have applied FEM to integrate the weak-form loss into NNs, for example for advection-diffusion [70] and for quantifying wind effects on vibrations [71]. Furthermore, Khara et al. employed the energy form in an FEM-inspired loss function and demonstrated its performance on Poisson's equation, including a three-dimensional case [72]. When it comes to spatiotemporal problems, Geneva et al. proposed a CNN-based framework with an autoregressive encoder-decoder architecture, whose performance was showcased for several types of dynamic PDEs [73]. Ren et al. presented a discrete learning architecture that combines CNNs with long short-term memory for spatiotemporal PDEs [74]. Liu et al. embedded known PDE information into the CNN architecture itself to preserve the behavior of the PDE of interest for spatiotemporal dynamic phenomena [75]. Furthermore, Xiang et al. employed graph neural networks in combination with radial basis function finite differences to predict spatiotemporal dynamics on irregular domains [76]. The abovementioned works have shown the capability of NN-learned discrete mappings for spatiotemporal dynamics. However, researchers in this field are still looking for approaches that can easily handle irregular domains, which remains difficult for CNN-based methods in particular. In this sense, incorporating FEM into discrete operator learning for parametrically solving spatiotemporal PDEs is a promising direction for addressing more realistic problem setups.

Fig. 1 Schematic of training and evaluation parts in the proposed finite element-based physics-informed operator learning framework termed finite operator learning (FOL)

In this study, we aim to develop a novel physics-informed discrete-type operator learning framework, which we refer to as FOL, that can parametrically predict the dynamic behavior of physical quantities over time. A schematic illustration of the developed framework is provided in Fig. 1. The key idea is to provide the physical fields of a system at the current time step as input and return those at the next time step as output, realizing a surrogate model for time-marching numerical schemes. The time-dependent heat equation, also known as the transient heat conduction problem, is chosen as the target BVP to validate the framework proposed in this work. The study considers both homogeneous and heterogeneous thermal conductivity. The training follows a physics-informed loss function constructed from the finite element discretization of the heat equation [69], making it unsupervised learning without labeled data. The extension of the framework to irregular domains is tested at the end.

The difference from previous frameworks, such as those of [66] or [75], is that this framework directly uses the discretized weak-form loss that is identical to the formulation solved by FEM; this is also demonstrated in a representative model later in this paper. Furthermore, this study considers heterogeneity of the physical properties, which is not addressed in the aforementioned studies. For clarity, a comparison of the architecture with vanilla PINNs and the physics-informed DeepONet (PI-DeepONet) is given in Fig. 2. The pivotal difference is that in FOL we embed the coordinate information into the loss function, allowing us to merge the branch and trunk nets of DeepONet into a single network. In addition, unlike PINNs or PI-DeepONet, we do not take the temporal coordinate as input; FOL accounts for the temporal evolution by discretizing a given PDE in time between the current and next time steps.

This paper consists of five sections. Section 1 describes the background in the field of scientific machine learning with a focus on physics-informed deep learning and the objective of the present work. Section 2 briefly summarizes the formulation of the discretized heat equation in a weak form by FEM. Following that, the methodology, including the problem setup, developed operator learning framework, and procedure of the training data generation, is explained in Sect. 3. The results and discussion on the performance of the present framework, as compared to the reference solution by FEM, are reported in Sect. 4. Finally, the conclusion of the present work is provided along with the outlook in Sect. 5.

Fig. 2 Comparison of vanilla PINNs, physics-informed DeepONet (PI-DeepONet), and finite operator learning (FOL)

2 Discretized weak form of heat equation

In this work, we consider the transient heat conduction problem, described by the heat equation, as a benchmark problem to demonstrate the ability of the proposed framework. The heat equation describes how the temperature \(T(\varvec{x}, t)\), with \(\varvec{x}\) being the position and \(t\) the time, evolves in the domain \(\varvec{x} \in \Omega\) over time. Let the heat source be \(Q: \Omega \times \ \left( 0,~\tau \right) \rightarrow \mathbb R\), the boundary temperature \(T_d(\varvec{x}): \ \Gamma _d \ \times \ \left( 0,\tau \right) \rightarrow \mathbb R\), and the boundary heat source \(q_n: \Gamma _n \times \ \left( 0,\tau \right) \rightarrow \mathbb R\), where \(\Gamma _d\) is the part of the boundary on which the Dirichlet boundary condition is applied, \(\Gamma _n\) is the part of the boundary on which the Neumann boundary condition is applied, and \(t \in \left( 0,\tau \right)\) denotes the open temporal interval with \(\tau\) the end time. The strong form of the heat equation is given as,

$$\begin{aligned} c \rho \dot{T}(\varvec{x}, t)= -\text {div}(\varvec{q}) + Q \quad \text {in} \ \Omega \ \times \ \left( 0,\tau \right) , \end{aligned}$$
(1)

where c is the specific heat capacity, \(\rho\) is the density, \(\dot{T}\) represents the first-order partial derivative with respect to time, and \(\varvec{q} = -{k(\varvec{x})} \nabla T(\varvec{x},t)\) is the heat flux with \({k(\varvec{x})}\) the position-dependent thermal conductivity. The boundary and initial conditions are enforced by,

$$\begin{aligned}{} & {} T(\varvec{x},t) = T_d(\varvec{x}) \quad \text {on} \ \Gamma _d \ \times \ \left( 0,\tau \right) , \end{aligned}$$
(2)
$$\begin{aligned}{} & {} \nabla T(\varvec{x},t) \cdot \varvec{n} = q_n(\varvec{x}) \quad \text {on} \ \Gamma _n \ \times \ \left( 0,\tau \right) , \end{aligned}$$
(3)
$$\begin{aligned}{} & {} T(\varvec{x},0) = T_0(\varvec{x}) \quad \varvec{x} \in \Omega , \end{aligned}$$
(4)

where \(\varvec{n}\) is the outward normal vector. After multiplying by a test function, integrating over the domain, applying the Gauss theorem, and assuming a vanishing heat source \(Q\) as well as zero heat influx and outflux \(q_n\), one obtains the corresponding weak form for \(\Omega \ \times \ \left[ 0,~\tau \right]\) as,

$$\begin{aligned} \int _{\Omega } w c \rho \dot{T} d V + \int _{\Omega } {\nabla w}^T {k(\varvec{x})} \nabla T d V = 0, \end{aligned}$$
(5)

with the initial condition

$$\begin{aligned} \int _{\Omega } w c \rho T(t=0) d V= \int _{\Omega } w c \rho T_0 d V, \end{aligned}$$
(6)

where w is the test function defined on an appropriate function space. With the weak form at hand, one can arrive at the discretized weak form by the finite element method as,

$$\begin{aligned} \left( \varvec{M} + \alpha {\Delta t} \varvec{K} \right) \varvec{T}^{n+1} = \left( \varvec{M} - (1-\alpha )\Delta t \varvec{K} \right) \varvec{T}^{n}, \end{aligned}$$
(7)

where

$$\begin{aligned} \varvec{M} = \int _{\Omega } \varvec{N}^T (\rho c) \varvec{N} dV, \end{aligned}$$
(8)
$$\begin{aligned} \varvec{K} = \int _{\Omega } \varvec{B}^T k (\varvec{x}) \varvec{B} dV. \end{aligned}$$
(9)

In the formulation above, \(\alpha\) is a parameter selected from \(\{0, 0.5, 1\}\), corresponding to the forward Euler, Crank–Nicolson, and backward Euler time integration schemes, respectively; \(\varvec{T}\) is the vector storing the nodal temperature values; and the superscript \(n\) denotes the time step index. Here we introduce the shape function \(\varvec{N}\), its spatial derivative \(\varvec{B}\), and the thermal conductivity \(k(\varvec{x})\). At the element level, they are defined in the case of iso-parametric quadrilateral elements as,

$$\begin{aligned}{} & {} \varvec{N}_e = \left[ N_1 \cdots N_4 \right] , \end{aligned}$$
(10)
$$\begin{aligned}{} & {} \varvec{B}_e=\left[ \begin{array}{lll} N_{1, x} &{} \cdots &{} N_{4, x} \\ N_{1, y} &{} \cdots &{} N_{4, y} \end{array}\right] , \end{aligned}$$
(11)
$$\begin{aligned}{} & {} \varvec{k}_e = \left[ k_1 \cdots k_4 \right] ^T, \end{aligned}$$
(12)

where \(N_i\) and \(k_i\) denote the shape function and thermal conductivity for node \(i\), and the subscript \(e\) represents the element number. For Gaussian integration, the thermal conductivity is interpolated from the nodal values using the shape functions as,

$$\begin{aligned} k(\varvec{x}) = \varvec{N}_e \varvec{k}_e. \end{aligned}$$
(13)
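To make the element-level quantities concrete, the following is a minimal NumPy sketch (not the authors' implementation) of how \(\varvec{M}_e\) and \(\varvec{K}_e\) in Eqs. (8)-(9) can be assembled for one bilinear quadrilateral element with 2-by-2 Gauss quadrature, interpolating the nodal conductivities via Eq. (13); the function and variable names are illustrative.

```python
import numpy as np

# 2x2 Gauss points on the reference square; all weights equal 1
GAUSS = [(-1/np.sqrt(3), -1/np.sqrt(3)), ( 1/np.sqrt(3), -1/np.sqrt(3)),
         ( 1/np.sqrt(3),  1/np.sqrt(3)), (-1/np.sqrt(3),  1/np.sqrt(3))]

def shape(xi, eta):
    """Bilinear shape functions N_e (Eq. 10) and their derivatives
    with respect to the reference coordinates (xi, eta)."""
    N = 0.25 * np.array([(1-xi)*(1-eta), (1+xi)*(1-eta),
                         (1+xi)*(1+eta), (1-xi)*(1+eta)])
    dN = 0.25 * np.array([[-(1-eta), (1-eta), (1+eta), -(1+eta)],
                          [-(1-xi), -(1+xi), (1+xi),  (1-xi)]])
    return N, dN

def element_matrices(coords, k_nodal, rho_c):
    """M_e and K_e of Eqs. (8)-(9) by Gauss quadrature.

    coords  : (4, 2) nodal coordinates of the quadrilateral element
    k_nodal : (4,) nodal conductivities, interpolated as k = N_e k_e (Eq. 13)
    rho_c   : product of density and specific heat capacity
    """
    Me, Ke = np.zeros((4, 4)), np.zeros((4, 4))
    for xi, eta in GAUSS:
        N, dN = shape(xi, eta)
        J = dN @ coords                  # 2x2 Jacobian of the mapping
        B = np.linalg.solve(J, dN)       # derivatives w.r.t. (x, y), Eq. (11)
        detJ = np.linalg.det(J)
        k = N @ k_nodal                  # interpolated conductivity, Eq. (13)
        Me += np.outer(N, N) * rho_c * detJ
        Ke += B.T @ B * k * detJ
    return Me, Ke
```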

It is worth noting that in the practical implementation, one has to manipulate the left-hand side matrix \(\left( \varvec{M} + \alpha {\Delta t} \varvec{K} \right)\) and the right-hand side \(\left( \varvec{M} - (1-\alpha )\Delta t \varvec{K} \right) \varvec{T}^{n}\) based on the Dirichlet boundary conditions to appropriately impose fixed temperatures at desired boundary nodes.
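As a hedged illustration of this manipulation, the sketch below advances one time step of Eq. (7) and imposes the Dirichlet values by overwriting the corresponding rows of the system; this is one common strategy, not necessarily the one used in the reference FE solver.

```python
import numpy as np

def step_heat_fem(M, K, T_n, dt, dirichlet_nodes, dirichlet_values, alpha=1.0):
    """One time step of Eq. (7) with Dirichlet conditions imposed row-wise.

    M, K  : assembled mass and conductivity matrices (n_nodes x n_nodes)
    T_n   : nodal temperatures at the current step
    alpha : 0 (forward Euler), 0.5 (Crank-Nicolson), 1 (backward Euler)
    """
    A = M + alpha * dt * K                     # left-hand side of Eq. (7)
    b = (M - (1.0 - alpha) * dt * K) @ T_n     # right-hand side of Eq. (7)
    # Impose fixed temperatures: overwrite the rows of constrained nodes
    for node, value in zip(dirichlet_nodes, dirichlet_values):
        A[node, :] = 0.0
        A[node, node] = 1.0
        b[node] = value
    return np.linalg.solve(A, b)               # nodal temperatures T^{n+1}
```

With \(\alpha = 1\), this reduces to the backward Euler update whose residual later defines the FOL loss function (Sect. 3.2).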

3 Methodology

3.1 Problem setup

The dimensions of the problem domain and the boundary conditions are depicted in Fig. 3. One can imagine that a heat source supplies heat into the system from the left, while the right boundary is connected to a cold device that removes heat from the system. In this setup, the heat source on the left boundary is prescribed as a Dirichlet boundary condition with a temperature of 1.0 \(^\circ\)C, and the heat sink on the right boundary has a temperature of 0.0 \(^\circ\)C. A Neumann boundary condition that allows no heat transfer is applied to the top and bottom boundaries. This study considers two types of thermal conductivity distributions over the domain, homogeneous and heterogeneous, as shown in Fig. 4. Microstructures such as carbon fiber-reinforced plastics or architectural metamaterials are examples of the heterogeneous thermal conductivity case. Five different initial temperature fields, shown in Fig. 5, are considered to test the performance of the network prediction for different temperature inputs. The initial temperature field, represented by an 11 by 11 grid of linearly discretized finite element nodes, is input into a neural network as described in Sect. 3.2. Physical fields, such as temperature fields and heterogeneity maps, are then upscaled to a 165 by 165 grid using bilinear shape functions. Further details on the sample temperature fields used for training the DL model are provided in Sect. 3.3.

Fig. 3 Dimensions of the problem domain and the boundary conditions

Fig. 4 Two types of thermal conductivity distributions considered in this study

3.2 Proposed finite element-based physics-informed operator learning framework

The core idea of the FOL framework is to predict the physical fields at the next time step from their current state, which makes it equivalent in role to other time-marching FE solvers. The network architecture is shown in Fig. 6, implemented using the TensorFlow-based deep learning library SciANN [40]. The domain is first discretized by finite elements; see Fig. 7. The nodes of the discretized domain are the representative points at which the networks evaluate the temperature evolution. Analogous to the finite element method, Gaussian integration with the bilinear shape functions, shown on the right of Fig. 7, is performed to integrate over the elements. Regarding the network architecture, it is worth noting that separate feedforward NNs are used to predict each node's temperature output. In [69], the authors showed that separate networks with a small number of neurons per layer outperformed a single fully connected network with a large number of neurons per layer. Nevertheless, the same work also showed that a simple fully connected network with a properly reduced architecture performs very well in finding the correct solutions. The user therefore needs to study this matter according to the problem at hand and the nature of the equations and outputs. A comparison of the performance of the separated network architecture and the fully connected architecture is given in Sect. 4.5. All the separated NNs are trained simultaneously through a physics-informed loss function based on the input and output temperature fields. Substituting \(\alpha = 1\), which corresponds to the backward Euler approximation in time, into Eq. (7) and taking the L2 norm of the residual yields the loss function of this framework, which reads,

$$\begin{aligned} \mathcal {L} =\left\| \left( \varvec{M} + {\Delta t} \varvec{K} \right) \varvec{T}^{n+1}- \varvec{M} \varvec{T}^{n} \right\| \ \text {in} \ \Omega , \end{aligned}$$
(14)

where \(\Vert \cdot \Vert\) denotes the L2 norm. More concretely, \(\varvec{M}\) and \(\varvec{K}\) are constructed as,

$$\begin{aligned} \varvec{M} = {\varvec{\mathcal {A}}}_{e=1}^{n_{e l}}\sum _{j = 1}^{n_{gauss}} \varvec{N}^T_e(\varvec{\xi }_j)~\rho c~\varvec{N}_e(\varvec{\xi }_j) \det {J(\varvec{\xi }_j)}~\mu _j, \end{aligned}$$
(15)
$$\begin{aligned} \varvec{K} = {\varvec{\mathcal {A}}}_{e=1}^{n_{e l}}\sum _{j = 1}^{n_{gauss}} \varvec{B}^T_e(\varvec{\xi }_j)~k(\varvec{\xi }_j)~\varvec{B}_e(\varvec{\xi }_j) \det {J(\varvec{\xi }_j)}~\mu _j. \end{aligned}$$
(16)

Here, \({\varvec{\mathcal {A}}}_{e=1}^{n_{e l}}\) denotes the assembly of the element contributions from element 1 to element \(n_{el}\) (the total number of elements), \(n_{gauss}\) is the number of Gauss points, \(\varvec{\xi }_j\) is the coordinate of the \(j\)th Gauss point, and \(\mu _j\) is the corresponding quadrature weight. In Eq. (14), \(\varvec{T}^n\) and \(\varvec{T}^{n+1}\) are the input and output temperature fields of the network, respectively (i.e., the current temperature field and that of the next time step).
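In code, the loss of Eq. (14) amounts to a single matrix residual. A minimal JAX sketch follows, assuming \(\varvec{M}\) and \(\varvec{K}\) have already been assembled (e.g., as in Sect. 2) and that the network output has been completed with the Dirichlet nodal values; the names are illustrative.

```python
import jax.numpy as jnp

def fol_loss(T_next, T_curr, M, K, dt):
    """Physics-informed loss of Eq. (14) with alpha = 1 (backward Euler):
    the L2 norm of the discrete finite element residual."""
    residual = (M + dt * K) @ T_next - M @ T_curr
    return jnp.linalg.norm(residual)
```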

Fig. 5 Five types of initial temperature fields for evaluating the performance of the trained networks

Fig. 6 Network architecture and loss function used in the proposed framework

Fig. 7 Discretized domain by finite elements. The networks evaluate the yellow nodes in the training and prediction stages. The black nodes are removed from the training target by applying hard boundary conditions

To enforce Dirichlet boundary conditions, this framework employs hard-constrained boundary conditions for the nodes, focusing solely on predicting unknown temperatures. This procedure is once again very similar to the classical finite element approach. In Fig. 7, only the inner nodes, colored yellow, are evaluated and predicted through the networks. On the other hand, the black-colored nodes at the left and right boundaries are removed from the set of nodes used for training. However, it is worth mentioning that the influence of the Dirichlet boundary nodes is taken into account through the formulation of the physics-informed loss function in which the nodal field values of the Dirichlet boundary nodes are incorporated.
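A small sketch of this hard-constraint treatment, under the assumption of index arrays for the free (yellow) and Dirichlet (black) nodes, is given below; the full nodal vector assembled this way enters the loss of Eq. (14), so the boundary values influence the training even though they are never predicted.

```python
import jax.numpy as jnp

def full_field(T_free, free_idx, bc_idx, bc_vals, n_nodes):
    """Scatter network outputs (free nodes) and hard Dirichlet values
    (boundary nodes) into the complete nodal temperature vector."""
    T = jnp.zeros(n_nodes)
    T = T.at[jnp.asarray(free_idx)].set(T_free)
    T = T.at[jnp.asarray(bc_idx)].set(jnp.asarray(bc_vals))
    return T
```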

In the training, mini-batch learning with multiple input samples is employed to optimize the networks. The loss in each mini-batch iteration is defined with the mean squared error as

$$\begin{aligned} \mathcal {L} = \frac{1}{n_s} \sum _{i=1}^{n_s} \mathcal {L}_i^2, \end{aligned}$$
(17)

where \(n_s\) is the number of samples per mini-batch. The predictive performance of the trained networks is evaluated by the relative L2-norm error \(E_{rr}\), which reads,

$$\begin{aligned} E_{rr} = \frac{\Vert \varvec{T}_{NN} - \varvec{T}_{FE} \Vert }{\Vert \varvec{T}_{FE} \Vert }. \end{aligned}$$
(18)

In Eq. (18), \(\varvec{T}_{NN}\) is the temperature predicted by the NNs and \(\varvec{T}_{FE}\) is the corresponding FE solution, both stored in vector form. Finally, the networks are trained with this setup. The hyperparameters of the networks, as well as the material parameters used in the study, are summarized in Table 1. Regarding units, the temperature ranges from 0 to 1 \(^\circ\)C in this study, and the thermal conductivity is given in W/(m K).
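For completeness, Eqs. (17)-(18) translate directly into code; the sketch below reuses the hypothetical fol_loss from Sect. 3.2 and vectorizes it over a mini-batch with jax.vmap.

```python
import jax.numpy as jnp
from jax import vmap

def batch_loss(T_next_batch, T_curr_batch, M, K, dt):
    """Mini-batch loss of Eq. (17): mean of the squared per-sample losses."""
    per_sample = vmap(lambda Tn1, Tn: fol_loss(Tn1, Tn, M, K, dt))
    return jnp.mean(per_sample(T_next_batch, T_curr_batch) ** 2)

def relative_l2_error(T_nn, T_fe):
    """Relative L2-norm error of Eq. (18) against the FE reference."""
    return jnp.linalg.norm(T_nn - T_fe) / jnp.linalg.norm(T_fe)
```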

3.3 Generation of training samples

The input samples used for training are generated by combining three types of functions: a Gaussian distribution, a Fourier series, and constant fields. Some of the generated input samples are shown in Fig. 8. The first generator is based on the Fourier series. Temperature fields generated by the Fourier series with randomly generated amplitudes and frequencies have smooth distributions without steep gradients. The function is given as,

$$\begin{aligned} \begin{aligned} T(\varvec{x})&= \sum _{i=1}^{n_{sum}} [c_i + A_i \sin {(C_i\cdot x)} \cos {(D_i\cdot y)} \\ {}&\quad +B_i \cos {(C_i\cdot x)} \sin {(D_i\cdot y)} \\ {}&\quad + A_i\sin {(C_i\cdot x)} \sin {(D_i\cdot y)} \\ {}&\quad +B_i \cos {(C_i\cdot x)} \cos {(D_i\cdot y)}]. \end{aligned} \end{aligned}$$
(19)

In Eq. (19), \(\varvec{x} = (x, y)^T\), \(c_i\) is a real-valued constant, \(A_i\) represents the amplitude in the x-direction, \(B_i\) the amplitude in the y-direction, \(C_i\) the frequency in the x-direction, and \(D_i\) the frequency in the y-direction. These parameters generate different patterns of temperature distribution. The ranges from which the parameters \(c_i, A_i, B_i, C_i, \ \text {and} \ D_i\) are randomly drawn are listed in Table 2:

$$\begin{aligned} \begin{aligned} c_i&\in c_r = \left\{ r \ | \ a_c\le r \le b_c\right\} \\ A_i&\in A_r = \left\{ r \ | \ a_A\le r \le b_A\right\} \\ B_i&\in B_r = \left\{ r \ | \ a_B\le r \le b_B\right\} \\ C_i&\in C_r = \left\{ r \ | \ a_C\le r \le b_C\right\} \\ D_i&\in D_r = \left\{ r \ | \ a_D\le r \le b_D\right\} . \\ \end{aligned} \end{aligned}$$
(20)

For this training, \(n_{sum}\) was set to 50: a parameter set is generated \(n_{sum}\) times, and an input temperature sample is created by summing the \(n_{sum}\) Fourier terms with the prepared parameters. At the end, a normalization is performed to restrict the range between 0 and 1. In total, 1200 input samples were prepared with this procedure in this study. The second temperature generator is based on a Gaussian random process. An initial random pattern is generated node by node and then normalized between 0 and 1; this is repeated for each input sample produced by the Gaussian generator. With this generator, 1500 input samples were prepared for training. In addition to the samples from these two generators, input samples with varying constant temperatures are considered to increase the variety of the training data; this generator contributed 300 input samples. In total, 3000 input samples were eventually generated from the three types of functions to cover a wide range of temperatures in the training process.
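As an illustration, a possible implementation of the Fourier-series generator of Eq. (19) is sketched below; the parameter ranges stand in for Table 2 and are illustrative only, since the table values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_sample(xx, yy, n_sum=50, c_r=(0, 1), A_r=(-1, 1), B_r=(-1, 1),
                   C_r=(0, 20), D_r=(0, 20)):
    """One input temperature field from Eq. (19), normalized to [0, 1].

    xx, yy : nodal coordinate arrays of the grid
    c_r ... D_r : sampling ranges standing in for Table 2 (illustrative)
    """
    T = np.zeros_like(xx)
    for _ in range(n_sum):
        c = rng.uniform(*c_r)
        A, B = rng.uniform(*A_r), rng.uniform(*B_r)
        C, D = rng.uniform(*C_r), rng.uniform(*D_r)
        T += (c + A * np.sin(C * xx) * np.cos(D * yy)
                + B * np.cos(C * xx) * np.sin(D * yy)
                + A * np.sin(C * xx) * np.sin(D * yy)
                + B * np.cos(C * xx) * np.cos(D * yy))
    return (T - T.min()) / (T.max() - T.min())   # normalize to [0, 1]
```

For example, `x = np.linspace(0, 1, 11); xx, yy = np.meshgrid(x, x); T0 = fourier_sample(xx, yy)` yields one 11 by 11 input sample.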

Table 1 Summary of hyperparameters and material parameters
Fig. 8 Examples of input temperature samples generated by the Gaussian, Fourier, and constant-temperature generators

Table 2 Parameters used for generating temperature samples with the Fourier series

4 Results

Using the framework introduced in the previous section, we have attempted to predict the solution of the heat equation for any given initial temperature field. As a demonstration, we show here the results for three of the initial temperature fields considered in this work: \(T(\varvec{x}) = \frac{1}{2}\left( \sin (10 y) + 1 \right)\), a field drawn from the Gaussian distribution, and \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\). We confirmed that the chosen initial temperatures are not exactly represented in the training samples. Furthermore, as will be shown later, the temporal temperature evolution results in complex patterns that differ significantly from the training samples. It is important to note that since the training is entirely unsupervised, the network is not provided with solutions for these initial temperature samples. Hence, the subsequent results serve as rigorous tests of the network's performance. In the main study, we trained the NNs for 5000 epochs to ensure that the loss reaches a plateau, since convergence sets in after a few hundred epochs. This is confirmed in Fig. 9, although the loss continues to decrease slightly with further epochs. We do not employ a fixed stopping criterion on the residual value, as this is not the main scope of this work.

Fig. 9 Loss history for the homogeneous and heterogeneous thermal conductivities

In post-processing, we calculated the heat flux from the obtained temperature fields with respect to the discretized system to further investigate the physical behavior of the predicted heat conduction. For the homogeneous thermal conductivity, Fig. 10 shows that the FOL predictions agree with the reference FE solution, with a maximum error of about 0.003 in the nodal absolute temperature and 0.03 in the nodal absolute heat flux magnitude at \(t = 10 \Delta t = 0.5\) (s) for the initial temperature field \(T(\varvec{x}) = \frac{1}{2}\left( \sin (10 y) + 1 \right)\); in Fig. 10, \(T_{diff}\) and \(\varvec{q}_{diff}\) denote the nodal differences in temperature and heat flux magnitude between the prediction and the reference solution, respectively. The heat flux transition over time shows that the transient behavior is accurately predicted by the trained FOL framework. The absolute error distribution shown at the bottom of Fig. 10 looks rather random. The error can be further reduced by, for example, enhancing the variety of training samples, which leads to better coverage of possible temperature fields in the network input. Looking at the results in Fig. 11, we confirmed that even for a random temperature pattern generated from the Gaussian distribution, one obtains reasonable predictions, again with maximum nodal absolute errors of 0.003 in temperature and 0.03 in heat flux magnitude at \(t = 10 \Delta t = 0.5\) (s). Here, a "reasonable prediction" means that the relative L2 error against the finite element solution is small, i.e., the FOL prediction captures the important features of the temperature evolution; concretely, for \(N=11\), a relative L2 error in the temperature prediction below 0.1 is considered reasonable. In Fig. 12, where \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is the initial temperature field, the maximum nodal absolute temperature error at \(t = 10 \Delta t = 0.5\) (s) increased by a factor of 10 to approximately 0.03, while the maximum nodal absolute heat flux magnitude error increased by a factor of 5 to about 0.15. Nevertheless, the prediction correctly captured the main features of the temperature evolution seen in the reference FEM solution in the qualitative comparison of the results from FOL and FEM. For the heterogeneous thermal conductivity, Figs. 13, 14 and 15 likewise show agreement of the FOL predictions with the corresponding FEM solutions. The main difference from the homogeneous case is that the error accumulation in the temperature field is dominant around the inserted low-conductivity regions (i.e., phase boundaries) due to steep changes in the solution. To enhance the solution quality, one can train the neural network with additional input neurons and utilize advanced optimizers, such as L-BFGS, with hyperparameter tuning to avoid getting trapped in local minima. It is important to note that the prediction of the dynamic temperature evolution in the rest of the domain is reasonable. One can also confirm in Figs. 13, 14, and 15 that the red arrows representing the heat flux bypass the low-conductivity regions in both FOL and FEM. Furthermore, we also predicted more time steps, up to \(t = 50\Delta t = 2.5\) (s), to see whether long-term prediction is possible; see Fig. 16. The prediction was still in good agreement with the FE solution even at \(t = 50\Delta t\), although error concentration is seen in the low-conductivity region. Overall, the results of this demonstration on the transient heat conduction problem show that the proposed FOL framework for spatiotemporal PDEs is capable of predicting the solution, and even its spatial gradients, for a given time step size and boundary conditions.
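Conceptually, the multi-step predictions above are obtained by a simple autoregressive rollout of the trained network, along the lines of the sketch below; predict is an illustrative stand-in for the trained mapping from \(\varvec{T}^n\) to \(\varvec{T}^{n+1}\).

```python
def rollout(predict, T0, n_steps):
    """Autoregressive inference: feed each predicted temperature field
    back in as the input for the next time step."""
    history = [T0]
    for _ in range(n_steps):
        history.append(predict(history[-1]))
    return history  # fields at t = 0, dt, ..., n_steps * dt
```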

Fig. 10 Temperature predictions and obtained heat flux by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the homogeneous thermal conductivity and \(T(\varvec{x}) = \frac{1}{2}\left( \sin (10 y) + 1 \right)\) as the initial temperature field

Fig. 11 Temperature predictions and obtained heat flux by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the homogeneous thermal conductivity and a Gaussian distribution-based field as the initial temperature field

Fig. 12 Temperature predictions and obtained heat flux by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the homogeneous thermal conductivity and \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) as the initial temperature field

Fig. 13 Temperature predictions and obtained heat flux by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the heterogeneous thermal conductivity and the sinusoidal field \(T(\varvec{x}) = \frac{1}{2}\left( \sin (10 y) + 1 \right)\) as the initial temperature field

Fig. 14 Temperature predictions and obtained heat flux by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the heterogeneous thermal conductivity and a Gaussian distribution-based field as the initial temperature field

Fig. 15 Temperature predictions and obtained heat flux by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the heterogeneous thermal conductivity and \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) as the initial temperature field

Fig. 16 Temperature predictions and obtained heat flux up to \(t = 50 \Delta t = 2.5\) (s) by the networks (top), reference solutions by FEM (middle), and the difference between the predictions and reference solutions (bottom) in the case with the heterogeneous thermal conductivity and \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) as the initial temperature field

We also examined cross-sectional changes in temperature and heat flux magnitude for the heterogeneous case with the initial temperature field \(T(\varvec{x}) = \frac{1}{2}\left( \sin (10 y) + 1 \right)\) at \(t = 10 \Delta t = 0.5\) (s). As shown in Fig. 17, the predictions, represented by the dots, agree with the corresponding FE solutions, although some discrepancies are visible at \(t = 10 \Delta t\). This may also be due to the error accumulation problem. We conclude that, after sufficient training with a variety of training samples, the proposed framework can predict the dynamic behavior of transient heat conduction with acceptable accuracy.

Fig. 17 Cross-sections of temperatures and heat flux magnitudes at \(y = 0.18\), \(y = 0.45\), and \(x = 0.5\) obtained from the networks and the reference solutions by FEM in the case with the heterogeneous thermal conductivity and \(T(\varvec{x}) = \frac{1}{2}\left( \sin (10 y) + 1 \right)\) as the initial temperature field

4.1 Influence of training samples

This subsection examines the effect of the training samples on the prediction accuracy over time. The analysis includes not only the combination of Gaussian, Fourier, and constant fields but also combinations of two of the three temperature generators. The proportion of samples from each generator was chosen such that constant fields account for only \(10\%\) of the total training samples. For the dataset combining Fourier and constant fields, we prepared 2700 Fourier-type and 300 constant-type samples; similarly, for the dataset combining Gaussian and constant fields, we prepared 2700 Gaussian-type and 300 constant-type samples. To observe the impact of sample size on the prediction, a case with 1000 training samples, with the same generator proportions as the original dataset, was added. For this study, the NNs were trained for 1000 epochs with \(\Delta t = 0.05\) (s); this condition is kept for the studies in the following subsections as well. The results are shown in Figs. 18 and 19 for the homogeneous and heterogeneous thermal conductivities, respectively. Notably, the combination of Fourier and constant fields resulted in errors about twice as large as in the other three cases, including the case with 1000 samples. This suggests that the prediction accuracy increases as the training samples cover more of the possible temperature field patterns. The comparison of the temperature fields at \(t = 10\Delta t = 0.5\) (s) for the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) suggests that the fluctuation of the predictions around the reference FEM solution is mitigated by enriching the training samples. For the homogeneous thermal conductivity, the error sometimes decreased relative to previous time steps. This can happen in the present framework because the input temperature fields at time steps after the first are likely to resemble some of the training samples, resulting in better interpolation accuracy of the network prediction than at the previous step. A detailed discussion on this point follows in the next subsection on the influence of time step size. For the heterogeneous conductivity case, the prediction accuracy with the combination of Fourier and constant fields was low compared to the other three types of training samples, the same trend as in the homogeneous case. The contours on the right of the figure show that the NNs failed to correctly predict the temperature evolution, particularly in the low-conductivity regions, when only the Fourier and constant generators are used; see Fig. 19a. This also suggests that the frequency content and variety of the training samples greatly affect the prediction. Overall, employing diverse patterns in the training data sets proved effective in improving the predictive performance of the present framework.

4.2 Influence of time step size

The present FOL framework is flexible in terms of the time step size. Although one can choose an arbitrary time step size according to the time scale of the situation of interest, it is worth investigating how the choice of time step size influences the accuracy of the prediction over time, especially in terms of the number of time-marching steps. We used 3000 training samples from the three temperature pattern generators, and the networks were trained for 1000 epochs with homogeneous thermal conductivity. The average relative L2 error over the results from the five initial temperature fields is shown on the right side of Fig. 20, indicating that the error increases as the time step gets smaller. This can be partly attributed to the accumulation of errors due to successive inferences over time. In addition, the initial temperature field, being the most extreme of the input temperature fields presented to the framework throughout its temporal evolution, poses a significant challenge for the networks in accurately predicting the subsequent state. This difficulty arises because the extreme temperature field lies within the sparse part of the training sample distribution. However, after the first time step, the error was reduced in each case, especially for \(\Delta t = 0.01\) up to \(t = 0.2\), which was also seen in Fig. 18. The latter can be explained by the distribution of training samples: as the temperature field approaches a steady state, the NNs are more likely to have experienced patterns similar to the input temperature field during the training phase. On the other hand, for \(\Delta t = 0.01\), the error increases again after \(t = 0.2\), which may be due to error accumulation from multiple inferences. The overall results suggest that the time step size needs to be adjusted according to the target phenomenon to decrease the error accumulation that affects the accuracy of the prediction. In future studies, one could consider incorporating the time step size as an additional input and appropriately balancing the dynamical terms with the right-hand side of the equation using higher-order time integration algorithms.

Fig. 18 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the homogeneous thermal conductivity for different training data sets. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

Fig. 19 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the heterogeneous thermal conductivity for different training data sets. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

Fig. 20 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the homogeneous thermal conductivity for three different time step sizes. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

4.3 Influence of number of epochs and optimizer

The impact of the number of epochs and the optimizer on the prediction performance was also studied. The NNs were trained using 3000 samples and a time step of 0.05 (s). The results for different numbers of epochs and two optimizers, Adam and L-BFGS (a quasi-Newton optimization algorithm), are shown in Fig. 21. The results indicate that increasing the number of epochs generally improved the predictive performance. The log-log plot on the right of Fig. 21 further indicates an exponential decrease of the relative L2 error with an increasing number of epochs. Together with the loss history in Fig. 9, this implies that even a slight decrease in the loss value, namely the residual, improves the predictive accuracy. One should also keep in mind the trade-off between training cost and predictive accuracy. The temperature distributions at the bottom show that the main trend of the temperature evolution was sufficiently captured by FOL even with 500 and 1000 epochs, suggesting that one does not need to train the NNs for very many epochs to obtain reasonable predictions. In terms of optimizers, a comparison between Adam and L-BFGS for the same number of epochs shows that Adam outperformed L-BFGS in FOL. This may be due to insufficient hyperparameter tuning for L-BFGS or the smoothness of the optimization problem at hand. Another reason for the weaker performance of L-BFGS could be its approximation of the Hessian matrix, which estimates the curvature of the parameter space; as the parameter space grows, L-BFGS may not perform as well as Adam. A hybrid combination of Adam and L-BFGS could also be promising in the future; see [42, 77].
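For reference, a minimal Adam training loop in JAX/Optax is sketched below; it is a generic pattern, not the authors' SciANN setup, and loss_fn stands for the mini-batch loss of Eq. (17) as a function of the network parameters.

```python
import jax
import optax

def train(params, loss_fn, batches, lr=1e-3, epochs=1000):
    """Generic Adam loop: one gradient update per mini-batch per epoch."""
    opt = optax.adam(lr)
    opt_state = opt.init(params)

    @jax.jit
    def step(params, opt_state, batch):
        loss, grads = jax.value_and_grad(loss_fn)(params, batch)
        updates, opt_state = opt.update(grads, opt_state)
        return optax.apply_updates(params, updates), opt_state, loss

    for _ in range(epochs):
        for batch in batches:
            params, opt_state, loss = step(params, opt_state, batch)
    return params
```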

Fig. 21 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the homogeneous thermal conductivity for different numbers of epochs and optimizers. Right: Relative L2 error with increasing number of epochs when the Adam optimizer is employed. Bottom: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

4.4 Influence of the activation function

The study also investigated the impact of different activation functions. In addition to Swish, which was used in the other cases, the performance of the sigmoid, hyperbolic tangent (tanh), and ReLU functions was tested. The training was done for 1000 epochs with 3000 samples from the three types of temperature field generators. Fig. 22 confirms that Swish outperformed sigmoid and tanh in terms of the relative L2 error averaged over the five initial temperature cases. One possible reason for Swish's superior performance in this setting is the restricted temperature range between 0 and 1 in the present study; other activation functions may be viable options in problem setups with different value ranges. Comparing ReLU with Swish, the error was noticeably larger with ReLU at the first step. This could be due to the discontinuity of the ReLU derivative at zero, which affects learning in the vicinity of a temperature of 0 \(^\circ\)C.

4.5 Influence of network architecture

The present framework consists of separate NNs, one for each nodal temperature at the next time step. Other network architecture options can be conceived, such as a fully connected one that returns the whole output field from a single NN. To quantitatively assess the suitability of the present NN architecture, we evaluated three types of network architectures, including the original one used in the main study [69]. The architectures are compared in Fig. 23. Compared to the original architecture, the elementwise-connected architecture takes as input only the nodal temperatures from the elements adjacent to the output node. The fully connected architecture takes all nodal temperatures as input and returns the entire nodal temperature field at the next time step as output. In this study, 4 layers with 170 neurons each were selected so that the numbers of trainable parameters are comparable: the fully connected architecture has 121,139 trainable parameters, whereas the separated and elementwise-connected architectures have 110,979. Note that the node numbering starts at 2 because hard constraints are applied for the Dirichlet boundary conditions, eliminating those nodes from the training targets, as explained in Sect. 3.2. The results in Fig. 24 indicate that the two newly introduced architectures led to approximately a factor of 5 worse accuracy than the original architecture. Separating the NNs for different outputs while providing the entire problem domain as input was thus the most effective way to learn the mapping between the input and output physical fields in FOL. In terms of training cost, on the other hand, the fully connected architecture took 6 h 23 min 54 s, whereas the separated architecture took 7 h 29 min 30 s and the elementwise-connected architecture 7 h 40 min 32 s, trained with SciANN for 1000 epochs on a single GPU node with an NVIDIA GeForce RTX 2080 12GB. This indicates that the fully connected architecture can be an efficient option that still serves as a surrogate with sufficiently accurate predictions, which is also supported by the comparison of inference costs, where the fully connected architecture was measured to be 3.103 times faster than the separated architecture. Regarding improvements of the separated and elementwise-connected architectures, an interesting future direction is to investigate whether increasing the size of the element groups in Fig. 23b, which in the limit becomes identical to Fig. 23a, can enhance the predictive accuracy. Reducing the number of input dimensions of each NN in Fig. 23a, b reduces the number of network parameters in the present framework and hence the training cost; this would matter when applying FOL to models with more nodes than considered in this work. It would also be beneficial to reduce the number of neurons or layers in each network, helping to prevent overfitting. Nevertheless, for large-scale problems with many degrees of freedom, the fully connected architecture may be more advantageous, especially in combination with learning in a latent space obtained by information-condensing techniques such as autoencoders.
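To make the architectural difference concrete, the sketch below contrasts options (a) and (c) of Fig. 23 in plain JAX; the layer widths of the separated networks and the free-node count are illustrative assumptions, while the 4-by-170 fully connected configuration follows the study above.

```python
import jax.numpy as jnp
from jax import nn, random

def init_mlp(key, sizes):
    """One feedforward network with the given layer sizes."""
    keys = random.split(key, len(sizes) - 1)
    return [(random.normal(k, (m, n)) * jnp.sqrt(2.0 / m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = nn.swish(x @ W + b)   # Swish activation, as in the main study
    W, b = params[-1]
    return x @ W + b

n_free = 99  # free nodes for N = 11 with two Dirichlet edges (assumption)
key = random.PRNGKey(0)

# (a) Separated architecture: one small network per free node, each seeing
#     the whole input field and predicting a single nodal temperature.
separated = [init_mlp(k, [n_free, 10, 10, 1])     # widths are illustrative
             for k in random.split(key, n_free)]

# (c) Fully connected architecture: one network mapping the whole input
#     field to the whole output field (4 hidden layers of 170 neurons).
fully_connected = init_mlp(key, [n_free, 170, 170, 170, 170, n_free])
```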

Fig. 22 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the homogeneous thermal conductivity for three different activation functions. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

Fig. 23 Comparison of the three network architectures: a original architecture, b elementwise-connected architecture, and c fully connected architecture. The input and output fields do not include Dirichlet boundaries due to the hard boundary condition

Fig. 24 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the homogeneous thermal conductivity for three different network architectures. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

4.6 Influence of mesh size

Since the mesh size affects the prediction accuracy and the training cost, it is also pivotal to gain insights on this point. Three different mesh resolutions on the square domain were considered: the one used in the main study (\(N = 11\)) and two finer ones (\(N = 15, 21\)). Due to the increased training cost for finer meshes, this study was performed on the JAX platform, with an architecture equivalent to the SciANN-based code but faster training. The fully connected network architecture with 4 layers of 170 neurons each was employed. The batch size was set to 60 for 3000 samples and 100 for 5000 samples to keep the ratio of batch size to the total number of samples consistent. The obtained average relative L2 errors, along with the temperature distribution at \(t = 10 \Delta t = 0.5\) (s) for the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\), are shown in Figs. 25 and 26 for the homogeneous and heterogeneous thermal conductivities, respectively. In Fig. 25, the magnitudes of the average relative L2 error norm over time exhibit noticeable differences between \(N = 11\), \(N = 15\), and \(N = 21\). This could be due to the reduced number of trainable parameters per nodal evaluation for finer meshes. One can also confirm that enriching the training samples in the case of \(N = 21\) achieves a decent improvement in overall prediction accuracy. In Fig. 26, on the other hand, the error starts to blow up after several time steps when 3000 training samples are used for \(N = 21\). This is significantly mitigated by enlarging the training set from 3000 to 5000 samples, partly by adding higher frequencies to the Fourier series sample generator. This indicates that in the heterogeneous case the quality of the training samples critically affects the prediction accuracy on finer meshes, as shown for \(N = 21\), in contrast to the observation in Fig. 19, where no such extreme error evolution is seen. When it comes to the training cost, one has to pay the price for increasing the number of nodes in FOL: as shown in Fig. 27, the training time increased linearly with the number of nodes, and the homogeneous case with \(N=21\) required 2.645 times more training time than the same setup with \(N=11\); the same trend was confirmed for the heterogeneous case. Note that the training was completed within 30 min for all four cases using JAX. On the other hand, the larger the number of nodes, the greater the speedup FOL achieves over FEM, as shown in Fig. 28, given the same network architecture with varying input and output dimensions. This indicates the strong potential of FOL as a fast surrogate for conventional FE solvers for models with a large number of nodes. In application scenarios, the variety of training samples must therefore be adapted to achieve reasonable predictions, especially in heterogeneous domains. Further exploring the integration of this method with data-driven autoencoders presents intriguing possibilities for addressing potential constraints associated with increased mesh densities (i.e., higher resolutions), as highlighted in [75, 78, 79].

Fig. 25 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the homogeneous thermal conductivity for three different mesh sizes. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

Fig. 26 Left: Average relative L2 error norm from the five initial temperature fields over time in the case with the heterogeneous thermal conductivity for three different mesh sizes. Right: Temperature fields at \(t = 10\Delta t = 0.5\) (s) when the initial temperature field \(T(\varvec{x}) = 0.5x^2 |\sin (10x)+\cos (10y)|\) is given

Fig. 27

Normalized training time for three different mesh sizes with 3000 samples and a batch size of 60

Fig. 28

Speedup of the FOL evaluation time compared to the FE calculation time for three different mesh sizes

4.7 Capability of handling arbitrary domains

One of the main advantages of leveraging FEM in the context of finite-dimensional operator learning lies in the capability of handling arbitrary domains. To demonstrate the applicability of the present framework to such a scenario, we considered a different domain, shown in Fig. 29a, and additionally introduced heterogeneous thermal conductivity, shown in Fig. 29b. The training was performed with 3000 samples and \(\Delta t = 0.05\) (s) for 1000 epochs. Here, we generated the training samples from the Gaussian and constant-temperature generators, without the Fourier series, to reduce the complexity of sample generation; some of the samples are shown in Fig. 29c. Additionally, refer to the discussion in Sect. 4.1, where we demonstrated that accurate predictions can be obtained without relying on Fourier series-based samples. We tested the performance of the networks for the initial temperature fields \(T(\varvec{x}) = 0.5\) and \(T(\varvec{x}) = |\sin (10 x)|\).
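As a minimal sketch of such a Gaussian sample generator, the following JAX snippet draws nodal temperature fields from a Gaussian random process defined directly on the nodal coordinates; the squared-exponential kernel and its hyperparameters are illustrative assumptions, not the exact generator of this work. Because it operates on nodal coordinates rather than a structured grid, the same generator applies unchanged to the irregular, unstructured mesh.

```python
import jax
import jax.numpy as jnp

def gaussian_sample(key, coords, length_scale=0.2, variance=0.25, mean=0.5):
    """Draw one nodal temperature field from a Gaussian random process.

    `coords` is an (n_nodes, 2) array of nodal coordinates, so arbitrary
    (unstructured) meshes are handled without modification."""
    # Squared-exponential covariance between all nodal pairs.
    d2 = jnp.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    cov = variance * jnp.exp(-0.5 * d2 / length_scale**2)
    # Small jitter for numerical stability of the Cholesky factorization.
    L = jnp.linalg.cholesky(cov + 1e-6 * jnp.eye(coords.shape[0]))
    z = jax.random.normal(key, (coords.shape[0],))
    # Clip to the assumed admissible temperature range [0, 1].
    return jnp.clip(mean + L @ z, 0.0, 1.0)

# e.g. a batch of fields on a given mesh:
# fields = jax.vmap(lambda k: gaussian_sample(k, coords))(keys)
```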

Fig. 29

a Irregular domain discretized by quadrilateral elements and prescribed boundary conditions. b Introduced heterogeneous thermal conductivity. c Examples of the training samples for the irregular domain

Fig. 30

Temperature evolution from the FOL prediction (top), FE solution (middle), and the difference with \(T(\varvec{x}) = 0.5\) as an initial temperature field

Fig. 31

Temperature evolution obtained from the FOL prediction (top), FE solution (middle), and the difference with \(T(\varvec{x}) = |\sin (10 x)|\) as an initial temperature field

We confirmed that in both cases, as shown in Figs. 30 and 31, the overall solution trend obtained by the proposed FOL framework agreed with the FEM solution, with a maximum absolute error of around 0.1. The error was concentrated in the upper right and left parts of the domain at \(t = 10 \Delta t = 0.5\) and \(t = 50 \Delta t = 2.5\) (s), which is explained by the steep change in the temperature evolution due to the presence of heterogeneity. Furthermore, the error magnitudes did not differ much from those for the square domain in Figs. 13, 14, and 15. The results in Fig. 31 are particularly noteworthy: they demonstrate that FOL works for a problem setup with a complex initial temperature, heterogeneity, and an irregular domain discretized by an unstructured mesh, without cumbersome modification to the framework.

4.8 Computational cost and advantages of proposed framework

The main goal of this work is to establish a surrogate model for conventional numerical solvers. In this context, the prediction by the networks should be faster than the solution by numerical analysis. To quantitatively evaluate the speed of obtaining solutions with the FOL framework, the runtimes of network inference (with the same network architecture) and of the finite element calculation were measured for the same task. The measurement was performed on the same CPU platform and environment to ensure fairness. We assumed ten time steps in solving the heat equation with FEM; accordingly, the network evaluation was also performed ten times. As a result, the prediction with the separated network architecture was 10.8 times faster than the FEM solution for the same setup. This result suggests that the network has the potential to be used as a surrogate for classical numerical solvers.
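A minimal sketch of how such a comparison could be timed fairly is given below; `step_fn` stands for either one network evaluation or one FE solve of a single time step, and the warm-up pass, repetition count, and use of `jax.block_until_ready` are illustrative choices rather than the exact measurement procedure used here.

```python
import time
import jax

def time_rollout(step_fn, T0, n_steps=10, n_repeat=20):
    """Average wall-clock time of a ten-step rollout of `step_fn`
    (one FOL inference or one FE solve per step; hypothetical harness)."""
    # Warm-up pass so JIT compilation is excluded from the timing.
    T = T0
    for _ in range(n_steps):
        T = step_fn(T)
    jax.block_until_ready(T)

    t0 = time.perf_counter()
    for _ in range(n_repeat):
        T = T0
        for _ in range(n_steps):
            T = step_fn(T)
        # Force completion of asynchronous dispatch before stopping the clock.
        jax.block_until_ready(T)
    return (time.perf_counter() - t0) / n_repeat
```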

Although faster inference than a classical solver can be achieved with the proposed framework, one has to train the NNs for a relatively long time. For example, training for 1000 epochs on the problem setup shown in Figs. 3 and 7 took approximately 7 h on a single GPU node with an NVIDIA GeForce RTX 2080 12GB. However, training is much faster using JAX, a high-performance machine learning framework with features well suited to deep learning, such as just-in-time compilation and vectorization. In addition, training NNs in FOL is usually a one-time investment: once trained, the network can be applied to any admissible input and yields the solution much more quickly than numerical solvers, even for models that require many nodes to solve accurately.
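In practice, the JAX features mentioned above amount to wrapping the parameter update in `jax.jit` so that the loss and its gradient compile once into a fused kernel. The following is a minimal sketch under that assumption; `loss_fn`, the `optax` optimizer, and the learning rate are illustrative stand-ins, not the exact training code of this work.

```python
import jax
import optax  # assumed optimizer library

optimizer = optax.adam(1e-3)  # hypothetical learning rate

def make_train_step(loss_fn):
    """Build a JIT-compiled update step for a physics loss of the form
    loss_fn(params, batch) -> scalar; compilation happens on the first
    call, after which each step runs the fused, vectorized kernel."""
    @jax.jit
    def train_step(params, opt_state, batch):
        loss, grads = jax.value_and_grad(loss_fn)(params, batch)
        updates, opt_state = optimizer.update(grads, opt_state, params)
        return optax.apply_updates(params, updates), opt_state, loss
    return train_step
```

In a typical usage, `opt_state = optimizer.init(params)` is created once before the training loop, and `train_step` is then called per batch.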

In summary, the developed physics-informed operator learning framework has several advantages over other deep learning-based methods. First, the training of the networks is completely unsupervised: unlike data-driven deep learning models, there is no need to prepare an extensive dataset from costly simulations or experiments. Instead, a dataset of random temperature patterns generated by a Gaussian random process and Fourier series, combined with constant temperature fields, is used for training. This approach covers a wide range of possible temperature cases without relying on labeled data. Additionally, the framework utilizes shape functions for spatial discretization and a backward difference approximation for temporal discretization. The resulting loss function is purely algebraic and, like data-driven loss functions, eliminates the need for time-consuming automatic differentiation during weight and bias optimization, resulting in faster training. Furthermore, as shown in the previous subsection, the present framework handles irregular domains quite easily, along with heterogeneity in the domains, thanks to the features of the finite element method, which will be helpful in many engineering applications. Lastly, other types of spatiotemporal PDEs, such as the Allen-Cahn or Cahn-Hilliard equations, could be incorporated into this framework given the corresponding finite element formulations, making it usable in the context of other physics.
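To make the algebraic character of the loss concrete, a minimal sketch of one plausible form is given below; the squared norm of the discrete residual is an assumption for illustration, and the actual weighting and assembly in this work may differ.

```python
import jax.numpy as jnp

def backward_euler_residual_loss(T_new, T_old, M, K, dt):
    """Purely algebraic physics loss for one implicit time step.

    M and K are the FE mass and conductivity matrices assembled once
    from the shape functions; the backward-difference residual
        r = M (T_new - T_old) / dt + K T_new
    is a fixed linear map of the network output T_new, so no automatic
    differentiation through the PDE operator is required.  The
    squared-norm form of the loss is an illustrative assumption."""
    r = M @ (T_new - T_old) / dt + K @ T_new
    return jnp.sum(r ** 2)
```

Here `T_new` would be the network prediction for the next time step given `T_old` as input, so minimizing this loss over random input fields trains the network to act as a one-step solution operator.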

5 Conclusion

This study has presented a novel physics-informed operator learning framework based on a finite element discretization scheme for spatiotemporal PDEs. After training with various temperature fields, including those generated by the Gaussian random process and Fourier series as well as constant temperature fields, the network can accurately predict dynamic temperature evolutions for arbitrary temperature inputs within the assumed temperature range. This is achieved with a relative L2 error below 0.1 in most cases, without retraining, under fixed boundary conditions and domain. The applicability of the method to heterogeneous heat conductivity and irregular domains has also been confirmed. Additionally, the suggested network design achieves more than ten times the speed of the corresponding FEM solver on the same platform. It is important to note that the training is conducted entirely without ground-truth data from numerical simulations, making the framework a completely unsupervised learning approach. Furthermore, the training efficiency is enhanced compared with other operator learning approaches that rely on time-consuming automatic differentiation, because the current framework uses an FE-based discretization for space and a backward difference approximation for time. In summary, this work explores the development of deep learning-based surrogates for dynamic physical phenomena without the need for supervised learning.

On the other hand, although the proposed framework offers useful features, some limitations could be addressed in future work. Firstly, the heat conductivity could be made a training input in addition to the temperature field, which would make the framework flexible for various micro-morphologies with phase-field modeling in mind. Accuracy might also be improved by implementing a higher-order temporal discretization scheme, and one could consider different network architectures that take multiple temperature fields as input. Moreover, this study focused on transient heat conduction to showcase the performance of the framework; since the framework is designed as a generic deep learning framework for spatiotemporal dynamics described by PDEs, it could be extended to other types of spatiotemporal PDEs, such as the convection-diffusion, Allen-Cahn, or Cahn-Hilliard equations. Additionally, although the present study utilizes only bilinear interpolation, which works for the majority of possible applications, the framework could be combined with higher-order basis functions, such as quadratic ones, for a better representation of curved geometry and further accuracy in prediction. Lastly, to efficiently handle large models with many nodes, a reduced parametric space, illustrated in Fig. 32, could be introduced by employing techniques such as autoencoders [78,79,80].

Fig. 32

Schematic of employing autoencoder in FOL for efficient learning in a reduced parametric space