1 Introduction

1.1 Background

Conservation laws, which are regarded as fundamental laws of nature, have wide-ranging applications in various fields such as physics, chemistry, biology, geology, and engineering. One of the significant branches in computational mathematics involves solving the differential equations associated with conservation laws, and several effective methods have been proposed in this regard, including the ones discussed in [31, 36, 37, 51, 52, 59, 81]. However, it is well known that solutions to conservation laws are prone to developing discontinuities in finite time, even for smooth initial data [19, 36, 49, 52]. Scalar nonlinear convection-diffusion partial differential equations (PDEs) of the form

$$\begin{aligned} u_t + f(u)_x&= \alpha A(u)_{xx}, \qquad A(u)= \int _{0}^{u} a(v) dv, \qquad a(v)\ge 0 \nonumber \\ u|_{t=0}&=u_0(x) \nonumber \\ u_x|_{x=0}&=u_x|_{x=L}=0 \end{aligned}$$
(1.1)

where \(x\in [0,L]\), arise in different applications such as sedimentation of particles in a liquid and various traffic-flow-type problems. Here \(u = u(x, t)\) is the main variable, which depends on the position x and the time t. The flux function f(u) represents the convective transport, whereas the diffusion function, denoted by \(A(u) = \int _{0}^{u} a(v) dv\), is a function of u that describes the diffusive transport. In a typical situation we have a priori information about the magnitude of the scaling factor \(\alpha >0\) but no precise information about the functional form of either A(u) or f(u). Our goal is to determine analytical expressions for both f(u) and A(u) using observational data of different types.

1.2 Mathematical framework associated with (1.1)

Focusing on the Cauchy problem associated with (1.1), weak solutions must be considered in the sense of the integral equality

$$\begin{aligned}&\int _0^T\int _{\mathbb {R}} \Bigl (u\varphi _t + (f(u)-A(u)_x)\varphi _{x} \Bigr ) \,dx\,dt + \int _{\mathbb {R}} u_0(x)\varphi (x,0)\,dx =0 \end{aligned}$$
(1.2)

for all test functions \(\varphi \in C_c^{\infty }(\mathbb {R}\times [0,T))\) where \(u\in C([0,T];L^1(\mathbb {R}))\cap L^{\infty }( \mathbb {R}\times [0,T) )\) and \(A(u)_x \in L^{\infty }(\mathbb {R}\times [0,T) )\). Herein, we have set \(\alpha =1\) without loss of generality. If A(u) is constant on an interval then weak solutions are not uniquely determined from the initial data. An additional entropy condition must be imposed to determine the unique physically relevant solution. More precisely, this unique weak entropy solution u can be characterized by the following integral inequality when (1.1) is considered on \(\mathbb {R}\) [10, 86]

$$\begin{aligned}&\int _0^T\int _{\mathbb {R}} \Bigl (|u-k|\varphi _t +\text {sgn}(u-k)\Bigl (f(u)-f(k) - A(u)_x\Bigr )\varphi _{x} \Bigr ) \,dx\,dt \nonumber \\&\quad + \int _{\mathbb {R}} |u_0(x)-k|\varphi (x,0)\,dx \ge 0 \end{aligned}$$
(1.3)

for all nonnegative \(\varphi \in {C}_c^{\infty }(\mathbb {R}\times [0,T))\) and any constant \(k\in \mathbb {R}\). It is instructive to note that from (1.3) the following characterization of a jump \((u_l,u_r)\) moving with a velocity s can be extracted [27]:

  1. (i)

    The shock speed s is given by the Rankine-Hugoniot type of condition

    $$\begin{aligned} s=\frac{f(u_r)-f(u_l) - \Bigl (A(u)_x|_{r} - A(u)_x|_{l} \Bigr )}{u_r-u_l} \end{aligned}$$
    (1.4)

    where \(u_l\) and \(u_r\) are the left and right limits of u(x, t) at a discontinuity moving in the \(x-t\) plane, whereas \(A(u)_x|_{l}\) and \(A(u)_x|_{r}\) are the corresponding left and right limit values of \(A(u)_x\).

  2. (ii)

    For all \(v\in [\min \{u_l,u_r\},\max \{u_l,u_r\}]\), the following entropy condition holds

    $$\begin{aligned} \frac{f(u_r) - f(v) - A(u)_x|_{u_r}}{u_r-u_l} \le s \le \frac{f(u_l) - f(v) - A(u)_x|_{u_l}}{u_r-u_l}. \end{aligned}$$
    (1.5)

For scalar conservation laws with \(a(u)=0\), the entropy framework (usually called entropy conditions) was introduced by Kruzkov [50] and Volpert [85], while for degenerate parabolic equations entropy solutions were first considered by Volpert and Hudajev [86]. Uniqueness of entropy solutions was first proved by Carrillo [10]. Finite difference and finite volume schemes for degenerate equations were analysed by Evje and Karlsen [26,27,28] (upwind difference schemes), Holden et al. [37] (operator splitting methods), Kurganov and Tadmor [51] (central difference schemes), Bouchut et al. [6] (kinetic BGK schemes), Afif and Amaziane [1] and Ohlberger, Gallouet et al. [29, 70] (finite volume methods), and Cockburn and Shu [16] (discontinuous Galerkin methods). Rigorous estimates of the convergence rate of finite volume schemes for degenerate parabolic equations were proved in [45, 46]. More recently, convection-diffusion equations have also been explored where f(u) and A(u) as well as the initial data are assumed to be known only in a stochastic sense. The problem of random initial data with \(a(u)=0\) was considered in [66], where the existence and uniqueness of a random entropy solution was shown, as well as convergence of Monte Carlo based finite volume discretizations. In [48] the mathematical framework was extended to include degenerate convection-diffusion equations with random convective and diffusive flux functions with possibly correlated random perturbations.

1.3 The inverse problem of (1.1) when \(a(u)=0\)

Previous research on inverse problems for nonlinear conservation laws where \(a(u)=0\) includes a sophisticated methodology introduced by James and Sepúlveda [40, 41]. They formulated the identification of f as an optimization problem with a cost functional J that measures the distance between the observed solution and the PDE solution at a fixed time point. Under the assumption of a strictly convex flux f (and \(a(u) = 0\)), Kang and Tanuma [44] show that f can be identified uniquely from initial data with compact support and observations of a single shock that is formed after a sufficiently large time.

Over the past decade, the rapid advancement in machine learning, data science, and computing power has led to the widespread adoption of modern machine learning techniques, particularly deep neural network models, in various fields of research and application. These techniques have proven particularly effective in the context of recovering ordinary differential equations (ODEs) [12, 13, 15, 21, 35, 76] and partial differential equations (PDEs) [7, 8, 30, 32, 33, 43, 53, 73, 89]. These developments have opened up new avenues for data-driven modeling and have enabled researchers to extract meaningful insights from complex, high-dimensional data. Raissi et al. [73] introduced physics-informed neural networks (PINNs) for solving two main classes of problems: data-driven solution and data-driven discovery of partial differential equations. They suggested that if the considered PDE is well-posed and its solution is unique, then the PINN method is capable of achieving good predictive accuracy given a sufficiently expressive neural network architecture and a sufficient number of collocation points. The method was explored for the Schrödinger, Allen-Cahn, and Korteweg-de Vries (KdV) equations in one dimension (1D) and the Navier–Stokes equations in two dimensions (2D). However, neural network methods are not straightforward to use in the context of the nonlinear hyperbolic PDE that governs two-phase transport in porous media [30]. The authors experimentally indicate that this shortcoming of PINN for hyperbolic PDEs is related to the lack of regularity in the solution. For the same reason it is natural to expect that the fact that solutions of (1.1) cannot be understood in a classical sense, but only in the sense of (1.3), hampers the use of various modern neural network-based methods.

1.4 Problem statement and purpose of this work

In this work the focus is on learning the functional form of both the flux function f(u) and the diffusion function A(u) in the degenerate convection-diffusion model (1.1), where u is the primary variable. The main challenges associated with this problem are:

  1. (i)

    It is well known that the solutions of (1.1) typically develop discontinuities in accordance with the formulation (1.3) [10, 27, 86], as expressed by (1.4) and (1.5). This lack of regularity prevents direct use of various modern learning methods mentioned above.

  2. (ii)

    When a(u) vanishes over intervals in u, jumps arise and disappear in the solution over the time period for which observation data is collected, and the data may therefore lack information about f(u) in these intervals [53]. In the current problem, there is a blend of effects from f(u) and A(u) present in the observation data, which might further complicate the identification of f and A. An illustration of this situation is given in Fig. 1. In panels (a) and (b), respectively, we plot the flux function \(f(u)=u^2/(u^2 + (1-u)^2)\) and the diffusion function

    $$\begin{aligned} a(u) = {\left\{ \begin{array}{ll} 0.0, &{} \text {if } u \in [0, 0.8]\\ 1.0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
    (1.6)

    In panel (c) the entropy solution after a time \(T=0.25\) is shown (red solid line). At the time \(t=0\), the initial data \(u_0(x)\) involves one jump at \(x=1\) and another jump at \(x=2\). The initial jump at \(x=1\) is instantly transformed into a solution that is a combination of a continuous wave solution and a discontinuous wave \((u_l,u_r)\approx (0.4,0.8)\). Similarly, the initial jump at \(x=2\) is transformed into a solution that is a combination of a continuous wave solution and a discontinuous wave \((u_l,u_r)\approx (0.8,0)\). From this example, we see that we have no observation data that can directly reveal the shape of f(u) and a(u) in the interval \(u\in [0.4,0.8]\) (approximately). In panel (c) we also include the solution with \(a(u)=0\) (dashed blue line). In particular, we see that the evolving discontinuities are different, as the discontinuities obeying the entropy condition (1.5) are clearly affected by the presence of the diffusive flux term A(u); see a plot of it at time \(T=0.25\) in panel (d).

  3. (iii)

    A third issue is the role played by selecting different types of observation data. Are some observation data better than others to identify the unknown f(u) and A(u)?

Fig. 1
figure 1

a Nonlinear flux function \(f(u)=\frac{u^2}{u^2 + (1-u)^2}\). b Diffusion function (1.6). c The solution u(x, T) of (1.1) at time \(T=0.25\) is shown (red solid curve) together with its initial data \(u_0(x)\) (red dashed line). Corresponding solution of (1.1) with \(a(u)=0\) is also shown (blue dashed line). d Illustration of A(u(x, T)) with \(T=0.25\) (color figure online)

The methodology we rely on represents a generalization from the pure conservation law case with \(a(u)=0\) explored in [53, 55] to the degenerate convection-diffusion case (1.1). The main contributions of this work include:

  1. (i)

    We introduce a novel approach for learning the functional forms of both the flux function f(u) and the diffusion function A(u) in the degenerate convection-diffusion model (1.1), where u is the primary variable. Our approach is based on combining symbolic multilayer neural network functions representing f(u) and A(u) with an entropy consistent discrete numerical scheme (ECDNS) for (1.1) which is known to converge to the correct entropy solution [27]. Relying on observation data of the form \(u(x_j,t_i)\) at fixed positions \(x_j\) and times \(t_i\), this allows us to accurately learn both the unknown flux and diffusion functions using significantly less observational data compared with, for example, [88].

  2. (ii)

    Going beyond the reliance on the equation’s solution in terms of \(u(x_j,t_i)\) as observational data, we investigate the same physical system from a distinct perspective. Specifically, we consider a scenario where the system is made up of particles that follow the flow field described by u(x, t) governed by (1.1). That is, particle \(z_j(t)\) is governed by

    $$\begin{aligned} \dot{z_j}(t)=w(u(z_j(t),t)), \qquad z_j(t=0)=x_j^0 \end{aligned}$$

    for a given function w(v). Using particle-based observational data, the proposed method generates quite remarkable outcomes despite the fact that the number of observation data points is much lower than in point (i).

  3. (iii)

    Representation of unknown functions can take many forms, and in this paper, we employ the neural network S-Net as our chosen method. In addition, we conduct a comparative experiment using the piecewise affine functions method described in [24]. The experimental results demonstrate the clear advantage of neural networks in terms of effective and robust identification of the unknown functions f and A.

  4. (iv)

    The study provides insight into the relative impact of the flux function f(u) and the diffusion function A(u). In particular, accurate learning of a(u) is difficult as the impact from a(u) goes through the integrated form \(A(u)=\int ^u a(s)\,ds\). More precisely, since the total flux \(f(u)-A(u)_x\) strongly impacts the evolution of (1.1), accurate identification of f(u) and A(u) is obtained, whereas the corresponding \(A'(u)=a(u)\) may differ from the ground truth a behind the observation data.

2 Related work

The inverse problem of flux identification, which involves minimizing a suitable cost function, was formulated by James and Sepúlveda [42]. They demonstrated, through the use of a viscous approximation, that the perturbed problem converges to the original hyperbolic problem as the viscous term vanishes. Holden et al. [38] employed the front-tracking algorithm to reconstruct the flux function from observed solutions with appropriate initial data. More recently, there have been several studies aimed at reconstructing the flux function for sedimentation problems, where a flocculated suspension separates into a concentrated sediment and a clear fluid [9, 23]. In [24], Diehl proposed a direct inversion method that utilizes linear combinations of finite element hat functions to represent unknown flux and diffusion functions. In our paper, we use this approach as a comparative method.

The last decade has seen remarkable progress in the field of machine learning which has enabled several notable efforts aimed at uncovering the governing equations for nonlinear dynamical systems. Among these, the SINDy (Sparse Identification of Nonlinear Dynamics) model [8] has emerged as a breakthrough tool for addressing this challenge. SINDy was developed to recursively identify the sparse solution from a pre-defined basis function library using a sequential threshold ridge regression algorithm. SINDy quickly emerged as one of the leading methods in this field of study, igniting significant interest [11, 14, 20, 69, 79]. However, the success of this sparsity-promoting approach relies on a well-defined candidate function library, which requires prior knowledge of the system. It is also limited by the possibility that a linear combination of candidate functions may be insufficient to capture complex mathematical expressions. Furthermore, as the library size increases, it may fail to maintain the sparsity constraint, according to empirical observations.

A symbolic regressor offers great flexibility in model selection by allowing the free combination of mathematical operators and symbols to identify the governing equations that best describe the underlying physics. This approach to data-driven discovery of nonlinear dynamics was first introduced in [5] and [80], which recast the problem as a symbolic regression task and used genetic programming (GP) to solve it [4]. This idea has inspired a series of subsequent endeavors [18, 34, 84]. More recently, Deep Symbolic Regression (DSR) [68, 71] was introduced as a reinforcement learning-based model that generally outperforms GP-based models. However, this approach is built upon ad-hoc steps and, to some extent, lacks flexible automation in equation discovery.

Mesh-based simulations have made significant progress recently [33, 39, 72, 89], offering faster runtimes than principled solvers and greater adaptivity to the simulation domain compared to grid-based convolutional neural networks (CNNs) [83, 87]. However, the efficacy of this approach was tested on one-dimensional conservation law problems in [56], revealing that when the parameters deviate significantly from those used in training, the predictions deviate substantially.

Attempts have been made to tackle nonlinear dynamics discovery problems by introducing neural networks with activation functions replaced by commonly seen mathematical operators [47, 62, 63, 65, 78]. Long et al. [62, 63] proposed a combination of numerical approximation of differential operators by convolutions and a symbolic multi-layer neural network for model recovery. They used convolutions with properly constrained filters to approximate differential operators and deep neural networks to approximate the nonlinear response. The intricate formulas are obtained via symbolic expansion of the well-trained network. This interpretation of physical laws results in larger candidate pools compared with the library-based representation of physics employed by SINDy. This type of approach has been demonstrated to be effective in tackling conservation law problems, as evidenced in previous research [53, 55], as well as ODEs [54].

Magiera et al. [64] investigated methods for building a reliable surrogate Riemann solver based on neural networks that incorporate conservation properties, such as mass or momentum conservation. In [3], Deniz et al. propose a data-driven physics-informed finite-volume scheme for the approximation of small-scale dependent shocks. A convolutional neural network combines local interpolation polynomials with a regularization term to form the adaptive nonlinear numerical flux in this numerical scheme. Reinforcement learning was utilized by Wang et al. in [88] to identify novel and potentially better data-driven solvers for conservation laws. A multilayer perceptron (MLP) is explored in [74]. It is trained offline using a supervised learning strategy and then employed as a universal troubled-cell indicator that can be used for generic conservation laws. In addition, a few recent studies have exploited machine learning to auto-discover conservation laws [60, 61, 67].

3 Method

In this section, we introduce our proposed method. Firstly, we present the two important components of the model: ECDNS and S-Net. Secondly, we illustrate how to combine ECDNS with S-Net in our model.

3.1 Entropy consistent discrete numerical scheme (ECDNS)

We discretize the spatial domain [0, L] into \(N_x\) points \(\{x_i\}_{i=1}^{N_x}\), where \(x_i=(i-1/2) \Delta x\) for \(i=1,\ldots ,N_x\), and \(\Delta x=L/N_x\). Additionally, we consider a set of time steps \(\{t^n\}_{n=0}^{N_t}\) with \(N_t\Delta t =T\). To discretize (1.1), we use the Rusanov scheme [52], which can be written as follows:

$$\begin{aligned} u_j^{n+1}&= u_j^n - \lambda (F_{j+1/2}^n - F_{j-1/2}^n) +\lambda (D_+A_{j+1/2}^n - D_+A_{j-1/2}^n), \quad \lambda =\frac{\Delta t}{\Delta x} \nonumber \\ u_1^{n+1}&=u_2^{n+1}, \quad u_{N_x}^{n+1}=u_{N_x-1}^{n+1} \end{aligned}$$
(3.1)

with \(j=2,\ldots , N_x-1\), where \(F_{j+1/2}^n\) and \(D_+A_{j+1/2}^n\) take the forms

$$\begin{aligned} F_{j+1/2}^n = \frac{f(u_j^n) + f(u_{j+1}^n)}{2} -\frac{M_{j+1/2}^n}{2} (u_{j+1}^n - u_j^n), \end{aligned}$$
(3.2)

and

$$\begin{aligned} D_+A^{n}_{j+1/2} = \frac{A(u_{j+1}^n)-A(u_j^n)}{\Delta x} =\frac{\int _{0}^{u^{n}_{j+1}}a(z)dz - \int _{0}^{u^{n}_j} a(z)dz }{\Delta x}. \end{aligned}$$
(3.3)

We adopt a local estimation by using \(M_{j+1/2}^n=\max \{|f'(u_{j}^n)|, |f'(u_{j+1}^n)|\}\). The CFL condition determines the magnitude of \(\Delta t\) for a given \(\Delta x\),

$$\begin{aligned} \text {CFL}:=\frac{\Delta t}{\Delta x} \left( M +\frac{2K_{}}{\Delta x}\right) \le 1, \qquad M=\max _u|f'(u)|, \qquad K_{}=\max _u a(u). \end{aligned}$$

We apply the CFL condition in Algorithm 1 to ensure numerical stability. To compute the solution \(U^n=\left\{ u(x_j,t^n)\right\} _{j=1}^{N_x}\) of the discrete scheme (3.1), we employ Algorithm 2.

Algorithm 1
figure a

CFL

Algorithm 2
figure b

DataGenerator
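To make the scheme concrete, the following sketch illustrates the two building blocks: a CFL-limited time step in the spirit of Algorithm 1 and one explicit update of the form (3.1)–(3.3) as used by Algorithm 2. It is our own minimal illustration (function names such as `cfl_time_step` and `rusanov_step` are ours, not the authors' released code), assuming vectorized callables for f, f', a, and A.

```python
import numpy as np

def cfl_time_step(f_prime, a, dx, u_grid, cfl_target=0.9):
    """Pick dt so that (dt/dx) * (M + 2K/dx) <= cfl_target < 1,
    with M = max |f'(u)| and K = max a(u) over the sampled u-values."""
    M = np.max(np.abs(f_prime(u_grid)))
    K = np.max(a(u_grid))
    return cfl_target * dx / (M + 2.0 * K / dx)

def rusanov_step(u, f, f_prime, A, dx, dt):
    """One explicit step of (3.1)-(3.3) for u_t + f(u)_x = A(u)_xx."""
    lam = dt / dx
    fu = f(u)
    # local coefficient M_{j+1/2} = max(|f'(u_j)|, |f'(u_{j+1})|)
    M = np.maximum(np.abs(f_prime(u[:-1])), np.abs(f_prime(u[1:])))
    F = 0.5 * (fu[:-1] + fu[1:]) - 0.5 * M * (u[1:] - u[:-1])   # eq. (3.2)
    Au = A(u)
    DA = (Au[1:] - Au[:-1]) / dx                                # eq. (3.3)
    u_new = u.copy()
    u_new[1:-1] = u[1:-1] - lam * (F[1:] - F[:-1]) + lam * (DA[1:] - DA[:-1])
    u_new[0], u_new[-1] = u_new[1], u_new[-2]                   # boundary cells in (3.1)
    return u_new
```

Repeatedly applying `rusanov_step` with a `cfl_time_step`-limited time step and storing the states at the observation times then plays the role of Algorithm 2 (DataGenerator).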

3.2 S-Net

To identify the unknown functions, we have opted for machine learning methods, which offer greater robustness and flexibility. In particular, we are using S-Nets, as employed in [62, 63, 65, 78]. Figure 2 depicts the structure of an S-Net with three hidden layers. The linear combination map uses the parameters \(\textbf{w}_{\textbf{1}}\) and \(\textbf{b}_{\textbf{1}}\) to produce two elements from u, denoted by \(\alpha _1\) and \(\beta _1\)

$$\begin{aligned} (\alpha _1, \beta _1)^T = \textbf{w}_{\textbf{1}} \cdot (u) + \textbf{b}_{\textbf{1}}, \textbf{w}_{\textbf{1}} \in \mathbb {R}^{2 \times 1}, \textbf{b}_{\textbf{1}} \in \mathbb {R}^{2 \times 1}. \end{aligned}$$
(3.4)

These two elements \(\alpha _1\) and \(\beta _1\) are multiplied, which gives

$$\begin{aligned} f_1 = \alpha _1 \beta _1. \end{aligned}$$
(3.5)

In addition to u, which is passed on by the identity map, \(f_1\) is also input to the second hidden layer

$$\begin{aligned} (\alpha _2, \beta _2)^T = \textbf{w}_{\textbf{2}} \cdot (u, f_1)^T + \textbf{b}_{\textbf{2}}, \textbf{w}_{\textbf{2}} \in \mathbb {R}^{2 \times 2}, \textbf{b}_{\textbf{2}} \in \mathbb {R}^{2 \times 1}. \end{aligned}$$
(3.6)

Similarly to the first hidden layer, we obtain another combination \(f_2\)

$$\begin{aligned} f_2 = \alpha _2 \beta _2. \end{aligned}$$
(3.7)

Then we obtain \(\alpha _3\) and \(\beta _3\) by means of \(\textbf{w}_{\textbf{3}}\) and \(\textbf{b}_{\textbf{3}}\) from u, \(f_1\) and \(f_2\)

$$\begin{aligned} (\alpha _3, \beta _3)^T = \textbf{w}_{\textbf{3}} \cdot (u, f_1, f_2)^T + \textbf{b}_{\textbf{3}}, \textbf{w}_{\textbf{3}} \in \mathbb {R}^{2 \times 3}, \textbf{b}_{\textbf{3}} \in \mathbb {R}^{2 \times 1}. \end{aligned}$$
(3.8)

\(f_3\), the product of \(\alpha _3\) and \(\beta _3\), is the output of the third hidden layer

$$\begin{aligned} f_3 = \alpha _3 \beta _3. \end{aligned}$$
(3.9)

Finally, we arrive at the analytic expression of the function f

$$\begin{aligned} f=\textbf{w}_{\textbf{4}} \cdot (u, f_1, f_2, f_3)^T + \textbf{b}_{\textbf{4}}, \textbf{w}_{\textbf{4}} \in \mathbb {R}^{1 \times 4}, \textbf{b}_{\textbf{4}} \in \mathbb {R}. \end{aligned}$$
(3.10)

The weight and bias parameters of the S-Net for this example are \({\theta } = \{\textbf{w}_{\textbf{1}}, \textbf{w}_{\textbf{2}}, \textbf{w}_{\textbf{3}}, \textbf{w}_{\textbf{4}}, \textbf{b}_{\textbf{1}},\) \( \textbf{b}_{\textbf{2}}, \textbf{b}_{\textbf{3}}, \textbf{b}_{\textbf{4}}\}\). The core of the method is the appropriate choice of \({\theta }\). If the S-Net has n hidden layers, it is capable of learning a polynomial of order \(2^n\). The complexity of the S-Net is determined by the weights and biases across its structure. Specifically, the number of weight parameters \(\textbf{w}\) is \(2 \times 1 + 2 \times 2 + 2 \times 3 + \cdots + 2 \times n + (n + 1)\), while the number of bias parameters \(\textbf{b}\) is \(2n + 1\) (two per hidden layer plus one for the output layer). This structure highlights the exponential increase in the order of the polynomial that can be learned, directly correlating with the depth of the network through its hidden layers. In theory, increasing the number of hidden layers enhances the model’s representational flexibility. Yet, an excess of hidden layers complicates optimization. Thus, we aim for a minimal number of hidden layers while maintaining sufficient expressive capability. In this paper, the focus is on flux functions that can be approximated well within the class of polynomial functions. However, to enhance the learning capabilities, we can readily incorporate additional operators, such as sine and cosine. This is among the reasons why we opted for the S-Net in our study.
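As an illustration, an S-Net of the form (3.4)–(3.10) can be sketched in PyTorch roughly as follows; this is our own minimal reading of the construction (class name and implementation details are ours), not the authors' released code.

```python
import torch
import torch.nn as nn

class SNet(nn.Module):
    """Symbolic network: hidden layer k maps (u, f_1, ..., f_{k-1}) linearly to
    (alpha_k, beta_k) and appends their product f_k = alpha_k * beta_k."""
    def __init__(self, n_hidden=3):
        super().__init__()
        self.hidden = nn.ModuleList([nn.Linear(k, 2) for k in range(1, n_hidden + 1)])
        self.out = nn.Linear(n_hidden + 1, 1)        # final linear readout, eq. (3.10)

    def forward(self, u):
        feats = u.reshape(-1, 1)                     # start from (u,)
        for layer in self.hidden:
            alpha, beta = layer(feats).unbind(dim=1)
            feats = torch.cat([feats, (alpha * beta).unsqueeze(1)], dim=1)
        return self.out(feats).squeeze(1)

f_theta = SNet(n_hidden=6)                           # e.g. a candidate flux function
print(f_theta(torch.linspace(0.0, 1.0, 5)).shape)    # torch.Size([5])
```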

Fig. 2
figure 2

The structure of S-Net with three hidden layers. \(\alpha _i, \beta _i\) are terms with respect to u, obtained by the parameters \(w_i, b_i, i=1, 2, 3\)

Besides S-Nets, there are classical methods that explicitly represent functions, such as the piecewise affine functions method described in [24]. In the piecewise affine functions method, the u-axis is divided into n equidistant intervals: let \(u^{max}\) be the largest value of the data and set \(u_k = k u^{max}/n\) for \(k = -1,\ldots,n+1\). Assume that

$$\begin{aligned} \hat{f}(u) = \sum _{k=0}^{n}f_k\psi _k(u) \end{aligned}$$
(3.11)

where the \(f_k\) are \(n + 1\) parameters to be determined, and the hat functions are defined by

$$\begin{aligned} \psi _k(u) = {\left\{ \begin{array}{ll} \frac{u - u_{k-1}}{u_k - u_{k-1}}, &{} u_{k-1}< u \le u_{k} \\ \frac{u_{k+1} - u}{u_{k+1} - u_{k}}, &{} u_{k} < u \le u_{k+1} \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3.12)

where \(k = 0,..., n\). In this work, we employ the piecewise affine functions method as a means of comparison.
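A minimal sketch of evaluating the representation (3.11)–(3.12) is given below; since the hat functions form a nodal basis on the equidistant grid \(u_k\), the sum reduces to linear interpolation between the coefficients (the helper name `hat_eval` is ours).

```python
import numpy as np

def hat_eval(u, f_coeffs, u_max):
    """Evaluate f_hat(u) = sum_k f_k * psi_k(u) with nodes u_k = k*u_max/n,
    i.e. piecewise linear interpolation of the coefficients f_0, ..., f_n."""
    n = len(f_coeffs) - 1
    nodes = np.linspace(0.0, u_max, n + 1)
    return np.interp(u, nodes, f_coeffs)

# example with n = 10 and coefficients sampled from u^2
coeffs = np.linspace(0.0, 1.0, 11) ** 2
print(hat_eval(np.array([0.05, 0.5, 0.95]), coeffs, u_max=1.0))
```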

3.3 The model

The model combines two classes of S-Nets, S-Net\(_{f}\) and S-Net\(_{a}\), with the ECDNS to learn f(u) and a(u). We employ a single S-Net, denoted \(f_{\theta }(u)\), to learn f(u) across the entire range \(u \in [0, 1]\). However, we utilize \(N_a\) S-Nets distributed across the intervals \(\{[\frac{k}{N_a}, \frac{k+1}{N_a}) \mid k = 0, 1,\ldots, N_a-1 \}\) to learn a(u), designated \(a_{\theta }(u)\),

$$\begin{aligned} a_{\theta }(u) = \sum _{k=0}^{N_a-1}a_{\theta _k}(u)I_k(u), \end{aligned}$$
(3.13)

where

$$\begin{aligned} I_k(u) = {\left\{ \begin{array}{ll} 1, &{} \frac{k}{N_a} \le u < \frac{k+1}{N_a} \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3.14)
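A small sketch of the interval-wise composition (3.13)–(3.14) is shown below; each sub-function can be any callable, for instance an S-Net as in Sect. 3.2, and the helper name `a_theta` is ours.

```python
import torch

def a_theta(u, sub_nets, N_a):
    """a_theta(u) = sum_k a_{theta_k}(u) * I_k(u): sub-network k is only
    active on its interval [k/N_a, (k+1)/N_a), cf. (3.13)-(3.14)."""
    out = torch.zeros_like(u)
    for k, net in enumerate(sub_nets):
        mask = (u >= k / N_a) & (u < (k + 1) / N_a)
        if k == N_a - 1:
            mask = mask | (u == 1.0)       # include the right endpoint u = 1
        out = out + torch.where(mask, net(u), torch.zeros_like(u))
    return out

# example with N_a = 4 piecewise constant "sub-networks"
subs = [lambda v, c=c: torch.full_like(v, c) for c in (0.0, 0.0, 0.5, 1.0)]
print(a_theta(torch.linspace(0.0, 1.0, 5), subs, N_a=4))
```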

In order to identify unknown equations, a commonly used approach is to rely on observed data in the form of solutions u(xt) to the equations, as demonstrated in prior work [24, 53, 55, 62, 63]. In this work, we supplement this standard observational dataset with another set of observations obtained from particle trajectories.

  • Observation Data I (standard). We consider observation data in terms of x-dependent data at fixed times \(\{t_i^*\}_{i=1}^{N_{\textrm{obs}}}\) extracted from the solution U as follows:

    $$\begin{aligned} U_{\textrm{sub}} =\Bigl \{u(x_j,t_1^*),u(x_j,t_2^*), \ldots ,u(x_j,t_{N_{\textrm{obs}}}^*)\Bigr \},\qquad j=1,\ldots ,N_x. \end{aligned}$$
    (3.15)

    Equation (3.15) is utilized to select synthetic observation data, denoted \(U_{\textrm{sub}}\), as well as the corresponding predictions based on the learned \(f_{ \theta _f}(u)\) and \(a_{\theta _a}(u)\), written \(\hat{U}_{\textrm{sub}}\). We specify the times for collecting the time-dependent data

    $$\begin{aligned} T_{\textrm{obs}}=\{t_i^{*}=i \Delta t^{\textrm{obs}}: i=1, \ldots , N_{\textrm{obs}}\}. \end{aligned}$$
    (3.16)

    Typically, we set \(N_{x}=200\) and \(N_{\textrm{obs}}=9\) with \(\Delta t^{\textrm{obs}}=0.1\).

  • Observation Data II (particle trajectories).

    In addition to studying the conservation law (1.1) from the standard perspective, we also investigate the same physical system from a different vantage point. Specifically, we consider a scenario where the system consists of particles that move with the flow field described by (1.1). Instead of examining the solution u(x, t) as observation data, we focus on the trajectories of these particles, as has been suggested in the context of traffic flow for a scalar nonlinear conservation law [17, 58, 75]. The classical traffic flow model takes the following form

    $$\begin{aligned} u_t + (uw(u))_x = 0. \end{aligned}$$
    (3.17)

    The flux function in this case is \(f(u) = uw(u)\). The work [17] suggests studying an ODE of the form \(\dot{z}(t)=w(u(z(t),t))\), for a suitable known speed function w and a prescribed initial position, to represent the path z(t) of one driver that travels along the road modelled by (3.17). The authors demonstrate the existence and stability of particle trajectories in relation to their initial positions. Additionally, [25] establishes several novel findings regarding the stability of particle trajectories when subjected to variations in the initial field. We may apply the same approach based on the model (1.1). Let us suppose that a single particle j begins at time \(t_0\) from a location \(x_{j}^0\), and let \(z_{j}(t)\) denote its position at time t. The velocity field of the flow is represented by a function w whose functional form is assumed known. Without loss of generality, we will use \(w(u)=u\) in the following. Then \(z_{j}(t)\) satisfies the following equation

    $$\begin{aligned} \dot{z_{j}}(t) = w(u(z_j(t), t)), \qquad j=1,\ldots ,N_{\textrm{par}} \end{aligned}$$
    (3.18)

    where u(x, t) is the entropy solution of (1.1) as captured by the entropy satisfying scheme (3.1)–(3.3). Equation (3.18) can be interpreted as an ordinary differential equation (ODE) for \(z_{j}(t)\). Taking the initial position \(x_{j}^0\) of particle j into account, we have the following Cauchy initial data for (3.18):

    $$\begin{aligned} z^0_{j}(t=t_0) = x_{j}^0, \qquad j=1,\ldots ,N_{\textrm{par}}. \end{aligned}$$
    (3.19)

    To discretize (3.18), we use the Euler method [2], which can be written as follows:

    $$\begin{aligned} z_{j}^{n+1} = z_{j}^{n} + w(u(z_j^{n}, t^{n})) \Delta t^{ode} \end{aligned}$$
    (3.20)

    where \(\Delta t^{ode}\) refers to the time step used to solve the ODE. Our observational data are the trajectories of \(N_{par}\) particles. Specifically, we will denote the initial positions of these \(N_{par}\) particles at time \(t_0\) by

    $$\begin{aligned} Z^{0}=\Bigl \{x^{0}_1, x^{0}_2, ..., x^{0}_{N_{par}} \Bigr \} = \Bigl \{z^{0}_1, z^{0}_2, ..., z^{0}_{N_{par}} \Bigr \}. \end{aligned}$$
    (3.21)

    Based on the solution U derived from Algorithm 2, we interpolate at the positions \(Z^{0}\) to obtain the corresponding solution values \(U^{Z^{0}}\) at time \(t_0\). According to (3.20), we obtain the particle positions at the subsequent time point \(t_0 + \Delta t^{ode}\),

    $$\begin{aligned} Z^{1}=Z^{0} + w(U^{Z^{0}}) \Delta t^{ode}. \end{aligned}$$
    (3.22)

    Additionally, by interpolating position \(Z^{1}\), we derive the corresponding solution \(U^{Z^{1}}\) at the time \(t_0 + \Delta t^{ode}\). Hence, we can get the particle positions at the subsequent time point \(t_0 + 2 \Delta t^{ode}\)

    $$\begin{aligned} Z^{2}=Z^{1} + w(U^{Z^{1}}) \Delta t^{ode}. \end{aligned}$$
    (3.23)

    It should be noted that we set \(\Delta t^{ode} = \Delta t\) obtained from Algorithm 2, i.e., we use the same time step as for the evolution of the discrete scheme that solves (1.1). We also set \(t_0=0\). This iterative process proceeds, enabling us to derive the trajectories of all particles

    $$\begin{aligned} Z = \Bigl \{Z^i = \{z^{i}_1, z^{i}_2, ..., z^{i}_{N_{par}}\} | i = t_0 + \hat{i} \Delta t^{\textrm{ode}}, \hat{i}=0,1,2,... \Bigr \}. \end{aligned}$$
    (3.24)

    For the evaluation of \(U^{Z^n}\) associated with time \(t^n=n\Delta t\) based on the discrete solution \(\{u_j^n\}\) generated by Algorithm 2, we use linear interpolation. We consider observation data at fixed times \(\{t_i^*\}_{i=1}^{N_{\textrm{obs}}}\) as given by (3.16) extracted from Z as follows:

    $$\begin{aligned} Z_{\textrm{sub}} = \Bigl \{Z^{t_1^*},Z^{t_2^*},\ldots , Z^{t_{N_{\textrm{obs}}}^*}\Bigr \}. \end{aligned}$$
    (3.25)

    Equation (3.25) is utilized to select synthetic observation data, denoted \(Z_{\textrm{sub}}\), as well as the corresponding predictions based on the learned \(f_{ \theta _{f}}(u)\) and \(a_{\theta _{a}}(u)\), written \(\hat{Z}_{\textrm{sub}}\). A small sketch of how such trajectory data can be generated is given after this list.
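The following is our own minimal illustration of how particle trajectory data of the form (3.20)–(3.24) can be generated (helper names are ours); `U` denotes the grid solution produced by Algorithm 2 at the cell centers `x_cells`.

```python
import numpy as np

def particle_trajectories(U, x_cells, dt, z0, w=lambda v: v):
    """Forward Euler integration (3.20) of dz_j/dt = w(u(z_j, t)),
    with linear interpolation of the grid solution at the particle positions.
    U has shape (N_t + 1, N_x); returns Z of shape (N_t + 1, N_par)."""
    Z = np.empty((U.shape[0], len(z0)))
    Z[0] = np.asarray(z0, dtype=float)
    for n in range(U.shape[0] - 1):
        u_at_particles = np.interp(Z[n], x_cells, U[n])   # U^{Z^n} in (3.22)-(3.23)
        Z[n + 1] = Z[n] + w(u_at_particles) * dt
    return Z

# example initial positions as in (4.4): z0 = np.arange(10) + 0.025
```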

Standard Observation (Data I) and Particle Trajectories (Data II) represent two data sets that study conservation laws from different perspectives. Furthermore, Particle Trajectories utilize a significantly smaller amount of data for model training compared to Standard Observation data. Specifically, with the same number of observation time points, Standard Observation data employs \(N_{x}=200\) grid points, whereas Particle Trajectories utilize only \(N_{par}=10\) particle trajectories.

During model training, \(f_{\theta _{f}}(u)\) and \(a_{\theta _{a}}(u)\), in place of f(u) and a(u), are fed into Algorithm 2 to obtain the predicted solution \(\hat{U}\). As with \(U_{sub}\) and \(Z_{sub}\), we select the predicted solutions \(\hat{U}_{sub}\) and \(\hat{Z}_{sub}\) according to (3.15) and (3.25). The difference between \(U_{sub}\) (\(Z_{sub}\)) and \(\hat{U}_{sub}\) (\(\hat{Z}_{sub}\)), denoted \(L^{data}\), serves as the loss function,

$$\begin{aligned} L^{data} =\frac{1}{K_{init} N_x N_{obs}} \sum _{k = 1}^{K_{init}} \sum _{j = 1}^{N_x} \sum _{i = 1}^{N_{obs}} \left( U_{\textrm{sub},k}(x_j,t_i^*) -\hat{U}_{\textrm{sub},k}(x_j,t_i^*) \right) ^{2}, \end{aligned}$$
(3.26)

or

$$\begin{aligned} L^{data} =\frac{1}{K_{init} N_{par} N_{obs}} \sum _{k = 1}^{K_{init}} \sum _{j = 1}^{N_{par}} \sum _{i = 1}^{N_{obs}} \left( Z_{\textrm{sub},k}(x_j,t_i^*) - \hat{Z}_{\textrm{sub},k}(x_j,t_i^*) \right) ^{2}. \end{aligned}$$
(3.27)

To update the parameter vectors \(\theta _{f}\) and \(\theta _{a}\), we employ the second-order quasi-Newton method, L-BFGS-B [77, 90]. Our implementation leverages the PyTorch machine learning framework, simplifying the process to merely invoking the LBFGS() function provided within PyTorch. By inputting our loss function into LBFGS(), we efficiently optimize the model’s parameters. The training process of this model based on standard observation data is depicted in Algorithm 3.

Algorithm 3
figure c

ConsLaw-Net
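For illustration, a stripped-down version of such a training loop with PyTorch's LBFGS optimizer could look as follows. This is a rough sketch of the idea behind Algorithm 3, not the authors' exact implementation; in particular, `simulate` stands for a differentiable variant of Algorithm 2 with \(f_{\theta_f}\) and \(a_{\theta_a}\) plugged in, returning the snapshots selected by (3.15).

```python
import torch

def train(f_theta, a_theta, simulate, U_obs_list, n_outer=50):
    """Fit the network parameters by minimizing the data misfit (3.26) with L-BFGS.
    U_obs_list[k] holds the observed snapshots for initial state u_k^0."""
    params = list(f_theta.parameters()) + list(a_theta.parameters())
    opt = torch.optim.LBFGS(params, max_iter=20, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = torch.tensor(0.0)
        for k, U_obs in enumerate(U_obs_list):
            U_pred = simulate(f_theta, a_theta, k)          # predicted snapshots
            loss = loss + torch.mean((U_obs - U_pred) ** 2)
        loss = loss / len(U_obs_list)
        loss.backward()
        return loss

    for _ in range(n_outer):
        opt.step(closure)
    return f_theta, a_theta
```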

4 Experiment 1

This section explores a group of nonlinear conservation laws that arise naturally from the vertical displacement of one fluid by another [82]. In this case the ground truth flux function f(u) is given by [53]

$$\begin{aligned} f(u) = \frac{1}{2}u(3 - u^2)+\frac{300}{12}u^2 \left( \frac{3}{4} - 2u + \frac{3}{2}u^2 - \frac{1}{4}u^4 \right) . \end{aligned}$$
(4.1)

Our focus is on examining identification of two distinct diffusion functions, namely \(a_1(u)\) given by

$$\begin{aligned} a_1(u)={\left\{ \begin{array}{ll} 0, &{} \text {if } u \in [0, 0.5] \\ 1.0, &{} \text {otherwise} , \end{array}\right. } \end{aligned}$$
(4.2)

and \(a_2(u)\) given by

$$\begin{aligned} a_2(u)={\left\{ \begin{array}{ll} 0, &{} \text {if } u \in [0, 0.5] \\ (u-0.5)u, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(4.3)

with \(\alpha =0.5\). In [22], Diaz-Adame et al. present fast and optimal WENO schemes for degenerate parabolic conservation laws with similar diffusion functions.

We examine a one-dimensional spatial domain of length \(L=10\), where \(x \in [0,10]\), and a time interval [0, T] with \(T=2\). We gather observation data in the form of (3.15) and (3.25), with \(N_{\textrm{obs}}=9\), \(\Delta t^{\textrm{obs}}=0.1\), and a numerical grid consisting of \(N_{x}=200\) grid cells. The goal is to identify the unknown flux function f(u) and diffusion function a(u) for \(u\in [0,1]\). To learn f(u) and a(u), we consider a set of initial states \(\{u_{k}^{0}\}_{k=1}^{K_{init}}\) such that \(0\le u_k^{0}(x)\le 1\). Table 1 displays the initial states utilized in the following experiments.

Table 1 The initial states

To represent unknown functions using S-Nets, we employ an S-Net to learn the flux function f(u). We evaluate the effectiveness of representing a(u) using either one or more S-Nets or piecewise affine functions. When using multiple S-Nets, we partition the u-axis into four equidistant intervals: \(u \in [0, 0.25)\), \(u \in [0.25, 0.5)\), \(u \in [0.5, 0.75)\), and \(u \in [0.75, 1.0)\), and train four distinct S-Nets on each interval. The initial values of the neural networks are randomly generated. In addition, we set \(n=10\) in (3.11) for the piecewise affine function approach.

4.1 Identification of f(u) and \(a_1(u)\) using standard observations (Data I) with a single S-Net for representing \(a_1(u)\)

Figure 3a and b depict the learned \(f_{\theta _f^*}(u)\) and \({a_1}_{\theta _a^*}(u)\), respectively. Since only the derivative of \(f_{\theta _f^{*}}(u)\) matters for identifying f(u), we plot the translated function \(f_{\theta _f^{*}}(u)-f(0)\). The learned \({A_1}_{\theta _a^*}(u)\) based on \({a_1}_{\theta _a^*}(u)\) is presented in Fig. 3c. Notably, the predicted f(u) deviates significantly from the true value for \(u \in [0.6, 1.0]\), and there is a significant difference between the actual and learned functions for both \(a_1(u)\) and \(A_1(u)\). Therefore, the piecewise nature of \(a_1(u)\) makes it unsuitable for representation by a single S-Net, since an S-Net learns only smooth functions.

Fig. 3
figure 3

The graphical result of identification of flux function \(f_{\theta _f^*}(u)\), diffusion function \({a_1}_{\theta _a^*}(u)\) and \({A_1}_{\theta _a^*}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

4.2 Identification of f(u) and \(a_1(u)\) using standard observations (Data I) with four S-Nets for representing \(a_1(u)\)

To overcome the limitations of using a single S-Net to represent \(a_1(u)\), we employed multiple S-Nets to learn \(a_1(u)\). Specifically, we divided the u-axis into four intervals, namely \(u \in [0, 0.25)\), \(u \in [0.25, 0.5)\), \(u \in [0.5, 0.75)\), and \(u \in [0.75, 1.0)\), and allocated one S-Net with a single hidden layer for each interval. In contrast, the S-Net used to represent f(u) consisted of six hidden layers.

Figure 4a–c display the learned functions \(f_{\theta _f^*}(u)\), \({a_1}_{\theta _a^*}(u)\) and \({A_1}_{\theta _a^*}(u)\), respectively. The learned \(f_{\theta _f^{*}}(u)\) approximates the true function f(u) closely, but for the diffusion function \(a_1(u)\), noticeable discrepancies can be observed between the learned and the true functions on \(u \in [0.5, 0.75)\) and \(u \in [0.75, 1.0)\). However, since the primitive form \(A_1(u)\) is employed in the numerical scheme, we examine the learned function \({A_1}_{\theta _a^*}(u)\) based on \({a_1}_{\theta _a^*}(u)\) in Fig. 4c, which fits the true function \(A_1(u)\) remarkably well. Clearly, utilizing multiple S-Nets to learn \(a_1(u)\) can improve overall model performance. In addition, we utilized the learned functions to compute solutions of (1.1), and the comparison of the true and predicted solutions based on the initial states \(u^{0}_{1}\) and \(u^{0}_{5}\) in Table 1 is presented in Fig. 5. These results demonstrate the ability of the proposed method to accurately identify unknown functions based on standard observation data.

Fig. 4
figure 4

The graphical result of identification of flux function \(f_{\theta _f^*}(u)\), diffusion function \({a_1}_{\theta _a^*}(u)\) and \({A_1}_{\theta _a^*}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

Fig. 5
figure 5

u(x, t) at \(t=2.0\) based on the initial states \(u_1^0\) and \(u_5^0\) in Table 1. The left and right subplots are based on \(u_1^0\) and \(u_5^0\), respectively

4.3 Identification of f(u) and \(a_2(u)\) using standard observations (Data I) with four S-Nets for representing \(a_2(u)\)

We learn f(u) and \(a_2(u)\) using the same experimental setting as in Sect. 4.2. Figure 6a–c show the learned functions \(f_{\theta _f^*}(u)\), \({a_2}_{\theta _a^*}(u)\) and \({A_2}_{\theta _a^*}(u)\). While \(f_{\theta _f^*}(u)\) is very similar to f(u), there are significant differences between \({a_2}_{\theta _a^*}(u)\) and \(a_2(u)\) for \(u \in (0.75, 1.0)\). These differences also affect the learning of \({A_2}_{\theta _a^*}(u)\) as compared to \({A_1}_{\theta _a^*}(u)\) in Sect. 4.2. Even with an increase in observation data, we are unable to achieve a good result for \(a_2(u)\). However, the error associated with \({a_2}_{\theta _a^*}(u)\) or \({A_2}_{\theta _a^*}(u)\) does not appear to have much impact on the predicted solutions, as demonstrated in Fig. 7. This indicates why there is room for this error in the identification of \(A_2(u)\). In Sect. 4.6 we discuss further why this happens.

Fig. 6
figure 6

The graphical result of identification of flux function \(f_{\theta _f^*}(u)\), diffusion function \({a_2}_{\theta _a^*}(u)\) and \({A_2}_{\theta _a^*}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

Fig. 7
figure 7

u(x, t) at \(t=2.0\) based on the initial states \(u_1^0\) and \(u_5^0\) in Table 1. The left and right subplots are based on \(u_1^0\) and \(u_5^0\), respectively

4.4 Identification of f(u) and \(a_1(u)\) using standard observations (Data I) with the piecewise affine functions for representing f(u) and \(a_1(u)\)

In this section, we replicate the experimental settings of Sect. 4.1 with the exception of the representation of unknown functions which is now based on affine functions as given by (3.11) and (3.12). We set \(n = 10\) in (3.11). We present the learned functions of \(\hat{f}(u)\), \(\hat{a_1}(u)\), and \(\hat{A_1}(u)\) in Fig. 8a–c, respectively. Additionally, Fig. 9 displays the real and predicted solutions based on the initial states \(u^{0}_{1}\) and \(u^{0}_{5}\) in Table 1.

While \(\hat{f}(u)\) generally fits f(u) well, there is a significant discrepancy for \(u\in (0.7, 1.0)\). In addition, since f(u) is learned by piecewise linear functions, \(\hat{f}(u)\) involves some non-smooth points. These non-smooth points have a large impact on the solutions, causing the volatility observed in Fig. 9a and b. \(\hat{a_1}(u)\) differs significantly from \(a_1(u)\), especially for \(u\in (0.5, 1.0)\), see Fig. 8b. As \(A_1(u)\) is an integral of \(a_1(u)\), the difference between \(A_1(u)\) and \(\hat{A_1}(u)\) is less apparent than that between \(a_1(u)\) and \(\hat{a_1}(u)\). When comparing the optimization times, the method utilizing piecewise affine functions required 1879 s to complete, whereas the S-Net method took 4901 s. This underscores that achieving good results often necessitates more computing time.

Fig. 8
figure 8

The graphical result of identification of flux function \(\hat{f}(u)\), diffusion function \(\hat{a}(u)\) and \(\hat{A}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

Fig. 9
figure 9

u(x, t) at \(t=2.0\) based on the initial states \(u_1^0\) and \(u_5^0\) in Table 1. The left and right subplots are based on \(u_1^0\) and \(u_5^0\), respectively

4.5 Identification of f(u) and \(a_1(u)\) using particle trajectories (Data II) with four S-Nets for representing \(a_1(u)\)

In this experiment, we use a six-layer S-Net to represent f(u), while the S-Nets used for a(u) have only one hidden layer. We select ten particles and use their trajectories as observation data. The initial positions of these particles at \(t=0\) are chosen as follows

$$\begin{aligned} Z^{0} = \Bigl \{0.025, 1.025, 2.025, 3.025, 4.025, 5.025, 6.025, 7.025, 8.025, 9.025 \Bigr \}. \end{aligned}$$
(4.4)

Figure 10 shows the trajectories of different particles during the observation period under different initial states. The three sub-pictures on the left, middle, and right show the real and predicted trajectories of the particles with initial positions 7.025, 4.025, and 2.025 under the initial states \(u_1^0\), \(u_2^0\) and \(u_6^0\), respectively.

Fig. 10
figure 10

The trajectories of different particles in the observation period under different initial states. In each subgraph, the red solid line represents the real trajectory, and the blue dashed line represents the learned trajectory (color figure online)

Figures 11a–c display the learned functions \(f_{\theta _f^*}(u)\), \({a_1}_{\theta _a^*}(u)\) and \({A_1}_{\theta _a^*}(u)\). While there is a slight difference between \(f_{\theta _f^{*}}(u)\) and f(u) for \(u\in [0.8, 1.0]\), the approximation is generally close. For \(a_1(u)\), noticeable discrepancies can be observed between the learned and true functions for \(u \in [0.5, 0.75)\) and \(u \in [0.75, 1.0)\). However, \({A_1}_{\theta _a^*}(u)\) based on \({a_1}_{\theta _a^*}(u)\) generally fits the true function \(A_1(u)\) well, except for some smaller fluctuations in \(u \in (0.7, 0.8) \cup (0.9, 1.0)\). The predicted solutions in Fig. 12 demonstrate that these small deviations in \(f_{\theta _f^*}(u)\), \({a_1}_{\theta _a^*}(u)\) and \({A_1}_{\theta _a^*}(u)\) have little effect on the predicted solutions. In this section, we use only 5% of the data used in the previous section, but we still get the same good results as seen in Sect. 4.2.

Fig. 11
figure 11

The graphical result of identification of flux function \(f_{\theta _f^*}(u)\), diffusion function \({a_1}_{\theta _a^*}(u)\) and \({A_1}_{\theta _a^*}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

Fig. 12
figure 12

u(x, t) at \(t=2.0\) based on the initial states \(u_1^0\) and \(u_5^0\) in Table 1. The left and right subplots are based on \(u_1^0\) and \(u_5^0\), respectively

4.6 Discussion

4.6.1 Analyzing mean and maximum errors in f(u), A(u), and u(x, T)

We discretize \(u \in [0,1]\) into \(N_u + 1\) points \(\{u_i\}_{i=0}^{N_u}\), where \(u_i = i \Delta u\) for \(i=0,\ldots ,N_u\), and \(\Delta u=1/N_u\). Let \(N_u =400 \). We then proceed to quantitatively evaluate the error between the learned and true functions using two specific metrics:

$$\begin{aligned} Mean&= \frac{1}{N_u + 1} \sum _{i=0}^{N_u}|f(u_i)-f_{\theta _f^*}(u_i)| \end{aligned}$$
(4.5)
$$\begin{aligned} Max&= \max _{0 \le i \le N_u}|f(u_i)-f_{\theta _f^*}(u_i)| \end{aligned}$$
(4.6)

Similarly to f(u), we also compute the mean and maximum values of the error for the function A(u). Additionally, we compute the mean and maximum values of the absolute difference between the true and predicted solutions at the final time point T:

$$\begin{aligned} Mean&= \frac{1}{N_x} \sum _{i=1}^{N_x}|u(x_i, T) - \hat{u}(x_i, T)| \end{aligned}$$
(4.7)
$$\begin{aligned} Max&= \max _{1 \le i \le N_x}|u(x_i, T) - \hat{u}(x_i, T)|, \end{aligned}$$
(4.8)

where \(x_i\) refers to the centers of the \(N_x\) grid cells and u and \(\hat{u}\) are the true and predicted solutions, respectively. In Table 3, we analyze the mean and maximum errors in f(u), A(u), and u(x, T) learned in Sects. 4.1, 4.2, 4.3, 4.4 and 4.5.
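These metrics can be computed directly from samples of the true and learned functions; a minimal sketch (assuming vectorized callables, with names of our choosing) is:

```python
import numpy as np

def mean_max_error(g_true, g_learned, N_u=400):
    """Mean and maximum absolute error, cf. (4.5)-(4.6), on a uniform u-grid."""
    u = np.linspace(0.0, 1.0, N_u + 1)
    err = np.abs(g_true(u) - g_learned(u))
    return err.mean(), err.max()

# example: compare u^2 with a slightly perturbed version
print(mean_max_error(lambda u: u**2, lambda u: u**2 + 0.01 * np.sin(10 * u)))
```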

4.6.2 Robustness analysis of the model

The flux and diffusion functions are indeed independent of the initial data. In our study, we consistently apply the same set of initial states (see Table 1) across the various sub-experiments. The reason is that the data derived from these initial states encompass a broad spectrum of information pertinent to the unidentified functions. If an alternative set of initial states and their corresponding observations equally encapsulate the essence of these unknown functions, they too could serve as viable starting points for model training, promising similarly satisfactory outcomes. The distribution of the observations is often used to judge whether they reflect enough information about the unknown functions [57]. To demonstrate this point, we use the alternative set of initial states shown in Table 2. The row labeled “learned functions in Sect. 4.2” in Table 3 (see 2nd row) showcases the outcomes from models trained using observations derived from Table 1. When comparing these results to those obtained from the initial states in Table 2 (see 6th row), the error margins for both sets in terms of the functions f(u), A(u), and the solutions u(x, T) are comparable and fall within an acceptable range.

We adjust the parameters from \(L= 10\), \(T=2.0\) to \(L= 20\), \(T=5.0\), while maintaining the same settings as described in Sect. 4.2. The outcomes of these adjustments are detailed in the row marked “Learned functions with changed L and T” in Table 3 (see 7th row). The learned f(u) is more accurate than that achieved in Sect. 4.2, whereas the learned A(u) is less accurate; as before, f(u) has a stronger influence on the solutions than A(u). The errors associated with u(x, T) in the two scenarios are comparable, indicating the algorithm’s resilience to changes in T and L.

We randomly added 3% noise to the observation data generated from the initial states in Table 1 and then trained the model on the noisy data. The result is shown in the row labeled “Learned functions with adding 3% noise to observations” in Table 3. f(u) and A(u) are slightly affected by the noise, but the effect on u(x, T) is ultimately small and within acceptable limits. If the noise is large, it has a relatively large impact on the results of the model, as shown in the row labeled “Learned functions with adding 9% noise to observations”. In this case, we can mitigate the effect of noise on the model by increasing the number of observations [57].

Selecting the optimal number of layers involves a hyperparameter tuning process, guided by the principle of model simplicity. We incrementally adjust the layer count from low to high until the loss function’s final value ceases to improve. It is important to note that excessively increasing the number of layers can deteriorate training outcomes due to the complexities introduced in the optimization during parameter updates. In Sect. 4.2, the S-Net representing f(u) featured six hidden layers. We now employ seven hidden layers to learn f(u), keeping all other settings consistent with Sect. 4.2. The outcomes are detailed in the row labeled “Learned functions with seven hidden layers”. Further increasing the number of layers is unlikely to benefit the model significantly.

In Sect. 4.5, the function \(w(u)=u\) was utilized. In this instance, we evaluate \(w(u) = u^2\), maintaining all other experimental parameters as described in Sect. 4.5. The outcomes are presented in the row titled “Learned functions with \(w(u) = u^2\).” The performance of these models is nearly identical to those discussed in Sect. 4.5.

Table 2 The initial states
Table 3 Analyzing mean and maximum errors in f(u), A(u), and u(x, T)

4.6.3 The impact of the diffusion functions on solutions

Through the above experiments, we have observed that the learning of f(u) is better than that of a(u) or A(u). While poorer learning of a(u) seems to have little impact on the predicted solution, as seen in the experiment in Sect. 4.3, poor learning of f(u) can cause significant fluctuations in the solution, as shown in the experiment in Sect. 4.4. Additionally, the learning of \(a_2\) (\(A_2\)) differs significantly from that of \(a_1\) (\(A_1\)), even under the same experimental settings, as seen in the experiments in Sects. 4.2 and 4.3. These experimental observations have motivated us to investigate further the relative impact of f(u) and a(u) on the solution of (1.1).

To demonstrate the impact of the diffusion function on the solution, we vary the value of \(\alpha \) in \(A_1(u) =\alpha \int _{0}^{u} a_1(v) dv\). Setting \(\alpha =0.0\) corresponds to omitting the diffusion function in (1.1). The solution of (1.1) at \(t = 2.0\) based on the initial states \(u^0_1\) and \(u^0_5\) with different \(\alpha \) is displayed in Fig. 13. The red line shows the solution when there is no diffusion function a(u). Increasing \(\alpha \) gradually enhances the influence of \(A_1(u)\) on the solutions. However, despite increasing \(\alpha \) to 1.0, the solutions of the equation remain largely unaffected. Taking Fig. 13a as an example, the diffusion function weakly affects the solutions only in the range of \(x \in (7.0, 10.0)\). This may indicate why the learning effect of f(u) is superior to that of a(u) or A(u). Even if the learning effect of a(u) or A(u) is poor, it does not have a substantial influence on the solutions.

Fig. 13
figure 13

u(x, t) at \(t=2.0\) based on the initial states \(u^0_1\) and \(u^0_5\) in Table 1 with different values of \(\alpha \). The left and right subplots are based on \(u_1^0\) and \(u_5^0\), respectively

To investigate why the learning effect of \(a_2(u)\) or \(A_2(u)\) is inferior to that of \(a_1(u)\) or \(A_1(u)\), we plotted the data for several observation time points (\(t = 0.3, 0.5, 0.8\)) in Fig. 14. In each graph, the red and blue lines denote the inclusion and exclusion of the diffusion function (\(a_1(u)\) or \(a_2(u)\)) in (1.1), respectively, and the shaded area illustrates the difference between them. The observation data in the top row is utilized to learn \(a_1(u)\), whereas the observation data in the bottom row is employed to learn \(a_2(u)\). The shaded area reflects that \(a_1(u)\) has a greater effect (see upper row) than \(a_2(u)\) (see lower row) on the solutions.

Fig. 14
figure 14

Visualization of observed data at different time points. The upper and lower rows represent the observations generated by \(a_1(u)\) and \(a_2(u)\), respectively

5 Experiment 2

In this section, we explore (1.1) and the identification of the flux function f(u) given by

$$\begin{aligned} f(u) = \frac{u^2}{u^2 + 0.5(1-u)^4} \left( 1-5(1 - u)^4 \right) \end{aligned}$$
(5.1)

and the diffusion function a(u)

$$\begin{aligned} a(u)= {\left\{ \begin{array}{ll} 0, &{} \text {if } u \in [0, 0.5] \\ (u - 0.5)u, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5.2)

with \(\alpha =1\).

We investigate a one-dimensional spatial domain with \(x \in [0,10]\) and a time interval [0, T] with \(T=2\). Observations are collected in the form of (3.15) and (3.25), with a numerical grid of \(N_{x}=200\) grid cells, \(N_{\textrm{obs}}=9\), and \(\Delta t^{\textrm{obs}}=0.1\). Our objective is to identify the unknown flux function f(u) and diffusion function a(u) for \(u\in [0,1]\). We use a set of initial states \(\{u_{k}^{0}\}_{k=1}^{K_{init}}\) such that \(0\le u_k^{0}(x)\le 1\) to learn f(u) and a(u). Table 4 shows the initial states used in the experiments that follow.

Table 4 The initial states
Fig. 15
figure 15

The graphical result of identification of flux function \(f_{\theta _f^*}(u)\), diffusion function \(a_{\theta _a^*}(u)\) and \(A_{\theta _a^*}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

As in Sect. 4, we employ one S-Net to learn f(u), and four separate S-Nets corresponding to \(u \in [0, 0.25)\), \(u \in [0.25, 0.5)\), \(u \in [0.5, 0.75)\), and \(u \in [0.75, 1.0)\) to learn a(u). The initial values of the neural networks are randomly generated. Additionally, for the piecewise affine functions method that represents unknown functions, we set \(n=10\) in (3.11).

5.1 Identification of f(u) and a(u) using standard observations (Data I) with four S-Nets for representing a(u)

The S-Net used to represent f(u) consists of six hidden layers, while the S-Nets used to represent a(u) have only one hidden layer. The learned functions of \(f_{\theta _f^*}(u)\), \({a}_{\theta _a^*}(u)\) and \({A}_{\theta _a^*}(u)\) are shown in Fig. 15a–c, respectively. The learned function \(f_{\theta _f^{*}}(u)\) accurately approximates the true function f(u), but for the diffusion function a(u), noticeable discrepancies are observed for \(u \in [0.8, 1.0)\). Consequently, some error exists between \({A}_{\theta _a^*}(u)\) and A(u).

5.2 Identification of f(u) and a(u) using standard observations (Data I) with the piecewise affine functions for representing a(u)

In this section, the experimental settings mirror those in Sect. 5.1, except for the representation of the unknown function. The learned functions of \(\hat{f}(u)\), \(\hat{a}(u)\), and \(\hat{A}(u)\) are presented in Fig. 16a, 17a, and 17b, respectively. Aside from the interval \(u \in (0.2, 0.35)\), \(\hat{f}(u)\) is generally close to f(u). However, due to the numerous piecewise affine functions used in learning \(\hat{f}(u)\), it contains multiple non-smooth points. This non-smoothness results in considerable fluctuations in the predicted solutions, as depicted in Fig. 18. The learning performance of \(\hat{a}(u)\) is suboptimal, especially for \(u \in [0.75, 1.0]\). However, \(\hat{A}(u)\) can capture the overall trend of A(u). Figure 18 demonstrates that regardless of the chosen method for representing f(u), maintaining its smoothness is essential.

Is it so that the lack of accuracy in the predicted solutions stems essentially from a lack of precision in f(u), and not in a(u)? To investigate this, we use the predicted \(\hat{f}(u)\) together with the true a(u) to compute the solutions. Figure 18b illustrates that the loss of accuracy in the solutions indeed comes from f(u). Increasing the number of piecewise affine functions improves the model’s performance to a certain extent. Figure 16b demonstrates the learned \(\hat{f}(u)\) based on \(n=20\), which is double the value used in Fig. 16a. However, due to computing resource limitations, n cannot be increased indefinitely. The non-smooth nature of the piecewise affine functions still leads to fluctuations in the predicted solution, as illustrated in Fig. 18c.

Fig. 16
figure 16

Learned \(\hat{f}(u)\) based on the piecewise affine functions with \(n=10\) and \(n=20\), respectively

Fig. 17
figure 17

The graphical result of identification of diffusion function \(\hat{a}(u)\) and \(\hat{A}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

Fig. 18
figure 18

u(x, t) at \(t=2.0\) based on the initial state \(u_1^0\) in Table 4. The left subplot displays the solution generated by \(\hat{f}(u)\) and \(\hat{a}(u)\), both with \(n=10\). The middle subplot presents the solution derived from \(\hat{f}(u)\) with \(n=10\) and the true a(u). The right subplot features the solution obtained from \(\hat{f}(u)\) with \(n=20\) and \(\hat{a}(u)\) with \(n=10\)

5.3 Identification of f(u) and a(u) using particle trajectories (Data II) with four S-Nets for representing a(u)

In this experiment, we use a six-layer S-Net to represent f(u), while the S-Nets used for a(u) have only one hidden layer. We select ten particles and use their trajectories as observation data. The initial positions of these particles at \(t=0\) are given by

$$\begin{aligned} Z^{0} = \Bigl \{0.025, 1.025, 2.025, 3.025, 4.025, 5.025, 6.025, 7.025, 8.025, 9.025 \Bigr \}. \end{aligned}$$
(5.3)

The learned functions \(f_{\theta _f^*}(u)\), \(a_{\theta _a^*}(u)\), and \(A_{\theta _a^*}(u)\) are displayed in Fig. 19a–c, respectively. There is a slight difference between \(f_{\theta _f^{*}}(u)\) and f(u). However, given the small amount of data, such an error seems acceptable.

Fig. 19
figure 19

The graphical result of identification of flux function \(f_{\theta _f^*}(u)\), diffusion function \(a_{\theta _a^*}(u)\) and \(A_{\theta _a^*}(u)\). The red line represents the true function, while the blue line illustrates the learned function (color figure online)

5.4 Discussion

5.4.1 Analyzing mean and maximum errors in f(u), A(u), and u(x, T)

We analyze the mean and maximum errors in f(u), A(u), and u(x, T) learned in the different situations. The results are shown in Table 5. The main observation is that the findings from Experiment 1 largely carry over to the case studied in Experiment 2, with the rational flux function (5.1) and the diffusion function (5.2) as the unknown functions.

Table 5 Analyzing mean and maximum errors in f(u), A(u), and u(x, T)

6 Conclusion

In this paper we have explored how to learn both the flux function f(u) and the diffusive flux A(u) for a degenerate convection-diffusion model (1.1). Since \(a(u)=A'(u)\) may vanish on intervals, solutions develop discontinuities and must be interpreted in the sense of entropy solutions (1.3). Building on experience with how to learn f(u) for the pure conservation law case \(a(u)=0\) [53], we here extend the methodology to also account for the unknown a(u). The essential ingredients for successful identification of f(u) and a(u) are a proper combination of acquiring observation data based on several randomly selected initial data and a certain regularity imposed on the candidate functions for f and a by using the symbolic neural network structure. We conducted experiments to evaluate the effectiveness of the proposed approach, which showed promising results. To provide a comprehensive assessment, we explored both standard observational datasets and particle trajectory-based observational datasets. The main findings are:

  • The method shows a strong ability to learn the correct flux function f(u) and diffusive flux A(u); however, there is more uncertainty related to the identification of the correct a(u), as the impact from this term is typically much weaker.

  • The use of trajectory data \(z_j(t)\), generated by solving the ODE (3.18) which depends on the solution u(x, t) of (1.1), gives the interesting finding that the quality of the identification of f and A is as good as that achieved by using observations of u(x, t) directly. However, in the first case we only need a small fraction of the number of data points used in the latter case. This suggests that there is room for improved learning by using clever choices of observation data.

  • The symbolic neural network based functions impose regularity when we search for candidate functions f(u) and A(u) which appears to be a robust and convenient approach when we work with hidden PDEs where solutions are discontinuous.

In conclusion, the proposed method offers a reliable and robust way to recover hidden degenerate convection-diffusion equations of the form (1.1) from observation data collected, for instance, from an experimental setup.