Identification of nonlinear conservation laws for multiphase flow based on Bayesian inversion

Conservation laws of the generic form c_t + f(c)_x = 0 play a central role in the mathematical description of various engineering-related processes. Identification of an unknown flux function f(c) from observation data in space and time is challenging because the solution c(x, t) develops discontinuities in finite time. We explore a Bayesian type of method based on representing the unknown flux f(c) as a Gaussian random process (parameter vector), combined with an iterative ensemble Kalman filter (EnKF) approach, to learn the unknown, nonlinear flux function. As a testing ground, we consider the displacement of two fluids in a vertical domain where the nonlinear dynamics result from a competition between gravity and viscous forces. This process is described by a multidimensional Navier–Stokes model. Subject to appropriate scaling and simplification constraints, a 1D nonlinear scalar conservation law c_t + f(c)_x = 0 can be derived with an explicit expression for f(c) for the volume fraction c(x, t). We consider small (noisy) observation data sets in terms of time series extracted at a few fixed positions in space. The proposed identification method is explored for a range of different displacement conditions, with flux functions ranging from purely concave to highly non-convex f(c). No a priori information about the sought flux function is given except a sound choice of smoothness for the a priori flux ensemble.
It is demonstrated that the method possesses a strong ability to identify the unknown flux function. The role played by the choice of initial data c_0(x), as well as various types of observation data, is highlighted.

Keywords Nonlinear conservation law · Multiphase model · Data-driven mathematical modelling · Gaussian random process · Iterative ensemble (Kalman filter) smoother · Bayesian inversion

1 Introduction

Learning of PDE models from data in fluid mechanics
Scalar nonlinear conservation laws of the form

c_t + f(c)_x = 0, (1)

represent a rich class of partial differential equations (PDEs) that arise in many different applications, including multiphase flow in porous media and wellbore operations relevant for subsurface flow. Here, c = c(x, t) is the unknown variable which varies in a one-dimensional (1D) domain x ∈ [0, L] as time t elapses. The conservation law (1) must be augmented with an initial state c(x, t = 0) = c_0(x) and appropriate assumptions about boundary behavior at x = 0 and x = L. Starting with general formulations based on first physical principles in terms of conservation of mass, momentum, and energy, one can derive scalar nonlinear conservation laws in the form (1) with an explicit expression for the flux function f(c). However, there are many situations where such derivations are intractable or even impossible because they become too complicated or the governing physical laws are unknown. On the other hand, one nowadays often has access to spatio-temporal observation data c(x_i, t_j) from full numerical solution of the underlying conservation equations, or from controlled experiments. A natural question then is to what extent we are able to identify the unknown flux function f(c) from such data. Currently, there is much focus on discovering hidden PDE models from spatio-temporal observation data [1]. A prototype example of a PDE model to learn from data is Burgers' viscous equation

c_t + λ_1 c c_x = λ_2 c_{xx}, (2)

where the main purpose is to identify the unknown parameters λ_1 and λ_2 [2,3]. However, to the best of our knowledge, such methods have not been used to identify the unknown nonlinear function f(c) in (1), which appears to be a more complex problem due to the loss of regularity and the need for considering weak solutions supplemented with suitable entropy conditions [4–6].
More precisely, when (1) is considered on R (Cauchy problem), the unique weak entropy solution c can be characterized by the following integral inequality (the Kruzhkov entropy condition):

∫_0^T ∫_R ( |c − k| φ_t + sgn(c − k)( f(c) − f(k) ) φ_x ) dx dt + ∫_R |c_0(x) − k| φ(x, 0) dx ≥ 0, (3)

for all nonnegative φ ∈ C_c^∞(R × [0, T)) and any constant k ∈ R. The fact that solutions of (1) cannot be understood in a classical sense, but only in the sense of (3), hampers the use of many modern neural-network-based methods.
In this paper, we will focus on combining Gaussian random processes to represent the unknown f(c) through a continuous piecewise linear (affine) function with an iterative ensemble Kalman filter (EnKF) approach to update the parameter vector and thereby learn the nonlinear flux function. The EnKF method we use is similar to the one employed in the reservoir engineering setting [7,8]. The EnKF was first introduced as a state estimation approach within oceanography [9], and it was superior to older Kalman filter methods for large-scale nonlinear problems. Subsequently, it was introduced as an approach for parameter estimation in reservoir engineering [10–13]. A number of iterative approaches have since been suggested, e.g., [7,14,15]. An independent line of research was started in [16], which suggested using the ensemble Kalman filter update equation for solving general inverse problems, similar to the ensemble smoother approach. This approach was denoted Ensemble Kalman Inversion (EKI) and represents a collection of algorithms [16–19] motivated by ideas from the EnKF [9].

Previous studies of learning nonlinear flux function from data
The problem of flux function identification related to the nonlinear conservation law (1) has been studied by different researchers over the last two decades. Mathematically oriented works typically focus on establishing a precise understanding of the relation between initial data, observation data, and the underlying flux function. Kang and Tanuma investigated strictly convex flux functions and problems where the solution evolves to a single shock in finite time [20]. Their work proved that the flux function can be uniquely determined on an interval [0, δ] with δ > 0 from late-time observation of the shock corresponding to a given initial datum [20]. James and Sepúlveda formulated the inverse problem of flux identification as that of minimizing a suitable cost function [21]. The flux identification problem by cost function minimization was also studied by Castro and Zuazua [22], who proved the existence of minimizers for solutions with discontinuities and developed appropriate descent algorithms for such problems. Holden et al. used the front-tracking algorithm to reconstruct the flux function from observed solutions to problems with suitable initial data [23]. Recent studies have addressed the reconstruction of the flux function for sedimentation problems that involve the separation of a flocculated suspension into a clear fluid and a concentrated sediment [24]. As discussed by Diehl, estimation of the batch settling flux function is sensitive to the choice of initial condition, and performing experiments with two different initial conditions can enable identification of both convex and concave parts of the function [25]. Subsequently, Bürger and Diehl showed that the inverse problem of identifying the batch flux density function has a unique solution and derived an explicit formula for the flux function [26]. Betancourt et al. applied the flux identification approach of [26] to experimental data by solving a constrained least-squares minimization problem [27].
The flux identification approach of [26] with different initial conditions (the Kynch test and the Diehl test, respectively) was used in the context of a cylindrical vessel combined with data from experiments [28]. An approach based on symbolic multi-layer neural networks was proposed in [29] as a method to recover hidden nonlinear conservation laws. Work on Bayesian inversion in the context of hyperbolic conservation laws is limited. We refer readers to [30,31] for more theoretical aspects of well-posedness related to such Bayesian inversion relevant for identification of initial data and flux function.

The contribution of this work
The inverse problem associated with (1) that we are interested in takes the form

d = h(θ) + η, (4)

where h is the solution operator associated with (1), θ is the parameter vector (i.e., a representation of the unknown flux function f), and d is the selected observation data with noise η. The main contribution of this paper is to explore the EnKF-based approach for general scalar conservation laws occurring in a multiphase flow setting, which involves a family of non-convex flux functions. The EnKF method allows us to extract statistical information about the unknown flux f through the parameter vector θ from noisy observation data d. To the best of our knowledge, this is the first work which combines EnKF, sparse observation data, and entropy solutions. The method we use is flexible regarding which observation data d to include in the cost functional (loss function) and is gradient-free. Our aim is to obtain a practical and efficient algorithm with the potential to identify the physically relevant flux function behind the observation data. A useful and unique feature of the assimilation method is that lack of uniqueness in the determination of the flux function is signaled by a larger spread in the updated ensemble of candidate flux functions, suggesting that more observation data may be added. The fluid mechanical setting we focus on involves displacement of one fluid by another in an annular geometry. In the following, we briefly describe the more general context of Newtonian fluid displacements in a multidimensional setting as described by the Navier–Stokes model. More details are provided in Appendix A.

Mathematical multiphase model for Newtonian fluid displacements
As the model problem for the current study, we consider laminar fluid displacement within the annular space formed by two concentric cylinders. More specifically, we consider the fluids to be incompressible and Newtonian, and the displacing fluid to be injected from the top. We focus on iso-viscous displacements, i.e., we assume the fluids have the same constant viscosity, and consider cases where the displacing fluid may be either denser or lighter than the displaced fluid [32,33]. The laminar displacement flows are governed by the equations of mass and momentum conservation, and we will assume that the concentration of the respective fluids follows an advection-diffusion equation. Specifically, the assumption of incompressibility ensures that the velocity u is solenoidal,

∇ · u = 0. (5)

The momentum conservation equation for fluid phase i is as follows:

ρ_i ( ∂u/∂t + u · ∇u ) = −∇p + μ ∇²u + ρ_i g. (6)

Here, μ is the common viscosity of the two fluids, ρ_i is the mass density of fluid i, and g corresponds to the gravitational acceleration. We let i = 1 denote the displacing fluid and i = 2 the displaced fluid. We refer to Fig. 1 for an illustration of the physical setting of the displacement process. Letting C ∈ [0, 1] denote the volume fraction of displacing fluid, the evolution of this scalar field is given by the advection equation

∂C/∂t + u · ∇C = 0. (7)

In Appendix A, we briefly show that, subject to certain simplifying assumptions, Eqs. (5), (6) and (7) can be combined to obtain a one-dimensional conservation equation for the annular cross-sectionally averaged concentration c(x, t) = (1/A) ∫_A C dA:

c_t + f(c; χ)_x = 0. (8)

The derivation shown in that section produces an explicit expression for the flux function f(c; χ), which depends on c and the dimensionless number χ representing the balance between buoyant and viscous stresses.

Problem statement
Motivated by the above derivation of a conservation law (8) from the general model (5)–(7), subject to certain constraints, the problem we explore in this work is: How can we reconstruct the unknown flux function f(c) when we only have information about the initial state c_0(x) and a relatively sparse set of noisy observation data d? A special challenge with identification of the nonlinear flux function of a conservation law is that solutions typically contain discontinuities, i.e., a jump described by a left and right state (c_l(t), c_r(t)) with c_l ≠ c_r moving along a certain path x(t) in the x-t-plane, where no information is used about the specific form of f(c) in the interval [min(c_l, c_r), max(c_l, c_r)] [4]. Under what circumstances, e.g., what knowledge about the initial state and corresponding observation data, are we able to recover a reliable approximation of the flux function f(c)?
Observation data can be given in terms of states at specific times, e.g., c_obs(·, t_j) at fixed times t_j. The natural initial data c_0(x) to consider is one which contains an initial jump from 1 to 0, mimicking the situation depicted in Fig. 1 where fluid 1 displaces fluid 2. Consequently, there will be a "mixing zone" that evolves over time, i.e., a region where the volume fraction c(x, t) stays between ε and 1 − ε, as defined by the set

E(t) = {x : ε ≤ c(x, t) ≤ 1 − ε}, (9)

for some chosen ε > 0. One may envision using data like the position and velocity of this mixing zone as well.
Our framework is based on an ensemble Kalman filter (EnKF) approach combined with the use of Gaussian random processes to approximate the unknown function f(c). Our starting point is that we seek to uncover the unknown flux function f(c) by relying only on time-dependent observations at fixed positions. We assume that "sensors" at four positions x*_i, for i = 1, ..., 4, have been placed in the region where the interface between fluid 1 and fluid 2 is located. We typically start by exploring how well we can learn the hidden flux function from data based on the case with an initial jump from 1 to 0. Then, we proceed by modifying this initial state such that it includes some intermediate states between 0 and 1. This allows for collecting data that potentially contain more information about the unknown flux function. We seek to keep the amount of data used for the learning process at a minimum. We will address questions like: (i) How well can we learn f(c) from noisy data of the volume fraction {c(x*_i, t)}_{i=1}^4 over a training time period [0, T_1]? We will consider observation data corresponding to different flow regimes as represented by the parameter χ in (8). This amounts to different shapes of the involved flux function, ranging from essentially concave to highly non-convex forms. A representative set of different examples of f(c; χ) is shown in Fig. 19.
(ii) How does the choice of initial condition impact the identified flux function? Can different initial conditions result in improved flux function identification? (iii) How can information about the position and velocity of the mixing zone be used? (iv) Are we able to also capture information about the role played by the parameter χ? That is, are we able to extract a good approximation of f(c; χ) over a range of different choices of χ?

Structure of this work
The ensemble-based method we explore in this work is described in Sect. 2. In Sect. 3, we give further background information related to the implementation of the identification method and which observation data to use. Finally, in Sect. 4 we carry out a range of different tests of the ability to learn the unknown flux function based on synthetically generated noisy data. We test the method for nonlinear flux functions that contain one or several inflection points.

Method based on Gaussian random processes and EnKF
Based on the given observation data, our aim is to identify a conservation law (8), now written in the form

c_t + γ f(c)_x = 0, (10)

where f(c) is the unknown, possibly nonlinear flux function and γ is an unknown constant. The introduction of the parameter γ is motivated by the identification of PDEs as discussed in [2,3] and represented by (2). It is also convenient from the point of view of the ensemble-based method, where γ serves as a global unknown quantity, whereas the parametrization of f(c) accounts for the local variations as a function of c. To estimate f(c), we have chosen an approach motivated by the success of large-scale parameter estimation in reservoir engineering [8], where ensemble-based approaches are used to identify spatially varying fields of unknown parameters. The complexity of representing these fields may vary, but quite often they can be represented as Gaussian random fields. Here, we adopt this approach by assuming that f(c) can be identified by representing it as a Gaussian process (a one-dimensional Gaussian random field). For numerical implementation, f(c) is represented as a piecewise linear function on a designated equidistant grid {c_p}_{p=1}^P.

Gaussian random processes
We will represent f(c) as a Gaussian process with the property that the correlation between f_1 = f(c_1) and f_2 = f(c_2) is given as

C(c_1, c_2) = σ² exp( −|c_1 − c_2|² / (2 l²) ), (11)

where σ is the standard deviation and l is the correlation length, i.e., a scaling parameter that determines how fast the correlation decays as a function of the distance between c_1 and c_2. To calculate such a Gaussian process on P grid points {c_p}_{p=1}^P, we first sample a P-vector ζ of independent zero-mean normally distributed random numbers with variance 1. A Gaussian process ψ with the required properties is then obtained by letting

ψ = L ζ, (12)

where C = L L^T. A common choice is to obtain L by taking the Cholesky decomposition of C. Note that the Cholesky decomposition is only required once for all samples of f(c) that are to be generated. For the modest size P used to describe the unknown f(c) in this work, this leads to an accurate algorithm with sufficient efficiency.
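The sampling procedure (11)-(12) can be sketched as follows. This is a minimal illustration, not the paper's code: the squared-exponential kernel form, the jitter term, and the function name are our assumptions.

```python
import numpy as np

def sample_flux_gp(P=52, sigma=0.25, ell_factor=20, n_samples=1, seed=0):
    """Sample candidate flux functions f(c) as 1D Gaussian random processes
    on an equidistant grid {c_p}, using one Cholesky factorization C = L L^T."""
    rng = np.random.default_rng(seed)
    c = np.linspace(0.0, 1.0, P)
    dc = c[1] - c[0]
    l = ell_factor * dc  # correlation length proportional to the grid spacing
    # covariance matrix from the (assumed squared-exponential) kernel (11)
    C = sigma**2 * np.exp(-(c[:, None] - c[None, :])**2 / (2.0 * l**2))
    C += 1e-10 * np.eye(P)           # small jitter for numerical stability
    L = np.linalg.cholesky(C)        # computed once, reused for all samples
    zeta = rng.standard_normal((P, n_samples))  # independent N(0,1) entries
    return c, L @ zeta               # each column is one GP sample psi = L zeta
```

Because L is factored once, drawing additional ensemble members costs only a matrix-vector product per sample.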

Ensemble smoother-multiple data assimilation
The initial ensemble E^f might be obtained using the algorithm presented in Sect. 2.1. The updated ensemble E^a is obtained using Algorithm 1. The ensemble smoother with multiple data assimilation (ES-MDA) was first presented in [7]. Algorithm 1 deviates slightly from their work by replacing the stochastic ensemble Kalman filter (see, e.g., [34]) for the update part with a square-root filter. The requirement Σ_{i=1}^{N_it} 1/α_i = 1 on the scaling factors ensures that, for a linear problem with a Gaussian prior, the posterior solution will be the correct one. For nonlinear problems, the solution will only be approximate. We use the square-root filter presented in Algorithm 2. It involves two main steps: (i) compute the updated mean ψ^a based on K; (ii) compute the corresponding updated ensemble E^a based on T. The basic idea behind this square-root filter was introduced in [35] and further developed in [36]. Square-root filters aim at producing a posterior ensemble with the correct posterior covariance matrix, and a transformation matrix T achieving this has to be found. It was not immediately realized that this transformation matrix should be mean preserving (which is equivalent to preserving the zero-mean property of A^f). This was pointed out in [37,38]. Our choice of transformation matrix T fulfills both requirements. A discussion, and historical account, of square-root filters is also included in [34]. In Algorithm 2, the singular value decomposition of a symmetric matrix is computed. The singular value decomposition of a symmetric matrix A has the form A = U Λ U^T, where U is a unitary matrix and Λ is a diagonal matrix. Calculating this singular value decomposition is feasible as the size of the matrix is N_ens × N_ens.
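To convey the structure of the multiple-data-assimilation loop, the following sketches the widely used stochastic (perturbed-observation) variant of ES-MDA; the paper's Algorithms 1-2 instead use a square-root update, so this is an illustrative stand-in, and all names and the linear test problem are our assumptions.

```python
import numpy as np

def es_mda(theta0, forward, m, Cd, alphas, seed=0):
    """Stochastic ES-MDA sketch.
    theta0 : (n_par, N_ens) initial ensemble of parameter vectors
    forward: maps a parameter ensemble to predicted data (n_obs, N_ens)
    m      : (n_obs,) observed data, Cd: observation-error covariance
    alphas : inflation factors with sum(1/alpha_i) = 1."""
    rng = np.random.default_rng(seed)
    assert abs(sum(1.0 / a for a in alphas) - 1.0) < 1e-8
    theta = theta0.copy()
    n_obs = m.size
    for a in alphas:
        d = forward(theta)                             # predicted observations
        A = theta - theta.mean(axis=1, keepdims=True)  # parameter anomalies
        D = d - d.mean(axis=1, keepdims=True)          # data anomalies
        N = theta.shape[1]
        C_td = A @ D.T / (N - 1)                       # cross-covariance
        C_dd = D @ D.T / (N - 1)                       # data covariance
        K = C_td @ np.linalg.inv(C_dd + a * Cd)        # Kalman gain with inflated noise
        # perturb observations with inflated noise (stochastic EnKF update)
        pert = rng.multivariate_normal(np.zeros(n_obs), a * Cd, size=N).T
        theta = theta + K @ (m[:, None] + pert - d)
    return theta
```

For a linear forward model and Gaussian prior, the condition on the α_i makes the final ensemble sample the correct posterior; for the nonlinear solver h in (4), the update is only approximate, as noted above.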

Discrete numerical scheme
We base our discrete version of (10) on the Rusanov scheme [4], which takes the form

c_j^{n+1} = c_j^n − (Δt/Δx) ( F_{j+1/2}^n − F_{j−1/2}^n ), (13)

with j = 2, ..., N_x − 1, where the Rusanov flux takes the form

F_{j+1/2}^n = (γ/2) ( f(c_j^n) + f(c_{j+1}^n) ) − (|γ| M/2) ( c_{j+1}^n − c_j^n ). (14)

For the first and last grid cells, c_1^{n+1} and c_{N_x}^{n+1} are determined by the boundary conditions. We use a slightly modified version of the Rusanov flux by relying on a global estimate M of the largest wave speed max |f'(c)| [4]. The CFL condition [4] determines the magnitude of Δt for a given Δx through the relation

(Δt/Δx) |γ| M = CFL ≤ 3/4 < 1. (15)

We have used this condition when we compute solutions based on the ensemble of flux functions involved in the learning process.
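A compact sketch of the scheme (13)-(15) is given below. The outflow-type boundary handling, the finite-difference estimate of M, and the function name are simplifying assumptions for illustration.

```python
import numpy as np

def rusanov_solve(f, c0, gamma=1.0, L=10.0, T=1.0, Nx=400, cfl=0.75):
    """Solve c_t + gamma*f(c)_x = 0 with the Rusanov scheme and a global
    wave-speed bound M; boundary cells simply copy their inward neighbor."""
    x = np.linspace(0.0, L, Nx)
    dx = x[1] - x[0]
    c = c0(x).astype(float)
    # global bound M on |f'(c)| estimated by finite differences on a fine grid
    cc = np.linspace(0.0, 1.0, 2001)
    M = np.max(np.abs(np.diff(f(cc)) / np.diff(cc)))
    dt = cfl * dx / (abs(gamma) * M)      # CFL: (dt/dx)*|gamma|*M = 3/4 < 1
    t = 0.0
    while t < T - 1e-12:
        dt_step = min(dt, T - t)
        fc = gamma * f(c)
        # Rusanov numerical flux (14) at interfaces j+1/2
        F = 0.5 * (fc[:-1] + fc[1:]) - 0.5 * abs(gamma) * M * (c[1:] - c[:-1])
        c[1:-1] -= dt_step / dx * (F[1:] - F[:-1])  # update (13)
        c[0], c[-1] = c[1], c[-2]         # crude boundary treatment
        t += dt_step
    return x, c
```

Run on Burgers' flux f(c) = c²/2 with Riemann data jumping from 1 to 0, the scheme produces the entropy shock moving at speed 1/2, which is a convenient sanity check.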

Preliminaries for numerical investigations
We focus on the flux function f(c; χ) obtained in (A.3) when we generate synthetic observation data. Initially, we are interested in an initial state (17) where we have a pure interface separating the two fluids in the domain [0, 10], i.e., a single jump from c = 1 to c = 0. This is motivated by the physical setting for carrying out laboratory experiments. However, we may expect to see limitations in the identification of the true flux function when it is based on data collected using (17). That is, the discontinuities that evolve may prevent collecting observation data for all c in [0, 1]. This motivates us to also consider an initial state (18) which includes appropriately chosen intermediate states c_I and c_II between the jump from 1 to 0. This form of the initial data may imply that more information about the flux function can be obtained when we collect time-dependent data at the fixed positions. The motivation for using (18) instead of (17) is to seek to recover more details of the unknown flux function f(c), recalling that for a shock wave (c_l, c_r) with c_l ≠ c_r, it is only the linear slope connecting (c_l, f(c_l)) and (c_r, f(c_r)) that is involved. That is, the part of f(c) within the interval [min(c_l, c_r), max(c_l, c_r)] is not used in the construction of the exact entropy solution [4]. In our investigations, we will also consider modified versions of (18) which involve more or fewer intermediate states connecting the initial jump from 1 to 0.
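Initial states in the spirit of (17) and (18) can be sketched as piecewise constant profiles. The jump position and the plateau break points below are our assumptions; the paper's exact formulas are given in its equations.

```python
import numpy as np

def c0_pure_interface(x, x_jump=5.0):
    """Sketch of (17): a sharp interface with c = 1 behind and c = 0 ahead
    of an assumed jump position x_jump."""
    return np.where(x < x_jump, 1.0, 0.0)

def c0_intermediate(x, cI=0.6, cII=0.3, breaks=(3.0, 4.0, 5.0)):
    """Sketch of (18): plateaus at intermediate states cI and cII inserted
    between the jump from 1 to 0 (break positions are assumed)."""
    c = np.ones_like(x, dtype=float)
    c[x >= breaks[0]] = cI
    c[x >= breaks[1]] = cII
    c[x >= breaks[2]] = 0.0
    return c
```

The second profile exposes additional jump pairs, so the evolving solution visits more of the interval [0, 1] and the sensors collect richer data.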

Observation data set
We consider the data set (19) corresponding to collecting time-dependent data at the fixed positions {x*_i}. Herein, Δ_t c(x*_i, t) refers to a discrete difference along the time axis involving the specified times where observation data are collected. Note that the time derivative dc/dt does not make sense, as the true solution typically involves a function that is discontinuous in the time variable. We also explore the effect of augmenting the observation data with information about the mixing zone, as expressed by (21), where E(t) is defined by (9). We have used ε = 0.025 throughout the rest of this paper.
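Extracting the extent of the mixing zone E(t) from (9) amounts to a simple thresholding of the discrete solution; a minimal sketch (the function name is ours):

```python
import numpy as np

def mixing_zone_extent(x, c, eps=0.025):
    """Extent [x_min, x_max] of E(t) = {x : eps <= c(x,t) <= 1 - eps};
    returns None if the set is empty."""
    mask = (c >= eps) & (c <= 1.0 - eps)
    if not mask.any():
        return None
    return float(x[mask].min()), float(x[mask].max())
```

Tracking these endpoints over the observation times yields the mixing-zone position data, and their discrete time differences yield a velocity estimate.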
We consider a domain of length L = 10 and simulations over the time period [0, T] with T = 2. We apply a numerical grid composed of N_x = 1000 grid cells when we compute numerical solutions of (8) based on the numerical scheme (13) and (14). This is used both for obtaining the true solution and corresponding synthetic observation data (which we denote by m), as well as when we compute predictions based on the ensemble of estimated flux functions (which we denote by d). We specify times for collecting the time-dependent data and have set N_t^obs = 40, i.e., Δt_obs = 0.05. Hence, the number of collected data points in time is sparse compared to the number of local time steps Δt needed to compute numerical solutions through the discrete scheme (13) and (14). As training data, we will use points in the time interval [0, T_1] with T_1 = 1.

Main algorithm
The main algorithm, based on Algorithm 1 and Algorithm 2, is given through the following steps: 1. Generating the initial parameter vector θ_k. Generate an initial ensemble of samples {θ_k}. Herein, γ_k is a scalar drawn from the uniform distribution on [0, 2], and f_k is a vector representing the piecewise linear function generated as a Gaussian random process on the uniform grid {c_p}, where μ_k is a mean value randomly sampled from a uniform distribution on [0, 1] and the Gaussian process ψ_k = L ζ_k is constructed from the correlation matrix C, see (11) and (12) in Sect. 2.1. We have used σ = 0.25 and l = 20Δc as standard deviation and correlation length, respectively, in (11). We have used P = 52, such that the length of the parameter vector θ_k is 53 (the length of f_k plus the scalar γ_k).
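Step 1 can be sketched as follows; the stacking order of f_k and γ_k in θ_k, the kernel form, and the function name are our assumptions.

```python
import numpy as np

def initial_ensemble(N_ens=100, P=52, sigma=0.25, ell_factor=20, seed=0):
    """Step 1 sketch: each theta_k stacks a piecewise-linear flux sample f_k
    (a GP around a random mean mu_k ~ U[0,1]) and a scalar gamma_k ~ U[0,2]."""
    rng = np.random.default_rng(seed)
    c = np.linspace(0.0, 1.0, P)
    dc = c[1] - c[0]
    l = ell_factor * dc                       # correlation length l = 20*dc
    C = sigma**2 * np.exp(-(c[:, None] - c[None, :])**2 / (2.0 * l**2))
    Lchol = np.linalg.cholesky(C + 1e-10 * np.eye(P))  # factored once
    thetas = np.empty((P + 1, N_ens))
    for k in range(N_ens):
        mu = rng.uniform(0.0, 1.0)            # random mean level mu_k
        psi = Lchol @ rng.standard_normal(P)  # zero-mean GP sample psi_k
        thetas[:P, k] = mu + psi              # flux values f_k on the grid
        thetas[P, k] = rng.uniform(0.0, 2.0)  # global scaling gamma_k
    return c, thetas
```

With P = 52, each column of the returned matrix is one parameter vector θ_k of length 53, matching the dimensions stated above.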

Computing predicted observation
Using the discrete scheme (13) and (14) and the initial data (e.g., (17) or (18)), we obtain the discrete approximation denoted by c_k = c(x_j, t_n; θ_k) (to indicate its dependence on the flux function described by θ_k through γ_k f_k) until time T_1 = 1, and collect the predicted data set d_k = h(θ_k), where h(θ_k) refers to the data set (20). More precisely, it refers to the discrete version (25). Similarly, we consider the discrete version of (21) given by (26). By default, we include observation data of the form (25) as well as (26), combined as in (27). Herein, λ_1 and λ_2 are scaling parameters that ensure a proper balance between observation data of different types (i.e., their orders of magnitude should be comparable). We have used λ_1 = 1 and λ_2 = 1/2 consistently throughout the simulation results unless otherwise specified.
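Assembling the combined data vector in the spirit of (27) is a weighted concatenation; a minimal sketch (the function and argument names are ours):

```python
import numpy as np

def combine_data(d_sensor, d_mix, lam1=1.0, lam2=0.5):
    """Sketch of (27): sensor time-series data and mixing-zone data are
    concatenated with weights lam1, lam2 so that the two data types have
    comparable orders of magnitude."""
    return np.concatenate([lam1 * np.ravel(d_sensor), lam2 * np.ravel(d_mix)])
```

Setting lam2 = 0 reproduces the experiment discussed later where the mixing-zone information is excluded from the learning process.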

Initial data for Algorithm 1 (ES-MDA)

Based on Steps 1 and 2, we can form the initial ensemble E^f by collecting the pairs (d_k, θ_k), written as column vectors, together with the measurement data vector m, which refers to the measured (observed) data of the form (27).

Results
An illustration of the initial ensemble of samples {θ_k}_{k=1}^{100} is found in Fig. 2. After 3 iterations (Fig. 3c), there is still quite much uncertainty in the region c > 0.2, whereas after 5 iterations, the more uncertain region has been reduced to c > 0.4. However, as seen from the mean flux function in Fig. 3e, there is a systematic error in the learned f(c) for c > 0.4. What is the discrepancy between the predicted data based on the learned flux function in Fig. 3e and the observed data? This is visualized in Fig. 3f, where solid lines represent observed data (shown here without noise) and circles show the predicted behavior based on the mean flux in Fig. 3e. We observe that the time-dependent curves at x*_1 (blue) and x*_2 (green) coincide and contain no variations. This clearly makes the learning process challenging. Taking a closer look at the data associated with x*_4 (light blue), we also see a small discrepancy between observed and predicted data which is associated with the erroneous part of the learned f(c).
In order to collect more characteristic information, we now replace the initial data (17) by (18), specified as c_0(x; 0.6, 0.3). The result is shown in Fig. 4. Most strikingly, after 3 iterations there is quite good agreement on the shape of the flux function f(c) (see Fig. 4c), and the resulting mean of the ensemble vector {θ_k^{(5)}} after 5 iterations indeed fits well with the true f(c; χ = 120), see Fig. 4e. The observed data associated with x*_1, ..., x*_4 are shown in Fig. 4f, from which we see that a larger part of the interval c ∈ [0, 1] is covered through the time-dependent data. This explains why the true flux f(c; χ = 120) can be identified much better than for the case with the initial data (17) shown in Fig. 3f. This is also visualized in the histograms shown in Fig. 5, where we compare the observation data sets for the two different initial states. Clearly, the use of c_0(x; 0.6, 0.3) (right panel) shows that the observed c-values are more uniformly distributed in [0, 1], although there are relatively large gaps between them.
An interesting aspect of the method is that it provides insight into which regions the learning process, based on the data used, finds more challenging when deciding on the shape of f(c). This information may be used to suggest other types of observation data that can help improve the learning of the flux function.
What observation data have we used in this learning process? In Fig. 6, we have illustrated the time-dependent data collected at the positions {x*_i}_{i=1}^4 given by (19) (left plot), whereas the collected data for the positions of the mixing zone as given by (22) with ε = 0.025 are shown in the right plot. 3% noise has been added to the observation data before being used for the learning process, i.e., we account for a measurement error η ∼ N(0, γ²I) associated with the observations d in (4), where γ = 0.03 and I is the identity matrix. The same applies to the mixing zone data in (26).
We may wonder what role is played by the information about the position of the mixing zone, i.e., the data (26). To test this, we run an identification simulation using only the data set (25), i.e., we set the scaling parameter λ_2 = 0 in (27). The result is shown in Fig. 7. The main observation is that there is a larger degree of uncertainty in the learning after 3 iterations (panel a) as compared to the case when we used the data set (27), illustrated in Fig. 4c. However, after 5 iterations, shown in Fig. 7b, the results are quite comparable. The mean flux function illustrated in Fig. 7c largely coincides with the true flux f(c; χ = 120), apart from c close to 1. For the sake of robustness, for example, to detect more quickly the possibility of backflow for larger c-values, we use the data set (27) in the continuation.
A main objective of this work is to demonstrate that the proposed methodology is general enough to learn the relevant nonlinear flux function for a large range of χ values in (A.3). Hence, we now use the same type of observation data as above, i.e., based on (27), and consider synthetic data for flux functions in the range χ ∈ [50, 300] combined with the initial data c_0(x; 0.6, 0.3). In Fig. 8, results are shown for synthetic data based on the conservation law with flux function f(c; χ = 50) (column a), f(c; χ = 150) (column b), f(c; χ = 200) (column c), and f(c; χ = 300) (column d). All cases reflect that the true flux function is to a large extent captured well, also when χ increases, which gives rise to a stronger convex dip for c ≥ 0.5. This suggests that the identification process covers well the different flow regimes corresponding to χ ∈ [50, 300].
We find that there is virtually no difference from the case with initial data c_0(x; 0.6, 0.3) used above (results not shown). Looking at the corresponding histograms in Fig. 9 for the case where we try to identify f(c; χ = 200), we see that they are quite similar, suggesting that the learning processes are of similar quality.
Moreover, we also check a further choice of initial data. As a test, we consider (A.3) with χ = 120 and find an excellent match between the learned f(c) and the true flux (results not shown). Next, we want to increase the number of intermediate states in the initial data, as given by (30). We used data based on (A.3) with χ = 120. The result is shown in Fig. 10 (upper row) and reflects a poorer approximation for c > 0.4 (loss of some finer details of f(c)) as compared to, e.g., Fig. 4, suggesting that these smaller initial jumps, relatively densely located, make the identification process slightly less efficient. Further testing based on (A.3) with χ = 200, shown in Fig. 10 (middle row), shows the same trend: the higher number of smaller initial jumps included in the initial data slightly impedes the identification of the flux function, see Fig. 10 (column c). In Fig. 10 (panels d and e), we also see the distribution of the c-values involved in the observation data corresponding to, respectively, the learning of f(c; χ = 120) (left) and f(c; χ = 200) (right). The lack of observation data for c ∈ (0.6, 0.9) appears to be a natural reason for the slightly poorer identification of f in this interval.
What is the effect of using fewer intermediate states? We consider initial data c_0(x) with fewer intermediate states. From Fig. 11, columns (a) and (b), we see that through the learning process the updated ensemble of flux functions centers around the true flux, both for f(c; χ = 120) (upper row) and f(c; χ = 200) (middle row), and identifies the part of the flux which is the harder one to learn. As for the case in Fig. 10, we see that it is challenging to capture the precise shape of f(c) for c > 0.6 (column c). The corresponding histograms are shown in Fig. 11d, e, and the lack of observation data for higher values of c is suggested as a natural reason for the poorer estimation of the flux functions in this region for both cases.

Temporal variation and mixing zone observation data can compensate for sparse distribution of c-values
To what extent is there room for a good identification of the unknown flux function when the observed c-value distribution in the interval [0, 1] is sparse? In the following, we explore the effect of using three different observation data sets d_k^A, d_k^B, and d_k^C. Here, d_k^B ignores information about the mixing zone behavior as expressed by (26), whereas d_k^C ignores information about the temporal gradients in the time-dependent observation data as expressed by (25). We want to explore the learning process based on, respectively, d_k^A, d_k^B, and d_k^C for a case where the true flux function involves a rather deep convex "dip" for smaller values of c. For that purpose, we consider f(c; χ = −200) and refer to Fig. 13 (third row) for an illustration of this flux function. Clearly, this is a potential challenge for the identification process, since shock wave solutions (decreasing jumps) depend on the upper concave envelope [4], which ignores information about the finer details related to this convex "dip". The resulting observation data distribution for the chosen initial state is shown in Fig. 12. It reveals a lack of observation data for small, intermediate, as well as high values of c.

What is the role played by the selected a priori representation of f (c)?
So far we have focused on investigations that can reveal insight into the interplay between the selected observation data and the possibility to identify the ground truth flux function f(c). This issue has largely to do with the special nature of the conservation law (1) and the ambiguity between observation data and the underlying flux function. This is reflected by the entropy solution of (1), where the convex hull of the given flux is employed in the construction of the solution [4]. We first provide some further characterization of the entropy solution before we proceed with testing of different a priori processes to represent f(c). For a discontinuity (c_l(t), c_r(t)) moving along a certain path in the x-t plane, the speed s of this jump is given by the Rankine-Hugoniot condition [4,39]

s = (f(c_l) - f(c_r)) / (c_l - c_r).   (34)

Furthermore, from (3) another equivalent formulation of the entropy condition can be derived for a valid discontinuity (c_l, c_r). This states that for all numbers v between c_l and c_r, the following two inequalities must hold:

(f(v) - f(c_l)) / (v - c_l) ≥ s ≥ (f(v) - f(c_r)) / (v - c_r),   (35)

where s is the shock speed (34). It is well known that (8) generally will generate shock wave solutions c(x, t) in finite time, i.e., solutions that contain one or several discontinuities expressed as a jump (c_l, c_r) with c_l ≠ c_r, despite the fact that the initial data c_0(x) are smooth [4,39]. As jumps arise and disappear in the solution over the time period during which observation data are collected, the data may lack information about f(c) [29,40]. An illustration of this situation is given in Fig. 14. In the left panel, we plot the flux function f(c) = c^2/(c^2 + (1 - c)^2). In the right panel, the entropy solution after a time T = 0.5 is shown. At time t = 0, the initial data c_0(x) involve one jump at x = 0 and another jump at x = 1.
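Both conditions are straightforward to check numerically for a given flux and jump. A minimal sketch (the convex model flux c^2 below is purely illustrative, not the displacement flux f(c; χ)):

```python
import math

def rh_speed(f, cl, cr):
    """Rankine-Hugoniot shock speed (34): s = (f(cl) - f(cr)) / (cl - cr)."""
    return (f(cl) - f(cr)) / (cl - cr)

def entropy_ok(f, cl, cr, n=200, tol=1e-9):
    """Check the inequalities (35): for all v strictly between cl and cr,
    (f(v) - f(cl))/(v - cl) >= s >= (f(v) - f(cr))/(v - cr)."""
    s = rh_speed(f, cl, cr)
    lo, hi = min(cl, cr), max(cl, cr)
    for i in range(1, n):
        v = lo + (hi - lo) * i / n
        left = (f(v) - f(cl)) / (v - cl)
        right = (f(v) - f(cr)) / (v - cr)
        if left < s - tol or right > s + tol:
            return False
    return True

f = lambda c: c * c  # illustrative convex flux
```

For a convex flux, a decreasing jump such as (c_l, c_r) = (1, 0) satisfies (35), while the increasing jump (0, 1) violates it and must instead be resolved by a rarefaction wave.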
The initial jump at x = 0 is instantly transformed into a solution that is a combination of a continuous wave (rarefaction wave) and a discontinuous wave (c_l, c_r) ≈ (0.3, 1.0), as dictated by the lower convex envelope shown in the left panel (green curve) [4]. Similarly, the initial jump at x = 1 is transformed into a combination of a continuous wave (rarefaction wave) and a discontinuous wave (c_l, c_r) ≈ (0.7, 0), in accordance with the upper concave envelope illustrated in the left panel (magenta curve) [4]. From this example, we see that we have no observation data that can directly reveal the shape of f(c) in the interval c ∈ [0.3, 0.7] (approximately). However, as demonstrated in Sect. 4, we can still obtain effective identification of the unknown flux function by a careful combination of the type of observation data and sound regularity imposed on the a priori ensemble to represent f(c).
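The envelope construction in this example can be reproduced numerically. The following sketch (grid resolution and hull algorithm are our own illustrative choices, not the authors' code) computes the lower convex envelope of f(c) = c^2/(c^2 + (1 - c)^2) on [0, 1] via Andrew's monotone-chain lower hull; where the envelope is strictly below f, the entropy solution replaces the continuous wave by a shock:

```python
def flux(c):
    return c * c / (c * c + (1.0 - c) ** 2)

def cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means a left turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

n = 400
cs = [i / n for i in range(n + 1)]
fs = [flux(c) for c in cs]
pts = list(zip(cs, fs))

# Lower hull: pop the last vertex while it lies on or above the new edge.
hull = []
for i in range(n + 1):
    while len(hull) >= 2 and cross(pts[hull[-2]], pts[hull[-1]], pts[i]) <= 0:
        hull.pop()
    hull.append(i)

# Piecewise-linear envelope values on the full grid.
env = [0.0] * (n + 1)
for a, b in zip(hull, hull[1:]):
    for j in range(a, b + 1):
        t = (cs[j] - cs[a]) / (cs[b] - cs[a])
        env[j] = (1.0 - t) * fs[a] + t * fs[b]
```

For this flux the envelope follows f on the convex part, leaves it near c ≈ 0.29, and runs straight to (1, 1), consistent with the shock (c_l, c_r) ≈ (0.3, 1.0) described above.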
In the following, we explore in more detail the role of the a priori distribution through the Gaussian process described in Sect. 2.1. In particular, we explore the effect of using an a priori ensemble θ_k = {γ_k, f_k}, k = 1, ..., N_ens, that is less regular and/or contains stronger variations. To focus specifically on the role of the smoothness/regularity of f_k(c), we fix γ_k = 1. For the investigations in this section, we employ as our testing ground the case with true flux given by f(c; χ = 300) and with initial data c_0(x; 0.6, 0.3).

5.1
What is the effect of using a priori ensemble with less smoothness/regularity?
We tested the use of a shorter correlation length l = 10Δc in (11), see Fig. 15, column (a). Results are comparable to those obtained with l = 20Δc and reported in Fig. 8 (right column). However, there is a mismatch between the identified flux and the ground truth for c ∈ [0.4, 0.6]. Looking at the discrepancy between predicted data and true data, we see that the error is very low in the interval t ∈ [0, 1] from which observation data are extracted. However, at later times t ∈ (1, 2] the consequences of the difference between the learned flux and the true one come to the surface, as seen from the observation data at x*_1 and x*_4. This shows that for the selected observation data, there is clearly room for this discrepancy between identified and true f without it being revealed by the loss function error. A natural remedy would be to include more observation data. The case with l = 5Δc in (11) (or lower values) becomes impractical to deal with, since the different members of the a priori ensemble involve flux functions where f′(c) can take very high values. As seen from the description of the discrete scheme (14) and (15), it is clear that high values of f′ put strong constraints on the size of the discretization step Δt. Noting that f′(c) has a clear physical interpretation as the speed at which c-values travel, it is reasonable to use this information when we decide how to generate the a priori ensemble. For the fluid displacement problem, we know from the experimental setup that f′(c) cannot take such high values, and we therefore account for this information when we design the a priori process to represent f. This serves as a justification for using an a priori ensemble with a higher correlation length.
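To make the role of the covariance concrete, the following sketch draws a priori flux samples on a uniform c-grid from a Gaussian process. The squared-exponential kernel corresponds to the Gaussian covariance of (11), and the Matérn kernels to the less smooth alternatives considered next; the grid size, σ, and l values are illustrative assumptions, not the paper's settings:

```python
import math
import random

# Covariance functions with r = |c_i - c_j|.
def squared_exp(r, sigma, l):
    return sigma**2 * math.exp(-r * r / (2.0 * l * l))

def matern_52(r, sigma, l):
    a = math.sqrt(5.0) * r / l
    return sigma**2 * (1.0 + a + a * a / 3.0) * math.exp(-a)

def matern_32(r, sigma, l):
    a = math.sqrt(3.0) * r / l
    return sigma**2 * (1.0 + a) * math.exp(-a)

def cov_matrix(cs, kernel, sigma, l, jitter=1e-8):
    # Small diagonal jitter keeps the matrix numerically positive definite.
    n = len(cs)
    return [[kernel(abs(cs[i] - cs[j]), sigma, l) + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(C):
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(C[i][i] - s) if i == j else (C[i][j] - s) / L[j][j]
    return L

def sample_flux(cs, kernel, sigma, l, rng):
    """One a priori flux sample f(c_i) = (L z)_i with z ~ N(0, I)."""
    L = cholesky(cov_matrix(cs, kernel, sigma, l))
    z = [rng.gauss(0.0, 1.0) for _ in cs]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(cs))]

n = 51
cs = [i / (n - 1) for i in range(n)]
rng = random.Random(0)
f_se = sample_flux(cs, squared_exp, 0.25, 0.20, rng)  # smooth: l = 20 grid cells
f_m32 = sample_flux(cs, matern_32, 0.25, 0.20, rng)   # rougher sample paths
```

A shorter l, or a rougher kernel, yields samples with larger finite-difference slopes f′(c), which through the stability constraint of the scheme (14)-(15) forces a smaller time step Δt.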
What is the effect of using a less smooth a priori distribution, e.g., the Matérn kernel with ν = 3/2 or ν = 5/2, with l = 20Δc as before? This test is quite similar to the above case with a shorter correlation length. The Matérn kernel with ν = 5/2 [41] takes the following form, using the same notation as in (11):

C_{ν=5/2}(r) = σ^2 (1 + √5 r/l + 5r^2/(3l^2)) exp(−√5 r/l),

with r = |c_i − c_j|. Results are shown in Fig. 15, column (b). We found that the identified flux f(c) contained a systematic difference from the true flux for c ∈ [0.4, 1.0]. However, looking at the predicted data curves (bottom row), we find that they largely fit the observed data curves in the time period t ∈ [0, 1]. This example again illustrates the challenge when the entropy solution, as characterized by (35) and illustrated in Fig. 14, is the source of the observation data. An improved learning of the flux function could most likely be obtained by incorporating more observation data that could reveal the gap between the entropy solutions generated by the learned and the true flux function, respectively. We also tested the use of the a priori Matérn kernel with ν = 3/2,

C_{ν=3/2}(r) = σ^2 (1 + √3 r/l) exp(−√3 r/l).

Results are illustrated in Fig. 15, column (c). The deviation between the identified flux and the true one is clearly larger, whereas the difference between predicted behavior and observed data is still low in the time period [0, 1] (bottom row). This illustrates that adding more variation and/or less regularity to the a priori distribution may increase the mismatch in the underlying flux functions without revealing it in the loss function behavior. This is due to the construction of the entropy solution, which only uses the convex hull of the involved f(c), as indicated by the example in Fig. 14.

5.2

What is the effect of using a smaller ensemble?

We have consistently used an ensemble of size N_ens = 100. We also explore the identification using the lower numbers N_ens = 50 and N_ens = 25. We use the Gaussian process with l = 20Δc and explore identification for the same case as in Sect. 5.1. The results are shown in Fig. 16.

5.3
How robust is the identification to larger noise in the observation data?
We explore how identification of the flux is affected when noise η in (4) is increased from 0.03 to 0.06, 0.09, and 0.15, respectively. We use the Gaussian process with l = 20 c and explore identification for the same case as in Sect. 5.1. Results are shown in Fig. 17.
We see that increasing noise in the observation data manifests itself in the updated parameter vector representing the flux through a wider bandwidth. However, the mean of the updated parameter vector gives the same accurate identification, see the third row. Finally, we also tested the sensitivity to different realizations of the a priori ensemble generated to represent f(c) for the case with l = 20Δc. We found that the identified flux deviated only slightly from one realization to another (results not shown).

Comparison with a standard Markov Chain Monte Carlo (MCMC) algorithm
It is instructive to compare EnKF with the standard MCMC algorithm. Following [30,42], we use the Metropolis-Hastings method to approximate the posterior flux. Let Φ(f; d) = ½‖Γ^{−1/2}(d − G(f))‖^2 denote the data misfit, with G the forward model, and define the acceptance probability a(f, g) = min{1, exp(Φ(f; d) − Φ(g; d))}. Herein, Γ = γ^2 I is the covariance matrix for the measurement error η ∼ N(0, γ^2 I) associated with the observations d in (4), where γ = 0.03 and I is the identity matrix. For the sake of comparison, we focus on the same case as explored in Sect. 5.1. In particular, we use the Gaussian process with l = 20Δc for the initial a priori distribution representing f, as well as for the random walk update step. The MCMC algorithm is then given by the following standard random walk steps: 1. Set k = 0 and select f^(0)(c_i) from the a priori distribution N(0, C), where C amounts to (11) with l = 20Δc and σ = 0.25. 2. Propose g^(k) = f^(k) + βξ^(k), ξ^(k) ∼ N(0, C), where C amounts to (11) with l = 20Δc and σ = 0.25. 3. Compute a(f^(k), g^(k)) and draw a number i^(k) in [0, 1] from a uniform distribution. 4. If i^(k) ≤ a(f^(k), g^(k)), accept the proposal and set f^(k+1) = g^(k); otherwise, set f^(k+1) = f^(k). Increase k by one and return to step 2.
The algorithm has three scalar hyperparameters (β, b, τ) which need to be specified [30,42]: first, the step size β, which controls the size of the move; second, the burn-in b, i.e., the number of samples which are discarded in order to minimize the contribution of the initial value f^(0); and third, the sample interval τ, which is the number of states discarded between two retained samples. For the simulation results shown in Fig. 18, we have used (β, b, τ) = (0.05, 400, 20). After the burn-in period, we generate a Markov chain of length 2000 and obtain an ensemble of 100 samples using the sample interval τ = 20, see Fig. 18a. The main finding is that MCMC gives a mean posterior flux (Fig. 18b) which is essentially of a similar quality as the one EnKF gave, see, for example, Fig. 16 (left column). In particular, the predicted data shown in panel 18c are captured well. The main difference is that MCMC has involved 2400 simulation steps, whereas EnKF has employed 600. We may expect the results to be somewhat sensitive to the choice of these parameters. That is, other choices of a priori distribution and/or covariance structure of ξ^(k) most likely could have some impact on the result. For example, a higher burn-in value b would ensure that a larger portion of the posterior samples lies closer to the true flux.
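The random walk steps above can be sketched in a self-contained toy setting. The linear forward map, the identity proposal covariance, and the hyperparameters below are illustrative assumptions, not the paper's displacement model; as in the text, the acceptance ratio is driven by the data misfit Φ:

```python
import math
import random

def misfit(G, f, d, gamma):
    """Phi(f; d) = 0.5 * ||d - G(f)||^2 / gamma^2, i.e., Gamma = gamma^2 I."""
    r = [di - gi for di, gi in zip(d, G(f))]
    return 0.5 * sum(x * x for x in r) / gamma**2

def accept_prob(phi_current, phi_proposal):
    return min(1.0, math.exp(phi_current - phi_proposal))

def random_walk_mh(G, d, f0, gamma, beta, n_steps, rng):
    chain, f = [], list(f0)
    phi = misfit(G, f, d, gamma)
    for _ in range(n_steps):
        # Propose g = f + beta * xi with xi ~ N(0, I) (identity covariance here).
        g = [fi + beta * rng.gauss(0.0, 1.0) for fi in f]
        phi_g = misfit(G, g, d, gamma)
        if rng.random() <= accept_prob(phi, phi_g):
            f, phi = g, phi_g
        chain.append(list(f))
    return chain

# Toy usage: recover a 3-vector from its image under a linear map.
A = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]
G = lambda f: [sum(a * x for a, x in zip(row, f)) for row in A]
f_true = [0.3, -0.2, 0.5]
d = G(f_true)  # noise-free synthetic data for simplicity
rng = random.Random(1)
chain = random_walk_mh(G, d, [0.0, 0.0, 0.0],
                       gamma=0.1, beta=0.05, n_steps=500, rng=rng)
```

Burn-in and thinning as described above amount to discarding the first b entries of `chain` and keeping every τ-th of the rest.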

Conclusion
In this work, we have explored a methodology that combines representing the unknown flux function as a Gaussian random process through a parameter vector and updating this parameter vector based on the ES-MDA method. The physical setting we consider is the displacement of one fluid by another in a vertical domain. The key parameter χ expresses the ratio of buoyant to viscous stresses. As indicated in Sect. 2, it is possible to derive a nonlinear conservation law of the form (8) with an explicit functional form f(c; χ) of the flux function. Depending on the parameter χ, different flow regimes are involved, ranging from stable to density-unstable displacement. Our aim has been to explore an ensemble-based approach as a means to extract the unknown nonlinear flux function from observation data. The main findings from the investigations are:

• By relying on a sparse set of time-dependent data extracted from a few fixed positions in space, combined with some care in the choice of the initial state, we can to a large extent provide an effective identification of the unknown nonlinear flux functions f(c; χ) for the different flow regimes corresponding to χ ∈ [−200, 300]. This includes highly non-convex flux functions.

• The use of Gaussian random fields to represent the unknown flux function f(c) for c ∈ [0, 1] adds regularity and information which is exploited in the ensemble-based learning process. That is, we have seen that a relatively effective identification of f(c) is possible despite the fact that the time series observation data only partly cover the domain [0, 1] on which f(c) is defined (due to the occurrence of discontinuities in the solutions).
We have found that this is a direct consequence of the inclusion of additional observation data related to temporal variations, as expressed by the second part (second line) of (25), and information about the front and back position of the mixing zone as expressed by (26).

• Further investigations showed that the regularity of the a priori ensemble distribution was quite essential for a good identification in the current study, where only a sparse amount of observation data was assumed available. The assumed smoothness of the a priori ensemble we applied is reasonable given the setup of the physical process under consideration. By including more observation data, one may expect higher robustness with respect to a priori regularity/smoothness. A more precise understanding of this topic seems to be an interesting direction for further research.

• The identification was robust with regard to the level of noise in the observation data and only slightly sensitive to different realizations of the a priori ensemble generated to represent f(c).

• Finally, comparison with a standard MCMC algorithm for computing the posterior flux confirmed the effectiveness of EnKF as a reliable method to identify the unknown flux function for the chosen set of observation data.
In conclusion, we have seen that the ensemble-based method explored in this work provides a practical tool for recovering an unknown nonlinear conservation law from point observations extracted over a time period. This setting is realistic in view of the fact that sensor data are commonly obtained in terms of time series. In this study, the choice of the observation data was motivated by the specific fluid mechanical process under consideration. More generally, there seem to be many unexplored aspects of the use of Bayesian methods to learn a nonlinear conservation law from observations, due to the entropy solution concept, which depends on the underlying flux function in a highly non-trivial manner.
Funding Open access funding provided by University of Stavanger & Stavanger University Hospital. The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Data Availability All data generated or analyzed during this study are included in this article.

Declarations
The two fluids are taken to be constant-viscosity fluids with a common viscosity μ. Further, the fluids are assumed to be incompressible, with constant mass densities ρ_1 and ρ_2 for the displacing and displaced fluid, respectively. In what follows, a subscript i = 1 refers to the displacing fluid and i = 2 to the displaced fluid, in accordance with the above and Fig. 1.
A lubrication scaling equal to that of Lajeunesse et al. [43] is next invoked, where it is assumed that the flow is essentially unidirectional and parallel to the walls of the geometry, and that the gap width D is small compared to a characteristic axial length scale of the duct, L* [44]. Defining δ = D/L* as the aspect ratio of the channel, the above assumption implies that δ ≪ 1. The problem is non-dimensionalized by taking L* and D as length scales for the x and y coordinates, respectively, and using U* as a characteristic scale for the velocity component in the vertical direction. Then, t* = L*/U* is a natural time scale for the displacement process. As per Lajeunesse et al., we focus on displacements that occur at sufficiently high flow rates for diffusion to be negligible, i.e., it is assumed that the Péclet number satisfies Pe = δU*D/D_m ≫ 1, with D_m the molecular diffusion coefficient. Further, τ* = μU*/D is used as the characteristic scale for viscous stresses. The pressure is non-dimensionalized by the viscous pressure scale P* = τ*/δ [44]. Taking the fluids to be incompressible and denoting by u_x and u_y the axial and transverse dimensionless velocity components, one can show that the zero-order governing equations reduce to those of lubrication theory. Finally, we point out that a generalization to displacements involving pairs of yield stress and shear thinning fluids was presented by Zare et al. [46]. In Fig. 19, the impact of varying the sign and magnitude of χ on the flux function and the characteristic interface profile is illustrated.
For iso-dense fluids, χ = 0, and this limit is equivalent to that of single-phase flow, since both density and viscosity are now equal for the two fluids. The flux function has f″(c) < 0 for all c, resulting in a rarefaction wave that is advected according to the Newtonian velocity profile in a plane channel. Density-stable displacements were considered by Lajeunesse et al. [43] and are illustrated in the top row of Fig. 19 for the case χ = −100. The flux function now contains a convex region for intermediate values of c, and this manifests itself as a jump discontinuity (shock) in the profile c, seen in the top right panel of Fig. 19. For a positive nonzero value of χ, the flux function is convex at the lower end of the c axis and, for increasing values of χ, eventually develops a second convex region at the opposite end of the c axis. The resulting interface profile therefore consists of either one or two jump discontinuities, as shown in the right panels of Fig. 19. For χ ≈ 117.8, the second jump discontinuity becomes stationary, due to the exact balance of buoyant and viscous stresses. At larger values of χ, buoyant stresses are sufficiently large to overcome the viscous drag of the displacing fluid, resulting in reverse flow of the lighter fluid. This is illustrated for χ = 150 in Fig. 19.

Fig. 19 Flux functions and corresponding concentration profiles for iso-viscous displacements and different density differences between the fluids. The density-unstable configuration results in a discontinuity in c for low values of the concentration. This corresponds to fast propagation of displacing fluid through the center of the channel. At larger density differences, a second discontinuity emerges at high values of c. For χ > 117.8, the second discontinuity reverses direction, resulting in back-flow of the less dense fluid.