1 Introduction

Nonlinear Partial Differential Equations (PDEs) depending on high- or even infinite-dimensional parametric inputs are ubiquitous in applied physics and engineering and appear in the context of several problems such as model calibration and validation or model-based design/optimization/control. In all these cases, they must be solved repeatedly for different values of the input parameters, which poses an often insurmountable obstacle as each of these simulations can incur a significant computational cost. An obvious way to overcome these difficulties is to develop less expensive but accurate surrogates which can be used on their own or in combination with a reduced number of runs of the high-fidelity, expensive, reference solver. The construction of such surrogates has been based on physical/mathematical considerations or on data, i.e. input-output pairs (and sometimes derivatives). Our contribution belongs to the latter category of data-driven surrogates, which has attracted a lot of attention in recent years due to the significant progress in the fields of statistical or machine learning [18, 24]. We emphasize, however, that unlike typical supervised learning problems in data science, the context of computational physics exhibits several distinguishing features. Firstly, surrogate construction is by definition a Small (or smallest possible) Data problem: the reason we want a surrogate in the first place is to avoid using the reference solver, which is the one that generates the training data. Secondly, pertinent problems are rich in domain knowledge which should be incorporated as much as possible, not only in order to reduce the requisite training data but also to achieve higher predictive accuracy, particularly in out-of-distribution settings. In the context of Bayesian inverse problems, which we investigate in this paper, one does not know a priori where the posterior might be concentrated in the parametric space and cannot guarantee that all such regions will be sufficiently represented in the training dataset. Nevertheless, the surrogate learned must be accurate enough in these regions in order to resolve the sought posterior.

Data-driven surrogates which are trained in an offline phase and are subsequently used for various downstream tasks have attracted a lot of attention in recent years [5]. Most of these surrogates are constructed by learning a non-linear operator, e.g. a mapping between function spaces and thus between the inputs and the outputs of the PDE, which may depend on additional input parameters. A notable such strategy based on Deep Learning is offered by Physics-Informed Neural Networks (PINNs) [25, 33]. An alternative is offered by Deep Operator Networks (DeepONets, [31, 36]), which, in contrast to PINNs, not only take the spatial and temporal location as an input but can also account for the dependence of the PDE solution on input parameters such as the viscosity in the Navier-Stokes equations. Furthermore, Fourier Neural Operators [28] have shown promising results by parametrizing the integral kernel directly in Fourier space and thus restricting the operator to a convolution. Finally, the Learning Operators with Coupled Attention (LOCA) framework [20] builds upon the well-known attention mechanism that has already shown promising results in natural language processing.

We note that all of the Deep Learning frameworks mentioned satisfy universal approximation theorems and, under certain conditions, can approximate the non-linear operator to arbitrary accuracy. Another option is offered by the Optimizing a Discrete Loss (ODIL, [17]) framework. It does not rely on Deep Learning and was shown to be faster than PINNs due to the reduced number of tunable parameters, but it can only approximate the solution on a discrete grid.

Apart from the aforementioned techniques and for time-dependent PDEs in particular, the solution can be approximated by methods based on Koopman-operator theory [22] which identifies a transformation of the original system that gives rise to linear dynamics [21]. Nevertheless, these methods [8, 12, 27] usually require a large set of reduced-order coordinates or an effective encoder/decoder structure. Especially for physical systems, the restricted dynamics can be endowed with stability and physical, inductive bias [14,15,16].

A common limitation of the aforementioned architectures is that they usually learn only the forward operator whereas for the solution of an inverse problem, its inverse would be more useful. In this work, we extend the DeepONet framework by replacing parts of the previously proposed neural-network architecture with an invertible one. To the authors’ best knowledge, we are thus presenting the first invertible Neural Operator framework. This allows one to perform both forward and inverse passes with the same neural network and the forward and inverse operators can be learned simultaneously. In particular, we make use of the RealNVP architecture [10] which has an analytical inverse.

Furthermore, we make use of both labeled and unlabeled (i.e. only inputs and residuals) training data in a physics-aware, semi-supervised approach. While the use of labeled training data is straightforward, unlabeled training data are incorporated by using the governing equations and minimizing the associated residuals, similarly to the physics-informed DeepONet [36]. Since it is easier and less expensive to procure unlabeled data in comparison to labeled ones, this leads to significant efficiency gains. Even though our algorithm can produce accurate predictions without any labeled training data and by using only a physics-informed loss, we observe empirically that the addition of labeled training data generally improves the results.

Finally, we show that the proposed invertible DeepONet can be used to very efficiently solve Bayesian inverse problems, i.e. to approximate the whole posterior distribution, without any need for multiple likelihood evaluations and cumbersome iterations as required by alternative inference schemes such as Markov Chain Monte Carlo (MCMC, [4]) or Sequential Monte Carlo (SMC, [23]) or Stochastic Variational Inference (SVI, [9]). In particular, we propose a novel approximation that employs a mixture of Gaussians, the parameters of which are computed semi-analytically. When the proposed Neural Operator framework is trained solely on unlabeled data, this means that we can obtain the solution to the (forward and) inverse problem without ever solving the underlying PDE. While Deep Learning has been successfully applied to inverse problems before [1, 2, 32], our work differs by making use of a fully invertible, operator-learning architecture which leads to a highly efficient approximation of the whole posterior.

The rest of the paper is structured as follows. In Sect. 2 we review the basic elements of invertible neural networks (NNs) and DeepONets and subsequently illustrate how these can be combined and trained with labeled and unlabeled data. Furthermore we present how the resulting invertible DeepONet can be employed in order to approximate the posterior of a model-based, Bayesian inverse problem at minimal additional cost. We illustrate several features of the proposed methodology and assess its performance in Sect. 3 where it is applied to a reaction-diffusion PDE and a Darcy-diffusion problem. The cost and accuracy of the posterior approximation in the context of pertinent Bayesian inverse problems are demonstrated in Sect. 3.4. Finally, we conclude in Sect. 4 with a summary of the main findings and a discussion on the (dis)advantages of the proposed architecture and potential avenues for improvements.

2 Methodology

We first review some basic concepts of invertible neural networks and DeepONets. We subsequently present our novel contributions which consist of an invertible DeepONet architecture and its use for solving efficiently Bayesian inverse problems.

2.1 Invertible neural networks

Neural Networks are in general not invertible, which restricts their application in problems requiring inverse operations. Invertibility can be achieved by adding a momentum term [34], restricting the Lipschitz constant of each layer to be smaller than one [3] or using special building blocks [10]. These formulations have primarily been developed for flow-based architectures, but we apply them to operator learning in this work. In particular, we make use of the RealNVP [10] as this architecture enables an analytical inverse which ensures efficient computations. Each RealNVP building block consists of the transformation below, which includes two neural networks denoted by k(.) and r(.). Given a D-dimensional input \(\varvec{x}=\{x_i\}_{i=1}^D\) of an invertible layer, the output \(\varvec{y}=\{y_i\}_{i=1}^D\) is obtained as follows:

$$\begin{aligned}{} & {} y_{1:d}=x_{1:d} \end{aligned}$$
(1)
$$\begin{aligned}{} & {} y_{d+1:D}=x_{d+1:D} \circ \exp (k(x_{1:d})) + r(x_{1:d}), \end{aligned}$$
(2)

where \(d < D\). Here, \(\circ \) is the Hadamard or element-wise product and d is usually chosen to be half of the dimension of the input vector i.e. \(d=D/2\).

As only d of the components are updated, the input entries after each building block are permuted, e.g. by reversing the vector, to ensure that after a second building block all of them are modified. Therefore, for \(d=D/2\), at least two building blocks are needed in order to modify all entries. We note that the dimension of the input cannot change and needs to be identical to the dimension of the output. The two neural networks involved can consist of arbitrary layers as long as their output and input dimensions are consistent with Eq. (2).

The maps defined above can be easily inverted, which leads to the following equations:

$$\begin{aligned}{} & {} x_{1:d}=y_{1:d} \end{aligned}$$
(3)
$$\begin{aligned}{} & {} x_{d+1:D}=(y_{d+1:D}-r(x_{1:d})) \circ \exp (-k(x_{1:d})) \end{aligned}$$
(4)

We note that due to this structure, the Jacobian is lower-triangular and its determinant can be obtained by multiplying the diagonal entries only.
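As a concrete illustration, the sketch below implements one such coupling block in JAX (the library used later for training). It is a minimal, illustrative implementation, not the authors' code: the inner networks k(.) and r(.) are stand-in two-layer MLPs whose sizes, initializations and parameter containers are arbitrary assumptions.

```python
# Minimal sketch of one RealNVP coupling block, Eqs. (1)-(4).
import jax.numpy as jnp
from jax import random

def init_mlp(key, d_in, d_hidden, d_out):
    k1, k2 = random.split(key)
    return {"W1": random.normal(k1, (d_in, d_hidden)) * 0.1, "b1": jnp.zeros(d_hidden),
            "W2": random.normal(k2, (d_hidden, d_out)) * 0.1, "b2": jnp.zeros(d_out)}

def mlp(params, x):
    h = jnp.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def coupling_forward(params, x, d):
    """y_{1:d} = x_{1:d};  y_{d+1:D} = x_{d+1:D} * exp(k(x_{1:d})) + r(x_{1:d})."""
    x1, x2 = x[:d], x[d:]
    y2 = x2 * jnp.exp(mlp(params["k"], x1)) + mlp(params["r"], x1)
    # log|det J| of the forward map: sum of the scale outputs (triangular Jacobian)
    log_det = jnp.sum(mlp(params["k"], x1))
    return jnp.concatenate([x1, y2]), log_det

def coupling_inverse(params, y, d):
    """Analytical inverse, Eqs. (3)-(4)."""
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - mlp(params["r"], y1)) * jnp.exp(-mlp(params["k"], y1))
    return jnp.concatenate([y1, x2])

# Example: D = 8, d = D/2; two blocks with a flip in between would modify all entries.
key = random.PRNGKey(0)
D, d = 8, 4
params = {"k": init_mlp(random.fold_in(key, 1), d, 32, D - d),
          "r": init_mlp(random.fold_in(key, 2), d, 32, D - d)}
x = random.normal(key, (D,))
y, _ = coupling_forward(params, x, d)
assert jnp.allclose(coupling_inverse(params, y, d), x, atol=1e-5)
```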

Fig. 1: (Left) Classical DeepONet [30] and (Right) proposed Invertible DeepONet architecture

2.2 DeepONets

Before presenting our novel architecture for invertible DeepONets, we briefly review the original DeepONet formulation of [31]. DeepONets have been developed to solve parametric PDEs and significantly extend the Physics-Informed Neural Network (PINN, [33]) framework, as no additional training phase is required if the input parameters of the PDE are changed. We consider a potentially nonlinear and time-dependent PDE with an input function \(u \in \mathcal {U}\) and solution function \(s \in \mathcal {S}\), where \(\mathcal {U,S} \) are appropriate Banach spaces. The former can represent, e.g., source terms, boundary or initial conditions, or material properties. Let:

$$\begin{aligned} \mathcal {N}(u,~s)(\varvec{\xi })=0 \end{aligned}$$
(5)

denote the governing PDE where \(\mathcal {N}:\mathcal {U}\times \mathcal {S} \rightarrow \mathcal {V}\) is an appropriate differential operator and \(\varvec{\xi }\) the spatio-temporal coordinates. Furthermore, let:

$$\begin{aligned} \mathcal {B}(u,~s)(\varvec{\xi })=0 \end{aligned}$$
(6)

denote the operator \(\mathcal {B}:\mathcal {U}\times \mathcal {S} \rightarrow \mathcal {V}\) associated with the boundary or initial conditions. Assuming that the solution s for each \(u \in \mathcal {U}\) is unique, we denote with \(\mathcal {G}: \mathcal {U} \rightarrow \mathcal {S}\) the solution operator that maps any input u to the corresponding solution s. The goal of DeepONets is to approximate it with an operator \(G_{\varvec{\theta }}\) that depends on tunable parameters \(\varvec{\theta }\). The latter yields an approximation to the actual solution at any spatio-temporal point \(\varvec{\xi }\), which we denote by \(G_{\varvec{\theta }}(u)(\varvec{\xi })\). It is based on a separated representation [31]Footnote 1:

$$\begin{aligned} G_{\varvec{\theta }}(u)(\varvec{\xi })=\sum _{j=1}^Q b_{j}\left( ~\underbrace{u(\varvec{\eta }_1),...,u(\varvec{\eta }_F)}_{\varvec{u}} \right) ~t_{j}(\varvec{\xi }) \end{aligned}$$
(7)

and consists of the so-called branch network whose terms \(b_{j}\) depend on the values of the input function u at F fixed spatio-temporal locationsFootnote 2\(\{\varvec{\eta }_l\}_{l=1}^F\) which we summarily denote with the vector \(\varvec{u} \in \mathbb {R}^F\), and the so-called trunk network whose terms \(t_{j}\) depend on the spatio-temporal coordinates \(\varvec{\xi }\) (see Fig. 1b). Both networks have trainable weight and bias parameters which we denote collectively by \(\varvec{\theta }\). We emphasize that, once trained, the DeepONet can provide predictions of the solution at any spatio-temporal location \(\varvec{\xi }\), a feature that is very convenient in the context of inverse problems as the same DeepONet can be used for solving problems with different sets of observations.

We note that in the next section, we will use a vectorized formulation of Eq. (7) and process various spatio-temporal coordinate datapoints together as this is needed to ensure invertibility of the DeepONet.
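To make the structure of Eq. (7) concrete, the following sketch evaluates the separated representation for a single query point and, in vectorized form, for a batch of query points. The branch and trunk networks are left abstract (any callables with consistent output dimension Q), which is an illustrative simplification, not the authors' implementation.

```python
import jax.numpy as jnp
from jax import vmap

def deeponet_apply(branch_fn, trunk_fn, u_vec, xi):
    """G_theta(u)(xi) for one query point; u_vec holds u at the F sensor locations."""
    b = branch_fn(u_vec)            # (Q,) branch coefficients b_j(u)
    t = trunk_fn(xi)                # (Q,) trunk basis values t_j(xi)
    return jnp.dot(b, t)            # scalar prediction, Eq. (7)

def deeponet_batch(branch_fn, trunk_fn, u_vec, Xi):
    """Vectorized evaluation at K query points Xi, as used in Sect. 2.3."""
    Y = vmap(trunk_fn)(Xi)          # (K, Q) matrix of trunk outputs
    return Y @ branch_fn(u_vec)     # (K,) predicted solution values
```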

Labeled data, which consist of pairs of u and corresponding solutions \({s}=\mathcal {G}({u})\) evaluated at certain spatio-temporal locations, can be used for training. Unlabeled training data (i.e. only inputs) can also be employed in a physics-informed approach as introduced in [36], by including the governing PDE of Eq. (5) in an additional loss term as discussed in Sect. 2.4.

2.3 Invertible DeepONets

The invertible RealNVP introduced in Sect. 2.1 is employed exclusively on the branch network i.e. we assume that:

$$\begin{aligned} D=F=Q \end{aligned}$$
(8)

and the input \(\varvec{x}\) of Sect. 2.1 is the vector \(\varvec{u} \in \mathbb {R}^F\) containing the values of the PDE-input at \(D=F\) spatio-temporal locations whereas the output \(\varvec{y}\) of Sect. 2.1 is now the \(D=Q\) values of the branch net \(\varvec{b}=[b_1,\ldots ,b_Q]^T \in \mathbb {R}^D\). We note that this restriction regarding the equality of the dimension of the input \(\varvec{u}\) and the output of the branch network \(\varvec{b}\) is due to the use of an invertible architecture. As a consequence, the dimension of the trunk-network output i.e. \(\{t_j(\varvec{\xi })\}_{j=1}^Q\) is also the same as the dimension of \(\varvec{u}\). This requirement does not reduce the generality of the methodology advocated as Q is a free parameter in the definition of the operator \(G_{\varvec{\theta }}\) in Eq. (7).

In view of the inverse problems we would like to address, we consider K spatio-temporal locations, \(\{ \varvec{\xi }_k \}_{k=1}^K\) and we denote with \(\varvec{s} \in \mathbb {R}^K\) the vector containing the PDE-solution’s values at these locations i.e. \(\varvec{s}=\left[ s(\varvec{\xi }_1), \ldots , s(\varvec{\xi }_K)\right] ^T\). Finally we denote with \(\varvec{Y}\) the \(K \times D\) matrix constructed by the values of the trunk network outputs at the aforementioned locations, i.e.:

$$\begin{aligned} {\varvec{Y}}=\begin{bmatrix} t_1(\varvec{\xi }_1) & \ldots & t_D(\varvec{\xi }_1) \\ \vdots & \ddots & \vdots \\ t_1(\varvec{\xi }_K) & \ldots & t_D(\varvec{\xi }_K) \end{bmatrix}. \end{aligned}$$
(9)

As a result of Eq. (7), we can write that:

$$\begin{aligned} {\varvec{s}}={\varvec{Y}} {\varvec{b}} \end{aligned}$$
(10)

As the matrix \(\varvec{Y}\) is in general non-invertible, one can determine \(\varvec{b}\) given \(\varvec{s}\) by solving a least-squares problem, i.e.:

$$\begin{aligned} \min _{\varvec{b}} ~\Vert \varvec{s} -\varvec{Y} \varvec{b} \Vert ^2_2 \end{aligned}$$
(11)

or a better-behaved, regularized version thereof:

$$\begin{aligned} \min _{\varvec{b}} ~\Vert \varvec{s} -\varvec{Y} \varvec{b} \Vert ^2_2 + \epsilon \Vert \varvec{b} \Vert ^2_2 \end{aligned}$$
(12)

where a small value is generally sufficient for the regularization parameter \(\epsilon \ll 1\). We note that given \(\varvec{s}\) and once \(\varvec{b}\) has been determined by solving Eq. (11) or Eq. (12), we can make use of the invertibility of the branch net in order to obtain the input vector \(\varvec{u}\). While other approaches are possible in order to determine \(\varvec{b}\), we recommend using the regularized, least-squares formulation, as this led to robust results in our experiments. It is nevertheless important to use the same method during training and when deterministic predictions are sought, since different methods can lead to different \(\varvec{b}\)’s for the same \(\varvec{s}\). We note that in the proposed method for the solution of Bayesian inverse problems (see Sect. 2.5), no use of Eq. (12) is made except for the training of the DeepONet (see Sect. 2.4).

For the ensuing equations we denote the forward map implied by equation (10) as:

$$\begin{aligned} \varvec{s}=\mathcal {F}_{\varvec{\theta }}({\varvec{u}},{\varvec{Y}}) \end{aligned}$$
(13)

and the inverse obtained by the two steps described above as:

$$\begin{aligned} \varvec{u}=\mathcal {I}_{\varvec{\theta }}({\varvec{s}},{\varvec{Y}}) \end{aligned}$$
(14)

where we explicitly account for the NN parameters \(\varvec{\theta }\).
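The two-step inverse map can be summarized in a short sketch: recover \(\varvec{b}\) from \(\varvec{s}\) by the ridge-regularized least-squares problem of Eq. (12) and then apply the analytical RealNVP inverse. The names `branch_fn` and `branch_inverse` are placeholders for the trained branch network and its inverse; this is an illustration of the procedure under those assumptions, not the authors' implementation.

```python
import jax.numpy as jnp

def solve_branch_coefficients(Y, s, eps=1e-6):
    """Ridge solution of min_b ||s - Y b||^2 + eps ||b||^2, Eq. (12)."""
    D = Y.shape[1]
    A = Y.T @ Y + eps * jnp.eye(D)
    return jnp.linalg.solve(A, Y.T @ s)

def forward_map(branch_fn, Y, u_vec):
    """F_theta(u, Y) = Y b_NN(u), Eqs. (10) and (13)."""
    return Y @ branch_fn(u_vec)

def inverse_map(branch_inverse, Y, s, eps=1e-6):
    """I_theta(s, Y), Eq. (14): least squares for b, then the analytical branch inverse."""
    b = solve_branch_coefficients(Y, s, eps)
    return branch_inverse(b)
```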

2.4 A semi-supervised approach for invertible DeepONets

As mentioned earlier, in order to train the proposed invertible DeepONet, i.e. to find the optimal values for the parameters \(\varvec{\theta }\), we employ both labeled (i.e. pairs of PDE-inputs u and PDE-outputs s) and unlabeled data (i.e. only PDE-inputs u) in combination with the governing equations. The loss function L employed is therefore decomposed into two parts asFootnote 3:

$$\begin{aligned} L = L_{labeled} + L_{unlabeled} \end{aligned}$$
(15)

The first term \(L_{labeled}\) pertains to the labeled data and is further decomposed as:

$$\begin{aligned} L_{labeled}=L_{l,forward}+L_{l,inverse} \end{aligned}$$
(16)

Without loss of generality and in order to keep the notation as simple as possible we assume that \(N_l\) pairs of labeled data are available, each of which consists of the values of the PDE-input u at D locations which we denote with \(\varvec{u}^{(i)} \in \mathbb {R}^D, ~i=1, \ldots N_l\) and the values of the PDE-output at K spatio-temporal locations which we denote with \(\varvec{s}^{(i)} \in \mathbb {R}^K,~i=1, \ldots N_l\). If the \(K\times D\) matrix \(\varvec{Y}\) is defined as in Eq. (9) and in view of the forward (Eq. (13)) and inverse (Eq. (14)) maps defined earlier, we write:

$$\begin{aligned} L_{l,forward}=\frac{1}{N_l}\sum _{i=1}^{N_l} \Vert \varvec{s}^{(i)}-\mathcal {F}_{\varvec{\theta }}(\varvec{u}^{(i)},\varvec{Y}) \Vert ^2_2 \end{aligned}$$
(17)

and:

$$\begin{aligned} L_{l,inverse}=\frac{1}{N_l} \sum _{i=1}^{N_l} \Vert \varvec{u}^{(i)}-\mathcal {I}_{\varvec{\theta }}({\varvec{s}}^{(i)},{\varvec{Y}} ) \Vert ^2_2. \end{aligned}$$
(18)

By employing both loss terms, the NN parameters \(\varvec{\theta }\) can balance the accuracy of the approximation in both maps.

Furthermore and assuming \(N_u\) PDE-inputs are available each of which is evaluated at D spatio-temporal points \(\{ \varvec{\xi }^{(l)} \}_{l=1}^D\) with \(\varvec{u}^{(i)} \in \mathbb {R}^D\) denoting these values, we express the \(L_{unlabeled}\) loss term as:

$$\begin{aligned} L_{unlabeled}=L_{BC}+L_{res}+L_{u,inverse}. \end{aligned}$$
(19)

The first term \(L_{BC}\) and the second term \(L_{res}\) are physics-informed [36] and account for the residuals in the boundary (and/or initial) conditions and the governing PDE respectively. In the case of \(L_{BC}\) we select \(N_B\) (uniformly distributed) points along the boundary, say \(\varvec{\xi }_B^{(l)}, l=1,\ldots , N_B\). Then, in view of Eq. (6), we employ:

$$\begin{aligned} L_{BC}=\frac{1}{N_u N_B}\sum _{i=1}^{N_u} \sum _{l=1}^{N_B} \Vert \mathcal {B} ( u^{(i)}, {G}_{\varvec{\theta }}(u^{(i)}) )(\varvec{\xi }_B^{(l)} ) \Vert ^2_2 \end{aligned}$$
(20)

In the interior of the problem domain and in view of Eq. (5), we employ a loss:

$$\begin{aligned} L_{res}= \frac{1}{N_u~N_{res}} \sum _{i=1}^{N_u} \sum _{l=1}^{N_{res}} \Vert \mathcal {N}( u^{(i)}, {G}_{\varvec{\theta }}({u}^{(i)}) )(\varvec{\xi }^{(l)} ) \Vert ^2_2 \end{aligned}$$
(21)

which involves \(N_{res}\) collocation points.

The third term \(L_{u,inverse}\) pertains to the forward and inverse maps in Eqs. (13), (14) and can be expressed as:

$$\begin{aligned} L_{u,inverse}= \frac{1}{N_u} \sum _{i=1}^{N_u} \Vert \varvec{u}^{(i)}-\mathcal {I}_{\varvec{\theta }}(\mathcal {F}_{\varvec{\theta }}(\varvec{u}^{(i)} ,{\varvec{Y}} ),\varvec{Y} ) \Vert ^2_2 \end{aligned}$$
(22)

where the matrix \(\varvec{Y}\) is defined as in Eq. (9).

The minimization of the combined loss L, with respect to the NN parameters \(\varvec{\theta }\) of the branch and trunk network, is performed with stochastic gradient descent and the ADAM [19] scheme in particular. Gradients of the loss were computed using the automatic differentiation tools of the JAX library [7]. We finally note that the D spatio-temporal locations need not be the same, nor do they need to be equal in number, in all data instances as assumed in the equations above. In such cases the vector of the observables and the matrices \(\varvec{Y}\) involved would differ, which would further complicate the notation, but the same DeepONet parameters \(\varvec{\theta }\) would appear in all terms.
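The sketch below illustrates how the individual terms can be assembled into the combined loss and minimized with ADAM. It relies on assumed helpers: parameter-explicit variants `forward_map(theta, Y, u)` and `inverse_map(theta, Y, s)` of the maps in Eqs. (13)-(14) (cf. the sketch after Eq. (14)) and a residual function `pde_residual(theta, u, xi)` returning the PDE residual of Eq. (5). The boundary term \(L_{BC}\) is omitted for brevity, and optax is used for ADAM as one possible implementation choice, not necessarily the authors'.

```python
import jax
import jax.numpy as jnp
import optax  # off-the-shelf ADAM implementation (an assumption, not stated in the paper)

def total_loss(theta, labeled, unlabeled, Y):
    u_l, s_l = labeled                 # labeled pairs: (N_l, D), (N_l, K)
    u_u, xi_res = unlabeled            # unlabeled inputs (N_u, D) and collocation points

    # Labeled terms, Eqs. (17)-(18)
    s_pred = jax.vmap(lambda u: forward_map(theta, Y, u))(u_l)
    u_pred = jax.vmap(lambda s: inverse_map(theta, Y, s))(s_l)
    L_lab = (jnp.mean(jnp.sum((s_l - s_pred) ** 2, axis=-1))
             + jnp.mean(jnp.sum((u_l - u_pred) ** 2, axis=-1)))

    # Unlabeled terms: PDE residual, Eq. (21), and forward-inverse consistency, Eq. (22)
    res = jax.vmap(lambda u: jax.vmap(lambda xi: pde_residual(theta, u, xi))(xi_res))(u_u)
    L_res = jnp.mean(res ** 2)
    u_cyc = jax.vmap(lambda u: inverse_map(theta, Y, forward_map(theta, Y, u)))(u_u)
    L_cyc = jnp.mean(jnp.sum((u_u - u_cyc) ** 2, axis=-1))

    return L_lab + L_res + L_cyc       # L_BC of Eq. (20) omitted for brevity

optimizer = optax.adam(1e-3)           # opt_state = optimizer.init(theta)

@jax.jit
def train_step(theta, opt_state, labeled, unlabeled, Y):
    grads = jax.grad(total_loss)(theta, labeled, unlabeled, Y)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(theta, updates), opt_state
```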

2.5 Invertible DeepONets for Bayesian inverse problems

In this section we present how the invertible DeepONets, proposed and trained as previously discussed, can be used to efficiently approximate the solution of a Bayesian inverse problem in the presence of potentially noisy observations as well as prior uncertainty about the unknowns. A central role is played by the readily available invertible map which the RealNVP architecture affords. In particular, let \(\hat{\varvec{s}} \in \mathbb {R}^K\) denote a vector of noisy observations of the PDE-solution at certain K spatio-temporal locations. These are assumed to be related to the PDE-solution’s values at these locations, denoted summarily by \(\varvec{s} \in \mathbb {R}^K\), as follows:

$$\begin{aligned} \hat{\varvec{s}}=\varvec{s}+\sigma ~\varvec{\eta }, \qquad \varvec{\eta }\sim \mathcal {N}(\varvec{0,I}). \end{aligned}$$
(23)

where \(\sigma ^2\) is the variance of the observational noise. This in turn defines a conditional density (likelihood) \(p(\hat{\varvec{s}} \mid \varvec{s})\):

$$\begin{aligned} p(\hat{{\varvec{s}}} \mid \varvec{s})= \mathcal {N}(\hat{\varvec{s}} \mid {\varvec{s}}, \sigma ^2\varvec{I}). \end{aligned}$$
(24)

In the context of a Bayesian formulation and given the implicit dependence of the PDE-output \(\varvec{s}\) on \(\varvec{u}\), the likelihood would be combined with a prior density \(p_u(\varvec{u})\) on the PDE-inputs in order to define the sought posterior:

$$\begin{aligned} p(\varvec{u} \mid \hat{\varvec{s}} ) \propto p(\hat{\varvec{s}} \mid \varvec{s})~p_u(\varvec{u}). \end{aligned}$$

Even if the trained DeepONet were used to infer \(p(\varvec{u} \mid \hat{\varvec{s}} )\) (e.g. using MCMC), several evaluations would be needed, especially if the dimension of \(\varvec{u}\) were high. In the sequel we demonstrate how one can take advantage of the invertible NN architecture in order to obtain a semi-analytic approximation of the posterior in the form of a mixture of Gaussians, avoiding iterative algorithms like MCMC altogether.

We note first that by combining the likelihood with Eq. (10), we can write it in terms of the \(D-\)dimensional, branch-network output vector \(\varvec{b}\) as:

$$\begin{aligned} p(\hat{{\varvec{s}}} \mid \varvec{b})= \mathcal {N}(\hat{\varvec{s}} \mid \varvec{Y~b}, \sigma ^2\varvec{I}). \end{aligned}$$
(25)

Since \(\varvec{u} \in \mathbb {R}^D\) is related to \(\varvec{b}\) through the invertible RealNVP \(\varvec{b}_{NN}: \mathbb {R}^D \rightarrow \mathbb {R}^D\), we can also obtain a prior density \(p_b(\varvec{b})\) on \(\varvec{b}\) as:

$$\begin{aligned} p_b(\varvec{b})= p_u(\varvec{b}_{NN}^{-1}(\varvec{b}))~J(\varvec{b}) \end{aligned}$$
(26)

where \(\varvec{b}_{NN}^{-1}\) denotes the inverse and \(J(\varvec{b}) =|\frac{\partial \varvec{b}_{NN}^{-1}}{\partial \varvec{b}} |\) is the determinant of its Jacobian. The latter, as mentioned in Sect. 2.1, is a triangular matrix and its determinant can be readily computed at a cost \(\mathcal {O}(D)\).

We choose not to operate directly with the prior \(p_b(\varvec{b})\), but construct an approximation \(p_{b,G}(\varvec{b})\) to it in the form of a mixture of \(D-\)dimensional Gaussians, as this facilitates the subsequent steps in finding the posterior. In particular:

$$\begin{aligned} p_{b,G}(\varvec{b})=\sum _{m=1}^M w_m~\mathcal {N}(\varvec{b} \mid \varvec{m}_{b,m}, \varvec{S}_{b,m}) \end{aligned}$$
(27)

where M denotes the number of mixture components and \(\varvec{m}_{b,m}\), \(\varvec{S}_{b,m}\) the mean vector and covariance matrix of the \(m^{th}\) component respectively. Such an approximation can be readily computed, e.g. using Variational Inference [35], without any forward or inverse model evaluations, by exploiting the fact that samples from \(p_b\) can be readily drawn using ancestral sampling, i.e. by drawing samples of \(\varvec{u}\) from \(p_u\) and propagating those with \(\varvec{b}_{NN}\). We note that finding this representation can become more difficult when M is large, but the complexity of the algorithms involved in general scales linearly with M [6].
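A possible realization of this step is sketched below: prior samples of \(\varvec{u}\) are pushed through the trained branch network and an M-component Gaussian mixture is fitted to the resulting \(\varvec{b}\)-samples. The paper refers to Variational Inference [35]; the off-the-shelf EM fit used here, as well as the helper names `sample_prior_u` and `branch_net`, are illustrative assumptions.

```python
import jax
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_prior_mixture(key, branch_net, sample_prior_u, n_samples=10000, M=2):
    """Fit the M-component mixture p_{b,G}(b) of Eq. (27) to pushed-forward prior samples."""
    u_samples = sample_prior_u(key, n_samples)        # (n_samples, D) draws from p_u
    b_samples = jax.vmap(branch_net)(u_samples)       # ancestral sampling: b = b_NN(u)
    gmm = GaussianMixture(n_components=M, covariance_type="full").fit(np.asarray(b_samples))
    # weights w_m, means m_{b,m} and covariances S_{b,m} of Eq. (27)
    return gmm.weights_, gmm.means_, gmm.covariances_
```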

By combining the (approximate prior) \(p_{b,G}(\varvec{b})\) above with the Gaussian likelihood \(p(\varvec{\hat{s}} \mid \varvec{b})\) of Eq. (25) we obtain an expression for the posterior \({\tilde{p}} (\varvec{b} \mid \varvec{\hat{s}} )\) using Bayes’ theorem:

$$\begin{aligned} {\tilde{p}} (\varvec{b} \mid \varvec{\hat{s}} ) \propto p(\varvec{\hat{s}} \mid \varvec{b}) p_{b,G}(\varvec{b}) \end{aligned}$$
(28)

Due to the conjugacy of prior and likelihood, we can directly conclude that the (approximate) posterior is also a mixture of Gaussians [6]. Therefore, using expressions for the aforementioned likelihood/prior pair, we obtain a closed-form posterior \({\tilde{p}} (\varvec{b} \mid \varvec{\hat{s}} )\) on \(\varvec{b}\) of the form:

$$\begin{aligned} {\tilde{p}}(\varvec{b}\mid \varvec{\hat{s}})= \sum _{m=1}^M {\tilde{w}}_m ~\mathcal {N}(\varvec{b} \mid \varvec{\mu }_{b,m}, \varvec{C}_{b,m}) \end{aligned}$$
(29)

where the mean \(\varvec{\mu }_{b,m}\) and covariance \(\varvec{C}_{b,m}\) of each mixture component can be computed as:

$$\begin{aligned} \begin{array}{l} \varvec{C}_{b,m}^{-1}=\sigma ^{-2} \varvec{Y}^T\varvec{Y}+\varvec{S}_{b,m}^{-1} \\ \varvec{C}_{b,m}^{-1}\varvec{\mu }_{b,m}=\sigma ^{-2}\varvec{Y}^T\varvec{\hat{s}}+\varvec{S}_{b,m}^{-1} \varvec{m}_{b,m} \end{array} \end{aligned}$$
(30)

The weights \({\tilde{w}}_m\) (\(\sum _{m=1}^M {\tilde{w}}_m =1\)) would be proportional to:

$$\begin{aligned} {\tilde{w}}_m \propto w_m \, |\varvec{D}_m |^{-1/2} \exp \left( -\frac{1}{2} (\hat{\varvec{s}}-\varvec{Y} \varvec{m}_{b,m})^T \varvec{D}_m^{-1} (\hat{\varvec{s}}-\varvec{Y} \varvec{m}_{b,m}) \right) \end{aligned}$$
(31)

where:

$$\begin{aligned} \varvec{D}_m =\sigma ^2 \varvec{I} + \varvec{Y} \varvec{S}_{b,m} \varvec{Y}^T \end{aligned}$$
(32)

Therefore inference tasks on the sought \({\varvec{u}}\) can be readily carried out by sampling \(\varvec{b}\) from the mixture-of-Gaussians posterior above and propagating those samples through the inverse map \(\varvec{b}_{NN}^{-1}\) to obtain \(\varvec{u}\)-samples. We note that by employing a mixture of Gaussians with sufficient components M, one can approximate with arbitrary accuracy any non-Gaussian density as well as capture multimodal posteriors, a task that is extremely cumbersome with standard Bayesian inference schemes [11].
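The update formulas above translate directly into code. The sketch below computes the posterior mixture parameters of Eqs. (30)-(32) and draws \(\varvec{u}\)-samples by mapping posterior \(\varvec{b}\)-samples through the inverse branch network; `branch_inverse` and the prior-mixture parameters (e.g. from the previous sketch) are assumed given.

```python
import jax
import jax.numpy as jnp
from jax import random
from jax.scipy.special import logsumexp

def posterior_mixture(Y, s_hat, sigma2, weights, means, covs):
    """Mixture-of-Gaussians posterior on b, Eqs. (29)-(32)."""
    post_means, post_covs, log_w = [], [], []
    for w_m, m_m, S_m in zip(weights, means, covs):
        C_inv = Y.T @ Y / sigma2 + jnp.linalg.inv(S_m)            # Eq. (30), first line
        C_m = jnp.linalg.inv(C_inv)
        mu_m = C_m @ (Y.T @ s_hat / sigma2 + jnp.linalg.solve(S_m, m_m))
        D_m = sigma2 * jnp.eye(Y.shape[0]) + Y @ S_m @ Y.T        # Eq. (32)
        r = s_hat - Y @ m_m
        log_w.append(jnp.log(w_m)                                  # Eq. (31), unnormalized
                     - 0.5 * jnp.linalg.slogdet(D_m)[1]
                     - 0.5 * r @ jnp.linalg.solve(D_m, r))
        post_means.append(mu_m)
        post_covs.append(C_m)
    log_w = jnp.array(log_w)
    w_tilde = jnp.exp(log_w - logsumexp(log_w))                    # normalized weights
    return w_tilde, post_means, post_covs

def sample_posterior_u(key, branch_inverse, w_tilde, mus, Cs, n=1000):
    """Draw b from the posterior mixture and map to u via the inverse branch net."""
    k1, k2 = random.split(key)
    comps = random.choice(k1, len(w_tilde), shape=(n,), p=w_tilde)
    b = jnp.stack([random.multivariate_normal(random.fold_in(k2, i),
                                              mus[int(c)], Cs[int(c)])
                   for i, c in enumerate(comps)])
    return jax.vmap(branch_inverse)(b)
```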

Table 1 Main dimensions for each numerical illustration
Table 2 Relative test errors and their standard deviations depending on the amount of labeled training data for the anti-derivative operator. The percentage of labeled data is given relative to the amount of unlabeled training data, e.g. in the \(10\%\) case we used ten times more unlabeled than labeled training data, whereas in the \(100\%\) case the amounts of labeled and unlabeled training data were the same

3 Numerical illustrations

We applied the proposed framework to three examples, i.e. the anti-derivative operator, a reaction-diffusion PDE as well as a Darcy-type elliptic PDE. In each of these cases, we report the relative errors of the forward and inverse maps (on test data) when trained with varying amounts of labeled and unlabeled training data. For the reaction-diffusion PDE and the Darcy-type elliptic PDE, we also use the proposed invertible-DeepONet surrogate to solve pertinent Bayesian inverse problems. The code for the aforementioned numerical illustrations is available online.Footnote 4 In Table 1, we summarize the most important dimensions for each of the following examples, namely D: the dimension of the PDE-input, K: the dimension of the observed PDE-output, \(N_l\): the number of labeled data (Eqs. (17), (18)), \(N_u\): the number of unlabeled data (e.g. Eq. (22)), \(N_{res}\): the number of interior collocation points (Eq. (21)) and \(N_{BC}\): the number of boundary collocation points (Eq. (20)).

3.1 Anti-derivative operator

As a first test case we considered the antiderivative operator on the interval \(\xi \in [0,1]\) with:

$$\begin{aligned} \frac{d s( \xi )}{d \xi } = u(\xi ) \text { with } s(0)=0 \end{aligned}$$
(33)

i.e. the input u corresponds to the right-hand side of this ODE and the operator \(\mathcal {G}(u)\) that we attempt to approximate is simply the integral operator \(\mathcal {G}(u)(\xi )=\int _0^{\xi } ~u(t)~dt\). We generated \(N_u=10000\) unlabeled training data by sampling inputs u from a Gaussian process with zero mean and an exponential quadratic covariance kernel with a length scale \(\ell =0.2\). Their values at the same \(D=100\) uniformly-distributed locations in [0, 1] were recorded. We subsequently randomly chose \(N_{res}=200\) collocation points to evaluate the residuals (see Eq. (21)).
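For concreteness, a minimal sketch of how such inputs can be generated is given below: draws from a zero-mean Gaussian process with a squared-exponential kernel of length scale 0.2, evaluated at D = 100 equispaced sensors in [0, 1]. The jitter added for numerical stability of the Cholesky factorization is an assumption, not a detail reported in the paper.

```python
import jax.numpy as jnp
from jax import random

def sample_gp_inputs(key, n_samples, D=100, length_scale=0.2, jitter=1e-4):
    """Draw n_samples functions u ~ GP(0, k) evaluated at D equispaced sensors in [0, 1]."""
    eta = jnp.linspace(0.0, 1.0, D)                               # sensor locations eta_1..eta_D
    K = jnp.exp(-0.5 * (eta[:, None] - eta[None, :]) ** 2 / length_scale ** 2)
    L = jnp.linalg.cholesky(K + jitter * jnp.eye(D))              # jitter for numerical stability
    z = random.normal(key, (n_samples, D))
    return z @ L.T                                                # rows are u(eta_1), ..., u(eta_D)

u_train = sample_gp_inputs(random.PRNGKey(0), 10000)              # N_u = 10000 unlabeled inputs
```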

Moreover, we used up to \(N_l=10000\) labeled training data, for which the inputs were generated as for the unlabeled training data, and the outputs were obtained by solving the ODE above and evaluating it at \(K=200\) randomly chosen points. We trained the invertible DeepONet on \(N_u=10000\) unlabeled training data with a batch size of 100. In each batch we added 1, 10 or 100 labeled training data points (i.e. \(N_l=100, 1000, 10000\) respectively in Eqs. (17), (18)). A minimum of one labeled datapoint is required in order to set the initial condition correctly, as we did not enforce this separately in the unlabeled loss part. With regards to the architecture of the networks used, we employed an MLP with four layers and 100 neurons each for the trunk network and 6 RealNVP building blocks for the branch network, which were parametrized by a two-layered MLP. Variations around these values in the number of neurons and layers were also explored (in the subsequent examples as well) and did not impact the performance significantly.

Using the ADAM optimizer and an initial learning rate of \(10^{-3}\), we ran the model training for \(4\times 10^4\) iterations with an exponential learning-rate decay with rate 0.9 every 1000 iterations. As test data, we used 1000 new (i.e. not included in the training data) input-output pairs and compared the predicted forward and inverse solutions with the actual ones. The results obtained in terms of the relative errors are summarized in Table 2.

The error values indicate that both the forward as well as the inverse maps are very well approximated by the proposed invertible DeepONet. The addition of more labeled training data results in even lower errors especially for the inverse map for which the relative error is decreased from almost \(\sim 4\%\) to \(\sim 2\%\).

In order to visualize the results we plot for four randomly-chosen test cases the predictions (when trained with \(10\%\) labeled data) of both the forward (Fig. 2) and inverse (Fig. 3) operator. In all cases, the predictions are indistinguishable from the reference functions.

Fig. 2: Forward map - Comparison of the true PDE-output/solution s (given a PDE-input u) with the one predicted by the proposed invertible DeepONet and for the anti-derivative operator

Fig. 3: Inverse map - Comparison of the true PDE-input u (given the PDE-output/solution s) with the one predicted by the proposed invertible DeepONet and for the anti-derivative operator

In Appendix A we include additional results for this problem with varying amounts of unlabeled and labeled training data in order to further show their influence.

Table 3 Relative errors on test data depending on the amount of labeled training data for the reaction-diffusion case

3.2 Reaction-diffusion dynamics

The second illustrative example involves the reaction-diffusion equation on the space-time domain \( \varvec{\xi }=(x,t) \in [0,1]\times [0,1]\):

$$\begin{aligned} \frac{ \partial s}{\partial t} = D_s \frac{ \partial ^2 s}{\partial x^2} + k s^2 + u(x) \end{aligned}$$
(34)

Here, \(D_s=0.01\) is the diffusion constant, \(k=0.01\) the reaction rate and the source term u(x) is chosen to be the PDE-input. We used zero initial and boundary conditions. We generated random source terms by sampling from a Gaussian process with zero mean and an exponential quadratic covariance kernel with a length scale \(\ell =0.2\), which were then evaluated at \(D=100\) uniformly distributed points over [0, 1]. The PDE was subsequently solved using an implicit Finite-Difference scheme and the solution was evaluated at 200 randomly chosen points to generate the labeled training data.

We trained our model with \(N_u=5000\) unlabeled data which were processed in batches of 100 samples and to which varying amounts of labeled data were added. Since for this problem the boundary conditions were enforced separately, the amount of labeled training data used could also be zero. All unlabeled training data points were evaluated at \(N_{res}=200\) randomly selected collocation points. With regards to the network architecture, we employed an MLP with five layers and 100 neurons each for the trunk network and 3 RealNVP building blocks for the branch network, which were parametrized by a three-layered MLP. Using the ADAM optimizer and an initial learning rate of \(10^{-3}\), we ran the model training for \(12\times 10^4\) iterations with an exponential learning-rate decay with rate 0.9 every 2000 iterations. For our test dataset, we generated 1000 new (unseen) source terms u and corresponding solutions s. A summary of the relative errors obtained is contained in Table 3.

We note that again for all three settings we achieve very low error rates, which decrease as the amount of labeled training data increases. In Figs. 4 and 5 we show the predictions (trained with 500 i.e. \(10\%\) labeled data) of both forward and inverse map for three randomly chosen test cases.

Fig. 4: Forward map - Comparison of the true PDE-output/solution s (given the PDE-input u) with the one predicted by the proposed invertible DeepONet and for Reaction-Diffusion PDE

Fig. 5: Inverse map - Comparison of the true PDE-input u (given the PDE-output/solution s) with the one predicted by the proposed invertible DeepONet and for Reaction-Diffusion PDE

3.3 Flow through porous media

In the final example we considered the Darcy-flow elliptic PDE in the two-dimensional domain \(\varvec{\xi }=(x_1,x_2) \in [0,1]^2\):

$$\begin{aligned} \nabla \cdot (u( \varvec{\xi }) \nabla s(\varvec{\xi }))= 10 \end{aligned}$$
(35)

where the PDE-input u corresponds to the permeability field. We assumed zero values for the solution s along all boundaries, which we incorporated a priori in our operator approximation by multiplying the DeepONet expansion in Eq. (7) with the polynomial \(x_1(1-x_1)~x_2(1-x_2)\). We used \(N_u=1000\) unlabeled training data points with \(N_{res}=3844\) collocation points (Eq. (21)) during training and added either no labeled training data at all (i.e. \(N_l=0\)) or \(N_l=1000\). In order to obtain the latter, we solved Eq. (35) with the Finite Element library FEniCS [29] on a \(128\times 128\) mesh with linear elements and evaluated the solution at 3844 regularly distributed points. We represent the PDE-input u as followsFootnote 5:

$$\begin{aligned} \ln (u)=&\sum _{f_1=1}^4 \sum _{f_2=1}^4 c_{f_1,f_2,1} \sin (f_1 x_1) \cos (f_2 x_2) \nonumber \\&\quad +c_{f_1,f_2,2}\sin (f_1 x_1) \sin (f_2 x_2) \nonumber \\&\quad +c_{f_1,f_2,3}\cos (f_1 x_1) \sin (f_2 x_2) \nonumber \\&\quad +c_{f_1,f_2,4}\cos (f_1 x_1) \cos (f_2 x_2) \end{aligned}$$
(36)

using 64 feature functions and corresponding coefficients c. In order to generate the training data, we sampled each of the aforementioned 64 coefficients from a uniform distribution in [0, 1]. In this example the 64-dimensional vector of the c’s serves as the input to the branch network (i.e. \(D=64\)). With the help of the c’s and of Eq. (36), one can reconstruct the full permeability field.
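The parametrization of Eq. (36) is straightforward to evaluate; the sketch below reconstructs ln(u) at a point from a coefficient array. The particular ordering of the 64 coefficients into a (4, 4, 4) array is an illustrative assumption.

```python
import jax.numpy as jnp
from jax import random

def log_permeability(c, x1, x2):
    """Evaluate ln(u)(x1, x2) from coefficients c of shape (4, 4, 4), cf. Eq. (36)."""
    val = 0.0
    for f1 in range(1, 5):
        for f2 in range(1, 5):
            s1, c1 = jnp.sin(f1 * x1), jnp.cos(f1 * x1)
            s2, c2 = jnp.sin(f2 * x2), jnp.cos(f2 * x2)
            val += (c[f1 - 1, f2 - 1, 0] * s1 * c2 + c[f1 - 1, f2 - 1, 1] * s1 * s2
                    + c[f1 - 1, f2 - 1, 2] * c1 * s2 + c[f1 - 1, f2 - 1, 3] * c1 * c2)
    return val

# Training inputs: each of the 64 coefficients is drawn uniformly from [0, 1]
c_sample = random.uniform(random.PRNGKey(0), (4, 4, 4))
u_value = jnp.exp(log_permeability(c_sample, 0.3, 0.7))   # permeability at (x1, x2) = (0.3, 0.7)
```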

With regards to the network architecture, we employed an MLP with five layers and 64 neurons each for the trunk network and 3 RealNVP building blocks for the branch network, which were parametrized by a three-layered MLP. Using the ADAM optimizer and an initial learning rate of \(10^{-3}\), we ran the model training for \(10^5\) iterations with an exponential learning-rate decay with rate 0.9 every 2000 iterations. We tested the trained model on 2500 unseen test data and obtained the results in Table 4. As in the previous examples, the inclusion of labeled data significantly improves the predictive accuracy of the trained model. Without labeled data, the predictive accuracy of the forward map is only slightly lower, whereas the accuracy of the inverse map is comparatively low. The addition of labeled data improves the predictive accuracy for both maps.

Table 4 Relative errors on test data depending on the amount of labeled training data for the Darcy example with feature coefficients as inputs

In Fig. 6 we compare the reference solution for two illustrative test cases with the forward-map prediction learned with labeled training data. As suggested by the cumulative results in Table 4, the two predictions are very close to the reference and the accuracy is very high. In Figs. 7 (without labeled training) and 8 (with labeled training) the results for two illustrative inverse test cases are shown. While locally the error can be significant, the main characteristics of the PDE-input field u can be captured.

Fig. 6: Forward map - Comparison of the true PDE-output/solution s (given feature coefficients as the PDE-input u) with the one predicted by the proposed invertible DeepONet and for Darcy-type PDE

Fig. 7: Inverse map - Comparison of the reconstructed PDE-input (given the PDE-output/solution s) with the one predicted by the proposed invertible DeepONet and for Darcy-type PDE with zero labeled training data

Fig. 8: Inverse map - Comparison of the reconstructed PDE-input (given the PDE-output/solution s) with the one predicted by the proposed invertible DeepONet and for Darcy-type PDE

We discuss in the next section the case where the input permeability field u is not represented with respect to some feature functions but rather as a discretized continuous field [13].

3.3.1 Coarse-grained (CG) input parameters

In this sub-case, we modeled the permeability field u with an exponentiated (to ensure positivity) Gaussian Process with mean zero and exponential quadratic covariance with length scale \(\ell =0.1\). The PDE was then again solved on a \(128 \times 128\) FE mesh and the values of the solution s were assumed to be observed at 3844 regularly distributed points. We moreover sub-sampled the generated PDE input on a regular \(8\times 8\) grid and its \(D=64\) values represented the branch network input \(\varvec{u}\). We generated \(N_u=1000\) unlabeled fields u in total and used \(N_{res}=3844\) collocation points (Eq. (21)) during training. We also trained the model with \(N_l=1000\) labeled training data.

The results obtained can be found in Table 5. The test data in this table consists of 2500 unseen, discretized, permeability fields and their respective solutions. The error rates are computed with respect to the coarse-grained reference input. As in the previous setting, we observe a significant improvement in the accuracy of the inverse map when labeled data are used in training.

In Fig. 9 we compare the reference solution for two illustrative test cases with the forward-map prediction learned with labeled training data. As suggested by the cumulative results in Table 5, the two predictions are very close to the reference and the accuracy is very high.

In Figs. 10 (without labeled training) and 11 (with labeled training) the results for two illustrative inverse test cases are shown. We note again that the main features of the PDE-input’s spatial variability are captured, despite the presence of localized errors.

3.4 Bayesian inverse problems

In this section we demonstrate the utility of the invertible DeepONet proposed in the solution of Bayesian inverse problems and in obtaining accurate approximations of the posterior without any need for additional reference model runs nor for any costly and asymptotically-exact sampling. For each of the examples considered, only one observed output \(\hat{\varvec{s}}\) was assumed to be given. The variance of the observational noise \(\sigma ^2\) was assumed to be given although this could readily be inferred, especially if a conjugate inverse-Gamma prior was used for it. In this manner, any deviations from the actual posterior could be attributed to inaccuracies of the DeepONet-based surrogate. Errors due to the approximation of the prior with a mixture of Gaussians as in Eq. (27) can be made arbitrarily small by increasing the number of mixture components M.

3.4.1 Reaction-diffusion dynamics

We employed the trained model of the reaction-diffusion system (with \(10\%\) labeled training data), in combination with the formulation detailed in Sect. 2.5, for approximating the posterior. We used a prior \(p_u(\varvec{u})\) arising from the discretization of a Gaussian Process with zero mean and an exponential quadratic covariance kernel with a length scale \(\ell =0.2\). For the Gaussian mixture models involved for the prior and subsequently the posterior on \(\varvec{b}\), we used two components, i.e. \(M=2\) in Eqs. (27), (29). The results can be seen in the following Figures. The obtained posterior encapsulates the true parameter input in all three cases.

Table 5 Relative errors on test data depending on the amount of labeled training data for the Darcy example with coarse-grained input parameters

In Fig. 12 we used test cases with 100 observed solution data points for each parameter input and a noise level of \(\sigma ^2=0.001\) (see Eq. (23)). In Fig. 13 we increased the noise level ten-fold, to \(\sigma ^2=0.01\) and, as expected, so did the posterior uncertainty. In Fig. 14 we used \(\sigma ^2=0.001\) but decreased the number of observations of the PDE-solution to 25 points (instead of 100). As expected, this led to an increase in posterior uncertainty.

Fig. 9: Forward map - Comparison of the true PDE-output/solution s (given the coarse-grained PDE-input u) with the one predicted by the proposed invertible DeepONet and for Darcy-type PDE

Fig. 10: Inverse map - Comparison of the coarse-grained PDE-input u (given the PDE-output/solution s) with the one predicted by the proposed invertible DeepONet and for Darcy-type PDE with zero labeled training data

Fig. 11: Inverse map - Comparison of the coarse-grained PDE-input (given the PDE-output/solution s) with the one predicted by the proposed invertible DeepONet and for Darcy-type PDE

Our method can therefore be used as a fast approach for generating an approximate posterior without any need for optimization or MCMC sampling. We note that the posterior uncertainty increases if the number of observations decreases or if the observation noise \(\sigma ^2\) increases. In Appendix B, we show the excellent agreement of the approximate posterior computed here with the actual one as obtained by costly and time-consuming MCMC simulations.

3.4.2 Flow through porous media

We also solved a Bayesian inverse problem in the context of the Darcy-type PDE by using our trained model of Sect. 3.3 with added labeled training data. We computed an approximate posterior based on the algorithm presented in Sect. 2.5 and compared it with the true PDE-input. For the Gaussian mixture models involved for the prior and subsequently the posterior on \(\varvec{b}\) we used two mixture components i.e. \(M=2\) in Eqs. (27), (29).

Fig. 12: Bayesian Inverse Problem for Reaction-Diffusion PDE: 100 observed data points with \(\sigma ^2=0.001\) and a 100-dimensional parametric input

Fig. 13: Bayesian Inverse Problem for Reaction-Diffusion PDE: 100 observed data points with \(\sigma ^2=0.01\) and a 100-dimensional parametric input

Fig. 14: Bayesian Inverse Problem for Reaction-Diffusion PDE: 25 observed data points with \(\sigma ^2=0.001\) and a 100-dimensional parametric input

Firstly, we considered permeability fields represented with respect to 64 known feature functions as described in Sect. 3.3. The 64 coefficients c (Eq. (36)) represented the sought PDE-inputs and a uniform prior in \([0,1]^{64}\) was employed. The results in terms of the permeability field u can be seen in the following Figures. The obtained posterior is in good agreement with the ground truth, i.e. the PDE-input field used to generate the data with the PDE-solver.

In particular, in Fig. 15 we assumed that 3844 observations of the PDE-output were available, on a \(62 \times 62\) regular grid. The synthetically generated data were contaminated with Gaussian noise with \(\sigma ^2=0.001\) (see Eq. (23)). In Fig. 16 we increased the noise level and consequently the posterior uncertainty was slightly higher, but the posterior mean is still close to the ground truth. In Fig. 17 we used \(\sigma ^2=0.01\) but decreased the number of observations by \(50\%\) to 1922. As expected, the posterior uncertainty increased again but still encapsulated the ground truth.

Finally, we considered the case where the PDE-input is represented on a regular \(8\times 8\) grid as in Sect. 3.3.1. The discretized GP described therein was used as the prior. In Fig. 18 we compare the ground truth with the posterior mean and standard deviation as obtained from 3844 observations on a \(62 \times 62\) regular grid and for a noise level of \(\sigma ^2=0.01\) (see Eq. (23)). In Fig. 19 we used a lower noise level of \(\sigma ^2=0.001\) and, as expected, the posterior uncertainty was lower and the posterior mean was closer to the ground truth. In Fig. 20 we again chose the previous noise level but decreased the number of observations by half, to 1922. As expected, the posterior uncertainty increased but still encapsulated the ground truth.

4 Conclusions

Fig. 15: Bayesian Inverse Problem for Darcy-type PDE: 3844 observed data points with \(\sigma ^2=0.001\) and a 64-dimensional parametric input representing feature coefficients

We introduced an invertible DeepONet architecture for constructing data-driven surrogates of PDEs with parametric inputs. The use of the RealNVP architecture in the branch network enables one to obtain simultaneously accurate approximations of both the forward and the inverse map (i.e. from PDE-solution to PDE-input). The latter is particularly useful for deterministic and stochastic (Bayesian), PDE-based inverse problems, for which accurate solutions can be readily obtained once the proposed DeepONet has been trained offline. The training framework can make use of expensive, labeled data (i.e. PDE input-output pairs) as well as inexpensive, unlabeled data (i.e. only PDE-inputs) by incorporating residuals of the governing PDE and its boundary/initial conditions into the loss function. The use of labeled data was generally shown to improve predictive accuracy, especially in terms of the inverse map, which is something that warrants further investigation.

Fig. 16: Bayesian Inverse Problem for Darcy-type PDE: 3844 observed data points with \(\sigma ^2=0.01\) and a 64-dimensional parametric input representing feature coefficients

Fig. 17: Bayesian Inverse Problem for Darcy-type PDE: 1922 observed data points with \(\sigma ^2=0.01\) and a 64-dimensional parameter input representing feature coefficients

Fig. 18: Bayesian Inverse Problem for Darcy-type PDE: 3844 observations with \(\sigma ^2=0.01\) and a 64-dimensional, discretized permeability field

Fig. 19: Bayesian Inverse Problem for Darcy-type PDE: 3844 observations with \(\sigma ^2=0.001\) and a 64-dimensional discretized permeability field

Fig. 20: Bayesian Inverse Problem for Darcy-type PDE: 1922 observations with \(\sigma ^2=0.01\) and a 64-dimensional discretized permeability field

In the case of Bayesian formulations in particular, we showed that the availability of the inverse map can lead to highly-efficient approximations of the sought posterior without the need of additional PDE solves and without any cumbersome sampling (e.g. due to MCMC, SMC) or iterations (e.g. due to SVI).

The performance of the proposed strategy was demonstrated on several PDEs with modest- to high-dimensional parametric inputs and its efficiency was assessed in terms of the amounts of labeled vs unlabeled data. Furthermore, the approximate posterior obtained was in very good agreement with the exact posterior obtained with the reference solver and MCMC. The accuracy persisted for various levels of noise in the data as well as when changing the number of available observations. We note finally that unbiased estimates with respect to the exact posterior could be readily obtained with Importance Sampling, using the approximate posterior as the importance sampling density. This would nevertheless imply additional PDE solves, which we would expect to be modest in number given the accuracy of the approximation, i.e. the proximity of the importance sampling density to the actual posterior.

Fig. 21: Relative errors on test data for the forward and inverse map depending on the amount of labeled and unlabeled training data