Semi-supervised invertible neural operators for Bayesian inverse problems

Neural Operators offer a powerful, data-driven tool for solving parametric PDEs as they can represent maps between infinite-dimensional function spaces. In this work, we employ physics-informed Neural Operators in the context of high-dimensional, Bayesian inverse problems. Traditional solution strategies necessitate an enormous, and frequently infeasible, number of forward model solves, as well as the computation of parametric derivatives. In order to enable efficient solutions, we extend Deep Operator Networks (DeepONets) by employing a RealNVP architecture which yields an invertible and differentiable map between the parametric input and the branch-net output. This allows us to construct accurate approximations of the full posterior, irrespective of the number of observations and the magnitude of the observation noise, without any need for additional forward solves nor for cumbersome, iterative sampling procedures. We demonstrate the efficacy and accuracy of the proposed methodology in the context of inverse problems for three benchmarks: an anti-derivative equation, reaction-diffusion dynamics and flow through porous media.


Introduction
Nonlinear Partial Differential Equations (PDEs) depending on high- or even infinite-dimensional parametric inputs are ubiquitous in applied physics and engineering and appear in the context of several problems such as model calibration and validation or model-based design/optimization/control. In all these cases, they must be solved repeatedly for different values of the input parameters, which poses an often insurmountable obstacle as each of these simulations can imply a significant computational cost. An obvious way to overcome these difficulties is to develop less expensive but accurate surrogates, which can be used on their own or in combination with a reduced number of runs of the high-fidelity, expensive, reference solver. The construction of such surrogates has been based on physical/mathematical considerations or on data, i.e. input-output pairs (and sometimes derivatives). Our contribution belongs to the latter category of data-driven surrogates, which has attracted a lot of attention in recent years due to the significant progress in the fields of statistical and machine learning (Koutsourelakis et al, 2016; Karniadakis et al, 2021). We emphasize however that, unlike typical supervised learning problems in the data sciences, the context of computational physics has several distinguishing features. Firstly, surrogate construction is by definition a Small (or smallest possible) Data problem: the reason we want a surrogate in the first place is to avoid using the reference solver, which is the one that generates the training data. Secondly, pertinent problems are rich in domain knowledge which should be incorporated as much as possible, not only in order to reduce the requisite training data but also to achieve higher predictive accuracy, particularly in out-of-distribution settings.
In the context of the Bayesian inverse problems which we investigate in this paper, one does not know a priori where the posterior might be concentrated in the parametric space and cannot guarantee that all such regions will be sufficiently represented in the training dataset. Nevertheless, the learned surrogate must be accurate enough in these regions in order to resolve the sought posterior.
Data-driven surrogates which are trained in an offline phase and are subsequently used for various downstream tasks have attracted a lot of attention in recent years (Bhattacharya et al, 2020). Most of these surrogates are constructed by learning a non-linear operator, i.e. a mapping between function spaces and thus between the inputs and the outputs of the PDE, which may depend on additional input parameters. A notable such strategy based on Deep Learning are the Physics-Informed Neural Networks (PINNs) (Lagaris et al, 1998; Raissi et al, 2019). An alternative is offered by Deep Operator Networks (DeepONets; Lu et al, 2021), which, in contrast to PINNs, not only take the spatial and temporal location as an input but can also account for the dependence of the PDE solution on input parameters such as the viscosity in the Navier-Stokes equations. Furthermore, Fourier Neural Operators (Li et al, 2020) have shown promising results by parametrizing the integral kernel directly in Fourier space and thus restricting the operator to a convolution. Finally, the Learning Operators with Coupled Attention (LOCA) framework (Kissas et al, 2022) builds upon the well-known attention mechanism that has already shown promising results in natural language processing. We note that all of the Deep Learning frameworks mentioned satisfy a universal approximation theorem and, under certain conditions, can approximate the non-linear operator to arbitrary accuracy. Another option is offered by the Optimizing a Discrete Loss (ODIL; Karnakov et al, 2022) framework. It does not rely on Deep Learning and was shown to be faster than PINNs due to the reduced number of tunable parameters, but it can only approximate the solution on a discrete grid.
Apart from the aforementioned techniques, and for time-dependent PDEs in particular, the solution can be approximated by methods based on Koopman-operator theory (Koopman, 1931), which identifies a transformation of the original system that gives rise to linear dynamics (Klus et al, 2018). Nevertheless, these methods (Lee and Carlberg, 2020; Gin et al, 2019; Champion et al, 2019) usually require a large set of reduced-order coordinates or an effective encoder/decoder structure. Especially for physical systems, the restricted dynamics can be endowed with stability and physical, inductive bias (Kaltenbach and Koutsourelakis, 2021; Kalia et al, 2021; Kaltenbach and Koutsourelakis, 2020).
A common limitation of the aforementioned architectures is that they usually learn only the forward operator, whereas for the solution of an inverse problem its inverse would be more useful. In this work, we extend the DeepONet framework by replacing parts of the previously proposed neural-network architecture with an invertible one. To the authors' best knowledge, we are thus presenting the first invertible Neural Operator framework. This allows one to perform both forward and inverse passes with the same neural network, and the forward and inverse operators can be learned simultaneously. In particular, we make use of the RealNVP architecture (Dinh et al, 2016), which has an analytical inverse. Furthermore, we make use of both labeled and unlabeled (i.e. only inputs and residuals) training data in a physics-aware, semi-supervised approach. While the use of labeled training data is straightforward, unlabeled training data are incorporated by using the governing equations and minimizing the associated residuals, similarly to the physics-informed DeepONet (Wang et al, 2021). Since it is easier and less expensive to procure unlabeled data in comparison to labeled ones, this leads to significant efficiency gains. Even though our algorithm can produce accurate predictions without any labeled training data and by using only a physics-informed loss, we observe empirically that the addition of labeled training data generally improves the results. Finally, we show that the proposed invertible DeepONet can be used to very efficiently solve Bayesian inverse problems, i.e. to approximate the whole posterior distribution, without any need for multiple likelihood evaluations and cumbersome iterations as required by alternative inference schemes such as Markov Chain Monte Carlo (MCMC; Beskos et al, 2017), Sequential Monte Carlo (SMC; Koutsourelakis, 2009) or Stochastic Variational Inference (SVI; Detommaso et al, 2018).
In particular, we propose a novel approximation that employs a mixture of Gaussians, the parameters of which are computed semi-analytically. When the proposed Neural Operator framework is trained solely on unlabeled data, this means that we can obtain the solution to the (forward and) inverse problem without ever solving the underlying PDE. While Deep Learning has been successfully applied to inverse problems before (Adler and Öktem, 2017; Ardizzone et al, 2018; Mo et al, 2019), our work differs by making use of a fully invertible, operator-learning architecture, which leads to a highly efficient approximation of the whole posterior.
The rest of the paper is structured as follows. In section 2 we review the basic elements of invertible neural networks (NNs) and DeepONets and subsequently illustrate how these can be combined and trained with labeled and unlabeled data. Furthermore, we present how the resulting invertible DeepONet can be employed in order to approximate the posterior of a model-based, Bayesian inverse problem at minimal additional cost. We illustrate several features of the proposed methodology and assess its performance in section 3, where it is applied to a reaction-diffusion PDE and a Darcy-diffusion problem. The cost and accuracy of the posterior approximation in the context of pertinent Bayesian inverse problems are demonstrated in section 3.4. Finally, we conclude in section 4 with a summary of the main findings and a discussion of the (dis)advantages of the proposed architecture and potential avenues for improvement.

Methodology
We first review some basic concepts of invertible neural networks and DeepONets. We subsequently present our novel contributions which consist of an invertible DeepONet architecture and its use for solving efficiently Bayesian inverse problems.

Invertible Neural Networks
Neural networks are in general not invertible, which restricts their application in problems requiring inverse operations. Invertibility can be achieved by adding a momentum term (Sander et al, 2021), by restricting the Lipschitz constant of each layer to be smaller than one (Behrmann et al, 2019) or by using special building blocks (Dinh et al, 2016). These formulations have primarily been developed for flow-based architectures, but we apply them to operator learning within this work. In particular, we make use of the RealNVP (Dinh et al, 2016), as this architecture admits an analytical inverse, which ensures efficient computations. Each RealNVP building block consists of the transformation below, which involves two neural networks denoted by k(·) and r(·). Given a D-dimensional input x = {x_i}_{i=1}^D of an invertible layer, the output y = {y_i}_{i=1}^D is obtained as follows:

y_{1:d} = x_{1:d},    (1)
y_{d+1:D} = x_{d+1:D} ∘ exp(k(x_{1:d})) + r(x_{1:d}).    (2)

Here, ∘ is the Hadamard or element-wise product and d is usually chosen to be half the dimension of the input vector, i.e. d = D/2. As only d of the components are updated, the input entries after each building block are permuted, e.g. by reversing the vector, to ensure that after a second building block all of them are modified. Therefore, for d = D/2, at least two building blocks are needed in order to modify all entries. We note that the dimension of the input cannot change, i.e. it needs to be identical to the dimension of the output. The two neural networks involved can consist of arbitrary layers as long as their input and output dimensions are consistent with Equation (2). The maps defined can be easily inverted, which leads to the following equations:

x_{1:d} = y_{1:d},    (3)
x_{d+1:D} = (y_{d+1:D} − r(y_{1:d})) ∘ exp(−k(y_{1:d})).    (4)

We note that, due to this structure, the Jacobian is lower-triangular and its determinant can be obtained by multiplying the diagonal entries only.
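The coupling transformation and its analytical inverse can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the two fixed affine-plus-tanh maps stand in for the trained networks k(·) and r(·), and all weights are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3  # input dimension and split point, d = D/2

# Placeholder "networks" k(.) and r(.) (stand-ins for trained MLPs)
Wk = 0.1 * rng.standard_normal((d, d))
Wr = rng.standard_normal((d, d))

def k(x1):  # scale network
    return np.tanh(x1 @ Wk)

def r(x1):  # translation network
    return x1 @ Wr

def coupling_forward(x):
    x1, x2 = x[:d], x[d:]
    y1 = x1                              # first d entries pass through unchanged
    y2 = x2 * np.exp(k(x1)) + r(x1)      # affine update of the remaining entries
    return np.concatenate([y1, y2])

def coupling_inverse(y):
    y1, y2 = y[:d], y[d:]
    x1 = y1
    x2 = (y2 - r(y1)) * np.exp(-k(y1))   # analytical inverse, no iteration needed
    return np.concatenate([x1, x2])

def log_det_jacobian(x):
    # The Jacobian is triangular: log|det| is just the sum of the log-scales, O(D)
    return np.sum(k(x[:d]))

x = rng.standard_normal(D)
y = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # exact round trip
```

Stacking two such blocks with a permutation in between updates all D entries, as described above.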

DeepONets
Before presenting our novel architecture for invertible DeepONets, we briefly review the original DeepONet formulation by Lu et al (2021). DeepONets have been developed to solve parametric PDEs and significantly extend the Physics-Informed Neural Network (PINN; Raissi et al, 2019) framework, as no additional training phase is required if the input parameters of the PDE are changed. We consider a, potentially nonlinear and time-dependent, PDE with an input function u ∈ U and solution function s ∈ S, where U, S are appropriate Banach spaces. The former can represent e.g. source terms, boundary or initial conditions, or material properties. Let

N(u, s)(ξ) = 0    (5)

denote the governing PDE, where N : U × S → V is an appropriate differential operator and ξ the spatio-temporal coordinates. Furthermore, let

B(u, s)(ξ) = 0    (6)

denote the operator B : U × S → V associated with the boundary or initial conditions. Assuming that the solution s for each u ∈ U is unique, we denote with G : U → S the solution operator that maps any input u to the corresponding solution s. The goal of DeepONets is to approximate it with an operator G_θ that depends on tunable parameters θ. The latter can yield an approximation to the actual solution at any spatio-temporal point ξ, which we denote by G_θ(u)(ξ). It is based on a separated representation:

G_θ(u)(ξ) = Σ_{j=1}^{Q} b_j(u) t_j(ξ),    (7)

and consists of the so-called branch network, whose terms b_j depend on the values of the input function u at F fixed spatio-temporal locations {η_l}_{l=1}^F, which we summarily denote with the vector u ∈ R^F, and the so-called trunk network, whose terms t_j depend on the spatio-temporal coordinates ξ (see Figure 2.2). Both networks have trainable weight and bias parameters, which we denote collectively by θ.
We emphasize that, once trained, the DeepONet can provide predictions of the solution at any spatio-temporal location ξ, a feature that is very convenient in the context of inverse problems, as the same DeepONet can be used for solving problems with different sets of observations. We note that in the next section we will use a vectorized formulation of Equation (7) and process various spatio-temporal coordinate datapoints together, as this is needed to ensure invertibility of the DeepONet. Labeled data can be used for training, which consist of pairs of inputs u and corresponding solutions s = G(u) evaluated at certain spatio-temporal locations. Unlabeled training data (i.e. only inputs) can also be employed in a physics-informed approach, as introduced by Wang et al (2021), by including the governing PDE of Equation (5) in an additional loss term, as discussed in Section 2.4.
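The separated representation of Equation (7) can be sketched as follows. This is a toy NumPy sketch: random affine-plus-tanh maps stand in for the trained branch and trunk networks, and the sensor count F, expansion size Q and polynomial trunk features are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
F, Q = 100, 20  # number of fixed input sensors, size of the expansion

# Placeholder branch/trunk "networks" (untrained stand-ins)
Wb = rng.standard_normal((F, Q)) / np.sqrt(F)
Wt = rng.standard_normal((3, Q))

def branch(u_vec):
    # b_j(u), j = 1..Q, from the input u sampled at F fixed sensor locations
    return np.tanh(u_vec @ Wb)

def trunk(xi):
    # t_j(xi) from the spatio-temporal coordinate(s); simple polynomial features
    feats = np.stack([np.ones_like(xi), xi, xi**2], axis=-1)
    return np.tanh(feats @ Wt)

def G_theta(u_vec, xi):
    # separated representation: sum_j b_j(u) t_j(xi), evaluated at all xi at once
    return trunk(xi) @ branch(u_vec)

u_vec = rng.standard_normal(F)
xi = np.linspace(0.0, 1.0, 50)
s_pred = G_theta(u_vec, xi)   # predictions at 50 arbitrary query locations
assert s_pred.shape == (50,)
```

Note how the trunk can be queried at any ξ, which is exactly the property exploited for inverse problems with arbitrary observation locations.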

Invertible DeepONets
The invertible RealNVP introduced in Section 2.1 is employed exclusively in the branch network, i.e. we assume that:

b = b_NN(u) ∈ R^D,    (8)

where the input x of Section 2.1 is the vector u ∈ R^F containing the values of the PDE-input at the D = F sensor locations. We note that this restriction regarding the equality of the dimensions of the input u and of the branch-network output b is due to the use of an invertible architecture. As a consequence, the dimension of the trunk-network output {t_j(ξ)}_{j=1}^Q is also the same as the dimension of u, i.e. Q = D. This requirement does not reduce the generality of the methodology advocated, as Q is a free parameter in the definition of the operator G_θ in Equation (7).
In view of the inverse problems we would like to address, we consider K spatio-temporal locations, {ξ k } K k=1 and we denote with s ∈ R K the vector containing the PDE-solution's values at these locations i.e. s = [s(ξ 1 ), . . . , s(ξ K )] T .
Finally, we denote with Y the K × D matrix constructed from the values of the trunk-network outputs at the aforementioned locations, i.e.:

Y_{kj} = t_j(ξ_k),  k = 1, …, K,  j = 1, …, D.    (9)

As a result of Equation (7), we can write that:

s = Y b.    (10)

As the matrix Y is in general non-invertible, one can determine b given s by solving a least-squares problem, i.e.:

b = argmin_b ||s − Y b||² = (Y^T Y)^{-1} Y^T s,    (11)

or a better-behaved, regularized version thereof:

b = (Y^T Y + ε I)^{-1} Y^T s,    (12)

where a small value is generally sufficient for the regularization parameter ε ≪ 1. We note that, given s and once b has been determined by solving Equation (11) or Equation (12), we can make use of the invertibility of the branch net in order to obtain the input vector u. While other approaches are possible in order to determine b, we recommend the regularized least-squares formulation, as it led to robust results in our experiments. It is nevertheless important to use the same method during training and when deterministic predictions are sought, since different methods can lead to different b's for the same s. We note that in the proposed method for the solution of Bayesian inverse problems (see Section 2.5), no use of Equation (12) is made except for the training of the DeepONet (see Section 2.4).
For the ensuing equations, we denote the forward map implied by Equation (10) as:

s = s_NN(u; θ) = Y b_NN(u),    (13)

and the inverse obtained by the two steps described above as:

u = u_NN(s; θ) = b_NN^{-1}((Y^T Y + ε I)^{-1} Y^T s),    (14)

where we explicitly account for the NN parameters θ.
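The two-step inversion described above, i.e. a regularized least-squares solve for b followed by the analytical branch-net inverse, can be sketched as follows. This is a toy NumPy example in which a single RealNVP-style coupling block with placeholder weights stands in for the trained branch net, and a random matrix stands in for the trunk-output matrix Y; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, K = 8, 4, 30  # branch dimension, coupling split, number of observations

# One coupling block as a toy invertible branch net b_NN
Wk = 0.1 * rng.standard_normal((d, d))
Wr = rng.standard_normal((d, d))
k = lambda x1: np.tanh(x1 @ Wk)
r = lambda x1: x1 @ Wr

def b_nn(u):
    u1, u2 = u[:d], u[d:]
    return np.concatenate([u1, u2 * np.exp(k(u1)) + r(u1)])

def b_nn_inv(b):
    b1, b2 = b[:d], b[d:]
    return np.concatenate([b1, (b2 - r(b1)) * np.exp(-k(b1))])

# Trunk-output matrix Y (K x D); random stand-in for t_j evaluated at xi_k
Y = rng.standard_normal((K, D))

def forward(u):
    # cf. Eq. (13): s = Y b_NN(u)
    return Y @ b_nn(u)

def inverse(s, eps=1e-6):
    # cf. Eqs. (12), (14): regularized least squares, then the analytical inverse
    b = np.linalg.solve(Y.T @ Y + eps * np.eye(D), Y.T @ s)
    return b_nn_inv(b)

u = rng.standard_normal(D)
s = forward(u)
u_rec = inverse(s)
assert np.allclose(u_rec, u, atol=1e-4)  # round trip up to the regularization
```

The residual round-trip error here comes only from the ε-regularization of the least-squares solve; the coupling-block inversion itself is exact.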

A Semi-supervised Approach for Invertible DeepONets
As mentioned earlier, in order to train the proposed invertible DeepONet, i.e. to find the optimal values of the parameters θ, we employ both labeled data (i.e. pairs of PDE-inputs u and PDE-outputs s) and unlabeled data (i.e. only PDE-inputs u) in combination with the governing equations. The loss function L employed is therefore decomposed into two parts as³:

L = L_labeled + L_unlabeled.    (15)

The first term L_labeled pertains to the labeled data and is further decomposed as:

L_labeled = L_forward + L_inverse.    (16)

Without loss of generality, and in order to keep the notation as simple as possible, we assume that N_l pairs of labeled data are available, each of which consists of the values of the PDE-input u at D locations, which we denote with u^(i) ∈ R^D, i = 1, …, N_l, and the values of the PDE-output at K spatio-temporal locations, which we denote with s^(i) ∈ R^K. In view of Equation (9) and of the forward (Equation (13)) and inverse (Equation (14)) maps defined earlier, we write:

L_forward = (1/N_l) Σ_{i=1}^{N_l} || s^(i) − Y b_NN(u^(i)) ||²,    (17)

and:

L_inverse = (1/N_l) Σ_{i=1}^{N_l} || u^(i) − b_NN^{-1}((Y^T Y + ε I)^{-1} Y^T s^(i)) ||².    (18)

By employing both loss terms, the NN parameters θ can balance the accuracy of the approximation of both maps. Furthermore, assuming N_u PDE-inputs are available, each of which is evaluated at D spatio-temporal points {ξ^(l)}_{l=1}^D, with u^(i) ∈ R^D denoting these values, we express the L_unlabeled loss term as:

L_unlabeled = L_BC + L_res + L_u,inverse.    (19)

The first term L_BC and the second term L_res are physics-informed (Wang et al, 2021) and account for the residuals in the boundary (and/or initial) conditions and the governing PDE, respectively. In the case of L_BC, we select N_B (uniformly distributed) points along the boundary, say ξ^(j).

³ All loss functions depend on θ, which we omit in order to simplify the notation.
Then, in view of Equation (6), we employ:

L_BC = (1/(N_u N_B)) Σ_{i=1}^{N_u} Σ_{j=1}^{N_B} || B(u^(i), G_θ(u^(i)))(ξ^(j)) ||².    (20)

In the interior of the problem domain, and in view of Equation (5), we employ the loss:

L_res = (1/(N_u N_res)) Σ_{i=1}^{N_u} Σ_{j=1}^{N_res} || N(u^(i), G_θ(u^(i)))(ξ^(j)) ||².    (21)

The third term L_u,inverse pertains to the forward and inverse maps in Equations (13), (14) and can be expressed as:

L_u,inverse = (1/N_u) Σ_{i=1}^{N_u} || u^(i) − b_NN^{-1}((Y^T Y + ε I)^{-1} Y^T Y b_NN(u^(i))) ||²,    (22)

with Y as in Equation (9).
The minimization of the combined loss L, with respect to the NN parameters θ of the branch and trunk networks, is performed with stochastic gradient descent, in particular the ADAM scheme (Kingma and Ba, 2014). Gradients of the loss were computed using the automatic differentiation tools of the JAX library (Bradbury et al, 2018). We finally note that the D spatio-temporal locations need not be the same, nor do they need to be equal in number, in all data instances, as assumed in the equations above. In such cases, the vectors of observables and the matrices Y involved would differ, which would further complicate the notation, but the same DeepONet parameters θ would appear in all terms.
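The assembly of the combined loss can be illustrated for the anti-derivative example of Section 3. This is a hypothetical NumPy sketch, not the trained model: untrained linear maps stand in for the forward and inverse surrogates, and the ODE residual ds/dξ − u is formed with finite differences instead of automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 50
xi = np.linspace(0.0, 1.0, D)

# Hypothetical linear stand-ins for the trained surrogate maps
A = rng.standard_normal((D, D)) / np.sqrt(D)

def s_nn(u):
    # forward-map surrogate, cf. Eq. (13)
    return A @ u

def u_nn(s, eps=1e-6):
    # inverse-map surrogate via regularized least squares, cf. Eq. (14)
    return np.linalg.solve(A.T @ A + eps * np.eye(D), A.T @ s)

def combined_loss(u_lab, s_lab, u_unlab):
    # labeled part: forward and inverse reconstruction errors (cf. Eqs. (17)-(18))
    L_fwd = np.mean((s_nn(u_lab) - s_lab) ** 2)
    L_inv = np.mean((u_nn(s_lab) - u_lab) ** 2)
    # unlabeled, physics-informed part: residual of ds/dxi = u (cf. Eq. (21))
    s_pred = s_nn(u_unlab)
    L_res = np.mean((np.gradient(s_pred, xi) - u_unlab) ** 2)
    return L_fwd + L_inv + L_res

u_lab = rng.standard_normal(D)
s_lab = rng.standard_normal(D)
u_unlab = rng.standard_normal(D)
L = combined_loss(u_lab, s_lab, u_unlab)
assert np.isfinite(L) and L >= 0.0
```

In the actual method, all three terms share the same DeepONet parameters θ and are differentiated with JAX, so the physics-informed residual shapes the same networks that are fitted to the labeled pairs.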

Invertible DeepONets for Bayesian inverse problems
In this section, we discuss how the invertible DeepONets, proposed and trained as previously described, can be used to efficiently approximate the solution of a Bayesian inverse problem in the presence of potentially noisy observations as well as prior uncertainty about the unknowns. A central role is played by the readily available inverse map which the RealNVP architecture affords. In particular, let ŝ ∈ R^K denote a vector of noisy observations of the PDE-solution at certain K spatio-temporal locations. These are assumed to be related to the PDE-solution's values at these locations, denoted summarily by s ∈ R^K, as ŝ = s + σ z with z ∼ N(0, I_K), where σ² is the variance of the observational noise. This in turn defines a conditional density (likelihood) p(ŝ | s):

p(ŝ | s) = N(ŝ | s, σ² I_K).    (23)

In the context of a Bayesian formulation, and given the implicit dependence of the PDE-output s on u, the likelihood would be combined with a prior density p_u(u) on the PDE-inputs in order to define the sought posterior:

p(u | ŝ) ∝ p(ŝ | s(u)) p_u(u).    (24)

Even if the trained DeepONet were used to infer p(u | ŝ) (e.g. using MCMC), several evaluations would be needed, especially if the dimension of u were high. In the sequel, we demonstrate how one can take advantage of the invertible NN architecture in order to obtain a semi-analytic approximation of the posterior in the form of a mixture of Gaussians, avoiding iterative algorithms like MCMC altogether.
We note first that, by combining the likelihood with Equation (10), we can write it in terms of the D-dimensional, branch-network output vector b as:

p(ŝ | b) = N(ŝ | Y b, σ² I_K).    (25)

Since u ∈ R^D is related to b through the invertible RealNVP map b_NN : R^D → R^D, we can also obtain a prior density p_b(b) on b as:

p_b(b) = p_u(b_NN^{-1}(b)) |det ∂b_NN^{-1}(b)/∂b|,    (26)

where |det ∂b_NN^{-1}(b)/∂b| is the determinant of its Jacobian. The latter, as mentioned in Section 2.1, is a triangular matrix and its determinant can be readily computed at a cost O(D).
We choose not to operate directly with the prior p_b(b), but construct an approximation p_b,G(b) to it in the form of a mixture of D-dimensional Gaussians, as this facilitates the subsequent steps in finding the posterior. In particular:

p_b,G(b) = Σ_{m=1}^{M} w_m N(b | m_b,m, S_b,m),  Σ_{m=1}^{M} w_m = 1,    (27)

where M denotes the number of mixture components and m_b,m, S_b,m the mean vector and covariance matrix of the m-th component, respectively. Such an approximation can be readily computed, e.g. using Variational Inference (Wainwright and Jordan, 2008), without any forward or inverse model evaluations, by exploiting the fact that samples from p_b can be readily drawn using ancestral sampling, i.e. by drawing samples of u from p_u and propagating them through b_NN. We note that finding this representation can become more difficult when M is large, but the complexity of the algorithms involved in general scales linearly with M (Bishop and Nasrabadi, 2006). By combining the (approximate) prior p_b,G(b) above with the Gaussian likelihood p(ŝ | b) of Equation (25), we obtain an expression for the posterior p̃(b | ŝ) using Bayes' theorem:

p̃(b | ŝ) ∝ p(ŝ | b) p_b,G(b).    (28)

Due to the conjugacy of the prior and the likelihood, we can directly conclude that the (approximate) posterior is also a mixture of Gaussians (Bishop and Nasrabadi, 2006). Therefore, using the expressions for the aforementioned likelihood/prior pair, we obtain a closed-form posterior p̃(b | ŝ) on b of the form:

p̃(b | ŝ) = Σ_{m=1}^{M} w̃_m N(b | μ_b,m, C_b,m),    (29)

where the mean μ_b,m and covariance C_b,m of each mixture component can be computed as:

C_b,m = (S_b,m^{-1} + σ^{-2} Y^T Y)^{-1},  μ_b,m = C_b,m (S_b,m^{-1} m_b,m + σ^{-2} Y^T ŝ).    (30)

The weights w̃_m (with Σ_{m=1}^{M} w̃_m = 1) are proportional to:

w̃_m ∝ w_m ℓ_m,    (31)

where:

ℓ_m = N(ŝ | Y m_b,m, σ² I_K + Y S_b,m Y^T).    (32)

Therefore, inference tasks on the sought u can be readily carried out by sampling b from the mixture-of-Gaussians posterior above and propagating those samples through the inverse map b_NN^{-1} to obtain u-samples.
We note that by employing a mixture of Gaussians with sufficient components M , one can approximate with arbitrary accuracy any non-Gaussian density as well as capture multimodal posteriors, a task that is extremely cumbersome with standard, Bayesian inference schemes (Franck and Koutsourelakis, 2017).
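The semi-analytic posterior update described above can be sketched as follows. This is a self-contained NumPy example under assumed toy dimensions; it implements the standard conjugate Gaussian update for each mixture component (posterior means, covariances and re-weighted mixture weights), with a random stand-in for the trunk matrix Y.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K, M, sigma2 = 4, 10, 3, 0.01  # toy dimensions and noise variance

Y = rng.standard_normal((K, D))   # stand-in for the trunk-output matrix

# Gaussian-mixture prior on b: weights w, means m, covariances S
w = np.full(M, 1.0 / M)
m = rng.standard_normal((M, D))
S = np.stack([np.eye(D) for _ in range(M)])

# Synthetic observations from a "true" b near the first component
b_true = m[0] + 0.1 * rng.standard_normal(D)
s_hat = Y @ b_true + np.sqrt(sigma2) * rng.standard_normal(K)

# Conjugate update: each component remains Gaussian
C, mu, log_ev = [], [], []
for j in range(M):
    S_inv = np.linalg.inv(S[j])
    Cj = np.linalg.inv(S_inv + (Y.T @ Y) / sigma2)          # posterior covariance
    mu_j = Cj @ (S_inv @ m[j] + (Y.T @ s_hat) / sigma2)     # posterior mean
    # component evidence N(s_hat | Y m_j, sigma2 I + Y S_j Y^T) for the weights
    cov_s = sigma2 * np.eye(K) + Y @ S[j] @ Y.T
    resid = s_hat - Y @ m[j]
    log_ev.append(-0.5 * (resid @ np.linalg.solve(cov_s, resid)
                          + np.linalg.slogdet(cov_s)[1]))
    C.append(Cj)
    mu.append(mu_j)

# Posterior mixture weights via a numerically stable softmax
log_w = np.log(w) + np.array(log_ev)
w_post = np.exp(log_w - log_w.max())
w_post /= w_post.sum()
assert np.isclose(w_post.sum(), 1.0)
```

Posterior samples of b are then drawn from this mixture and pushed through the analytical inverse of the branch net to obtain u-samples, at no additional PDE-solve cost.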

Numerical Illustrations
We applied the proposed framework to three examples: the anti-derivative operator, a reaction-diffusion PDE and a Darcy-type elliptic PDE. In each of these cases, we report the relative errors of the forward and inverse maps (on test data) when trained with varying amounts of labeled and unlabeled training data. For the reaction-diffusion PDE and the Darcy-type elliptic PDE, we also use the proposed invertible-DeepONet surrogate to solve pertinent Bayesian inverse problems. The code for the aforementioned numerical illustrations is available at https://github.com/pkmtum/Semi-supervised Invertible Neural Operators. In Table 1, we summarize the most important dimensions for each of the following examples, namely D: the dimension of the PDE-input, K: the dimension of the observed PDE-output, N_l: the number of labeled data (Equations (17), (18)), N_u: the number of unlabeled data (e.g. Equation (22)), N_res: the number of interior collocation points (Equation (21)) and N_BC: the number of boundary collocation points (Equation (20)).

Anti-derivative Operator
As a first test case, we considered the anti-derivative operator on the interval ξ ∈ [0, 1], i.e.:

ds(ξ)/dξ = u(ξ),  s(0) = 0,    (33)

where the input u corresponds to the right-hand side of this ODE and the operator G(u) that we attempt to approximate is simply the integral operator G(u)(ξ) = ∫_0^ξ u(t) dt. We generated N_u = 10000 unlabeled training data by sampling inputs u from a Gaussian process with zero mean and an exponentiated quadratic covariance kernel with length scale ℓ = 0.2. Their values at the same D = 100 uniformly-distributed locations in [0, 1] were recorded. We subsequently randomly chose N_res = 200 collocation points at which to evaluate the residuals (see Equation (21)).
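The generation of such unlabeled inputs, and of the corresponding reference anti-derivatives used later for the labeled data, can be sketched as follows. This is a NumPy sketch under the stated settings; the jitter term and the trapezoidal rule for the reference integral are our own choices, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)
D = 100
xi = np.linspace(0.0, 1.0, D)   # D uniformly distributed sensor locations

# Exponentiated quadratic (squared exponential) kernel with length scale 0.2
ell = 0.2
Kmat = np.exp(-0.5 * (xi[:, None] - xi[None, :]) ** 2 / ell**2)
L = np.linalg.cholesky(Kmat + 1e-8 * np.eye(D))  # jitter for numerical stability

def sample_input():
    # one draw u ~ GP(0, k), evaluated at the D sensor locations
    return L @ rng.standard_normal(D)

def antiderivative(u):
    # reference output s(xi) = int_0^xi u(t) dt via the trapezoidal rule
    return np.concatenate([[0.0],
                           np.cumsum(0.5 * (u[1:] + u[:-1]) * np.diff(xi))])

u = sample_input()
s = antiderivative(u)
assert s.shape == (D,) and s[0] == 0.0
```

Unlabeled data keep only the sampled u; labeled pairs additionally store the integrated s at the chosen evaluation points.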
Moreover, we used up to N_l = 10000 labeled training data, for which the inputs were generated as for the unlabeled training data, and the outputs were obtained by solving the ODE above and evaluating the solution at K = 200 randomly chosen points. We trained the invertible DeepONet on N_u = 10000 unlabeled training data with a batch size of 100. To each batch we added 1, 10 or 100 labeled training data points (i.e. N_l = 100, 1000, 10000, respectively, in Equations (17), (18)). A minimum of one labeled datapoint is required in order to set the initial condition correctly, as we did not enforce it separately in the unlabeled loss part. With regards to the architecture of the networks used, we employed an MLP with four layers of 100 neurons each for the trunk network and 6 RealNVP building blocks for the branch network, each parametrized by a two-layered MLP. Variations around these values in the numbers of neurons and layers were also explored (in the subsequent examples as well) and did not significantly impact the performance.
Using the ADAM optimizer and an initial learning rate of 10^-3, we ran the model training for 4 × 10^4 iterations with an exponential learning-rate decay with rate 0.9 every 1000 iterations. As test data, we used 1000 new (i.e. not included in the training data) input-output pairs and compared the predicted forward and inverse solutions with the actual ones. The results obtained in terms of the relative errors are summarized in Table 2.
The error values indicate that both the forward and the inverse maps are very well approximated by the proposed invertible DeepONet. The addition of more labeled training data results in even lower errors, especially for the inverse map, for which the relative error is decreased from ∼4% to ∼2%. In order to visualize the results, we plot for four randomly-chosen test cases the predictions (when trained with 10% labeled data) of both the forward (Figure 2) and inverse (Figure 3) operator. In all cases, the predictions are indistinguishable from the reference functions.

Table 2: Relative test errors and their standard deviations depending on the amount of labeled training data for the anti-derivative operator. The percentage of labeled data is relative to the amount of unlabeled training data, e.g. in the 10% case we used ten times more unlabeled than labeled training data, whereas in the 100% case the amounts of labeled and unlabeled training data were the same.
In Appendix A we include additional results for this problem with varying amounts of unlabeled and labeled training data in order to further show their influence.

Reaction-Diffusion dynamics
The second illustrative example involves the reaction-diffusion equation on the space-time domain ξ = (x, t) ∈ [0, 1] × [0, 1]:

∂s/∂t = D_s ∂²s/∂x² + k s² + u(x).    (34)

Here, D_s = 0.01 is the diffusion constant, k = 0.01 the reaction rate, and the source term u(x) is chosen to be the PDE-input. We used zero initial and boundary conditions. We generated random source terms by sampling from a Gaussian process with zero mean and an exponentiated quadratic covariance kernel with length scale ℓ = 0.2, which were then evaluated at D = 100 uniformly distributed points over [0, 1]. The PDE was subsequently solved using an implicit finite-difference scheme and evaluated at 200 randomly chosen points to generate the labeled training data.
We trained our model with N_u = 5000 unlabeled data, which were processed in batches of 100 samples and to which varying amounts of labeled data were added. Since for this problem the boundary conditions were enforced separately, the amount of labeled training data used could also be zero. All unlabeled training data points were evaluated at N_res = 200 randomly selected collocation points. With regards to the network architecture, we employed an MLP with five layers of 100 neurons each for the trunk network and 3 RealNVP building blocks for the branch network, each parametrized by a three-layered MLP. Using the ADAM optimizer and an initial learning rate of 10^-3, we ran the model training for 12 × 10^4 iterations with an exponential learning-rate decay with rate 0.9 every 2000 iterations. For our test dataset, we generated 1000 new (unseen) source terms u and corresponding solutions s. A summary of the relative errors obtained is contained in Table 3.
We note that again, for all three settings, we achieve very low error rates, which decrease as the amount of labeled training data increases. In Figures 4 and 5 we show the predictions (trained with 500, i.e. 10%, labeled data) of both the forward and inverse maps for three randomly chosen test cases.

labeled data [%]      | 0                 | 10                | 100
relative error for s  | 0.00925 ± 0.00492 | 0.0105 ± 0.00519  | 0.00813 ± 0.00445
relative error for u  | 0.024 ± 0.01021   | 0.0184 ± 0.00578  | 0.0162 ± 0.00592

Table 3: Relative errors on test data depending on the amount of labeled training data for the reaction-diffusion case.

Flow through porous media

In the final example, we considered the Darcy-flow elliptic PDE in the two-dimensional domain ξ = (x_1, x_2) ∈ [0, 1]²:

∇ · (u(ξ) ∇s(ξ)) = 10,    (35)

where the PDE-input u corresponds to the permeability field. We assumed zero values for the solution s along all boundaries, which we incorporated a priori in our operator approximation by multiplying the DeepONet expansion in Equation (7) with the polynomial x_1(1 − x_1) x_2(1 − x_2). We used N_u = 1000 unlabeled training data points with N_res = 3844 collocation points (Equation (21)) during training and added either no labeled training data at all (i.e. N_l = 0) or N_l = 1000 labeled data. In order to obtain the latter, we solved Equation (35) with the Finite Element library FEniCS (Logg et al, 2012) on a 128 × 128 mesh with linear elements and evaluated the solution at 3844 regularly distributed points. We represent the PDE-input u as:

u(ξ) = Σ_{i=1}^{64} c_i φ_i(ξ),    (36)

using 64 feature functions φ_i and corresponding coefficients c. In order to generate the training data, we sampled each of the aforementioned 64 coefficients from a uniform distribution in [0, 1]. In this example, the 64-dimensional vector of the c's serves as the input to the branch network (i.e. D = 64).
With the help of the c's and of Equation (36), one can reconstruct the full permeability field. With regards to the network architecture, we employed an MLP with five layers of 64 neurons each for the trunk network and 3 RealNVP building blocks for the branch network, each parametrized by a three-layered MLP. Using the ADAM optimizer and an initial learning rate of 10^-3, we ran the model training for 10^5 iterations with an exponential learning-rate decay with rate 0.9 every 2000 iterations. We tested the trained model on 2500 unseen test data and obtained the results in Table 4. As in the previous examples, the inclusion of labeled data significantly improves the predictive accuracy of the inverse map. For the case without labeled data, the relative error of the forward map is slightly lower, but the accuracy of the inverse map is comparatively poor; the addition of labeled data substantially improves the latter.
labeled data [%]      | 0                 | 100
relative error for s  | 0.0134 ± 0.00509  | 0.0245 ± 0.0108
relative error for u  | 0.235 ± 0.137     | 0.0566 ± 0.0198

Table 4: Relative errors on test data depending on the amount of labeled training data for the Darcy example with feature coefficients as inputs.
In Figure 6 we compare the reference solution for two illustrative test cases with the predictions of the forward map learned with labeled training data. Consistent with the cumulative results in Table 4, the predictions are very close to the reference. In Figures 7 (without labeled training) and 8 (with labeled training) the results for two illustrative inverse test cases are shown. While the error can be locally significant, the main characteristics of the PDE-input field u are captured.
We discuss in the next section the case where the input permeability field u is not represented with respect to some feature functions but rather as a discretized continuous field.

Coarse-grained (CG) input parameters
In this sub-case, we modeled the permeability field u as an exponentiated (to ensure positivity) Gaussian process with zero mean and exponentiated-quadratic covariance with length scale 0.1. The PDE was again solved on a 128 × 128 FE mesh and the values of the solution s were assumed to be observed at 3844 regularly distributed points. We moreover sub-sampled the generated PDE-input on a regular 8 × 8 grid, and its D = 64 values represented the branch-network input u. We generated N_u = 1000 unlabeled fields u in total and used N_res = 3844 collocation points (Equation (21)) during training. We also trained the model with N_l = 1000 labeled training data. The results obtained can be found in Table 5, where the test data consists of 2500 unseen, discretized permeability fields and their respective solutions. The error rates are computed with respect to the coarse-grained reference input. As in the previous setting, we observe a significant improvement in the accuracy of the inverse map when labeled data are used in training.

labeled data [%]      0                 100
relative error for s  0.0164 ± 0.00712  0.0164 ± 0.00748
relative error for u  0.121 ± 0.041     0.0656 ± 0.0168
Table 5 Relative errors on test data depending on the amount of labeled training data for the Darcy example with coarse-grained input parameters.

In Figure 9 we compare the reference solution for two illustrative test cases with the predictions of the forward map learned with labeled training data. Consistent with the cumulative results in Table 5, the predictions are very close to the reference.
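The generation of such an exponentiated Gaussian-process permeability field and its coarse-grained 8 × 8 representation could be sketched as follows; the grid size is reduced for brevity and the function names are our own:

```python
import numpy as np

def sample_log_permeability(n, length_scale=0.1, seed=0):
    # Zero-mean GP with exponentiated-quadratic covariance on an n x n grid
    # over the unit square; a small jitter keeps the covariance well-conditioned.
    xs = np.linspace(0.0, 1.0, n)
    X1, X2 = np.meshgrid(xs, xs)
    pts = np.column_stack([X1.ravel(), X2.ravel()])
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / length_scale**2) + 1e-8 * np.eye(n * n)
    rng = np.random.default_rng(seed)
    g = rng.multivariate_normal(np.zeros(n * n), K)
    return g.reshape(n, n)

n = 32
u = np.exp(sample_log_permeability(n))   # exponentiation ensures positivity
u_cg = u[:: n // 8, :: n // 8]           # sub-sample on a regular 8 x 8 grid
```

The D = 64 entries of `u_cg` would then serve as the branch-network input.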
In Figures 10 (without labeled training) and 11 (with labeled training) the results for two illustrative inverse test cases are shown. We note again that the main features of the PDE-input's spatial variability are captured, despite the presence of localized errors.

Bayesian Inverse Problems
In this section we demonstrate the utility of the proposed invertible DeepONet in the solution of Bayesian inverse problems, i.e. in obtaining accurate approximations of the posterior without any need for additional reference-model runs nor for costly, asymptotically-exact sampling. For each of the examples considered, only one observed output ŝ was assumed to be given. The variance of the observational noise σ² was assumed to be known, although it could readily be inferred, especially if a conjugate inverse-Gamma prior were used for it. In this manner, any deviations from the actual posterior can be attributed to inaccuracies of the DeepONet-based surrogate. Errors due to the approximation of the prior with a mixture of Gaussians, as in Equation (27), can be made arbitrarily small by increasing the number of mixture components M.
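The closed-form posterior construction can be illustrated in a simplified setting: with a Gaussian-mixture prior and a linear-Gaussian observation model, each mixture component admits a conjugate update and a re-weighting by its marginal likelihood. This toy sketch only stands in for the role of Equations (27)-(29); it is not the paper's exact derivation:

```python
import numpy as np

def gmm_posterior(weights, means, covs, A, y, sigma2):
    # Conjugate update of each Gaussian component under y = A b + noise,
    # noise ~ N(0, sigma2 * I); weights are rescaled by each component's
    # marginal likelihood (evidence) for y.
    new_w, new_m, new_C = [], [], []
    for w, m, C in zip(weights, means, covs):
        S = A @ C @ A.T + sigma2 * np.eye(len(y))   # marginal covariance of y
        K = C @ A.T @ np.linalg.inv(S)              # Kalman-type gain
        r = y - A @ m
        new_m.append(m + K @ r)
        new_C.append(C - K @ A @ C)
        log_ev = -0.5 * (r @ np.linalg.solve(S, r)
                         + np.log(np.linalg.det(S)) + len(y) * np.log(2 * np.pi))
        new_w.append(w * np.exp(log_ev))
    new_w = np.array(new_w)
    return new_w / new_w.sum(), new_m, new_C

# Two-component prior (M = 2) in 2D, observed through A with small noise:
A = np.array([[1.0, 0.0]])
w, m, C = gmm_posterior([0.5, 0.5],
                        [np.zeros(2), np.ones(2)],
                        [np.eye(2), np.eye(2)],
                        A, np.array([1.0]), 0.001)
```

In our setting the invertible branch map plays the role of the observation operator, which is what removes the need for iterative sampling.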

Reaction-Diffusion dynamics
We employed the trained model of the reaction-diffusion system (with 10% labeled training data), in combination with the formulation detailed in section 2.5, for approximating the posterior. We used a prior p_u(u) arising from the discretization of a Gaussian process with zero mean and exponentiated-quadratic covariance kernel with length scale 0.2. For the Gaussian mixture models involved in the prior and subsequently the posterior on b, we used two components, i.e. M = 2 in Equations (27), (29). The results can be seen in the following figures; the obtained posterior encapsulates the true parameter input in all three cases. In Figure 12 we used test cases with 100 observed solution data points for each parameter input and a noise level of σ² = 0.001 (see Equation (23)). In Figure 13 we increased the noise level ten-fold, to σ² = 0.01, and, as expected, so did the posterior uncertainty. In Figure 14 we used σ² = 0.001 but decreased the number of observations of the PDE-solution to 25 points (instead of 100), which, as expected, also led to an increase in posterior uncertainty. Our method can therefore be used as a fast approach to generate an approximate posterior without any need for optimization or MCMC sampling. In Appendix B, we show the excellent agreement of the computed approximate posterior with the actual one, as obtained by costly and time-consuming MCMC simulations.
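The observed dependence of the posterior uncertainty on σ² and on the number of observations follows the familiar conjugate-Gaussian pattern, which the following toy computation illustrates (it is a generic scalar example, not the reaction-diffusion model itself):

```python
import numpy as np

def posterior_variance(prior_var, sigma2, n_obs):
    # Posterior variance of a scalar mean under n_obs i.i.d. Gaussian
    # observations with noise variance sigma2 and a N(0, prior_var) prior:
    # precisions add, so more data or less noise tightens the posterior.
    return 1.0 / (1.0 / prior_var + n_obs / sigma2)

v_base  = posterior_variance(1.0, 0.001, 100)  # sigma2 = 0.001, 100 observations
v_noisy = posterior_variance(1.0, 0.01, 100)   # ten-fold noise increase
v_few   = posterior_variance(1.0, 0.001, 25)   # only 25 observations
```

Both perturbations increase the posterior variance relative to the base case, mirroring Figures 13 and 14.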

Flow through porous media
We also solved a Bayesian inverse problem in the context of the Darcy-type PDE by using our trained model of section 3.3 with added labeled training data. Firstly, we considered permeability fields represented with respect to the 64 known feature functions described in section 3.3. The 64 coefficients c (Equation (36)) represented the sought PDE-inputs and a uniform prior in [0, 1]⁶⁴ was employed. The results in terms of the permeability field u can be seen in the following figures. The obtained posterior is in good agreement with the ground truth, i.e. the PDE-input field used to generate the data with the PDE-solver.
In particular, in Figure 15 we assumed that 3844 observations of the PDE-output were available on a 62 × 62 regular grid. The synthetically generated data was contaminated with Gaussian noise with σ² = 0.001 (see Equation (23)). In Figure 16 we increased the noise level; consequently, the posterior uncertainty was slightly higher, but the posterior mean remained close to the ground truth. In Figure 17 we used σ² = 0.01 but decreased the number of observations by 50%, to 1922. As expected, the posterior uncertainty increased again but still encapsulated the ground truth.
Finally, we considered the case where the PDE-input is represented on a regular 8 × 8 grid, as in section 3.3.1, with the discretized GP described therein serving as the prior. In Figure 18 we compare the ground truth with the posterior mean and standard deviation as obtained from 3844 observations on a 62 × 62 regular grid and a noise level of σ² = 0.01 (see Equation (23)). In Figure 19 we used a lower noise level of σ² = 0.001 and, as expected, the posterior uncertainty was lower and the posterior mean closer to the ground truth. In Figure 20 we again chose the previous noise level but decreased the number of observations by half, to 1922. As expected, the posterior uncertainty increased but still encapsulated the ground truth.

Conclusions
We introduced an invertible DeepONet architecture for constructing data-driven surrogates of PDEs with parametric inputs. The use of the RealNVP architecture in the branch network enables one to obtain simultaneously accurate approximations of both the forward and the inverse map (i.e. from PDE-solution to PDE-input). The latter is particularly useful for deterministic and stochastic (Bayesian), PDE-based inverse problems, for which accurate solutions can be readily obtained once the proposed DeepONet has been trained offline. The training framework can make use of expensive, labeled data (i.e. PDE input-output pairs) as well as inexpensive, unlabeled data (i.e. only PDE-inputs) by incorporating residuals of the governing PDE and its boundary/initial conditions into the loss function. The use of labeled data was generally shown to improve predictive accuracy, especially in terms of the inverse map, a finding that warrants further investigation.
In the case of Bayesian formulations in particular, we showed that the availability of the inverse map can lead to highly-efficient approximations of the sought posterior without the need of additional PDE solves and without any cumbersome sampling (e.g. due to MCMC, SMC) or iterations (e.g. due to SVI).
The performance of the proposed strategy was demonstrated on several PDEs with modest- to high-dimensional parametric inputs, and its efficiency was assessed in terms of the amounts of labeled vs. unlabeled data. Furthermore, the approximate posterior obtained was in very good agreement with the exact posterior computed with the reference solver and MCMC. The accuracy persisted for various levels of noise in the data as well as for different numbers of available observations. We note finally that unbiased estimates with respect to the exact posterior could readily be obtained with Importance Sampling, using the approximate posterior as the importance-sampling density. This would imply additional PDE solves, which we would nevertheless expect to be modest in number given the proximity of the importance-sampling density to the actual posterior.
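The importance-sampling correction mentioned above could be sketched as follows; `log_post` and `log_q` are hypothetical handles to the unnormalized exact posterior (whose evaluation requires PDE solves) and to the approximate posterior density:

```python
import numpy as np

def snis_estimate(f, samples, log_post, log_q):
    # Self-normalized importance sampling: samples drawn from the approximate
    # posterior q are re-weighted by the (unnormalized) exact posterior, so
    # normalizing constants cancel.
    log_w = log_post(samples) - log_q(samples)
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return np.sum(w * f(samples))

# Toy check with known densities: proposal N(0, 1), target N(1, 1);
# the weighted mean should move toward the target mean of 1.
rng = np.random.default_rng(0)
xs = rng.standard_normal(200_000)
log_q = lambda x: -0.5 * x**2
log_post = lambda x: -0.5 * (x - 1.0)**2
est = snis_estimate(lambda x: x, xs, log_post, log_q)
```

The closer the approximate posterior is to the exact one, the lower the weight variance, and hence the fewer PDE solves are needed for a given accuracy.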

A Influence of the amount of data
This section contains additional results obtained for the anti-derivative example with different amounts of training data. We chose exactly the same settings as described in Section 3.1 and varied only the amount of labeled and unlabeled training data. In Figure 21 we plot the relative errors of the forward and inverse maps as a function of the amount of unlabeled training data. The color indicates the amount of labeled training data used, i.e. blue curves correspond to 1% labeled training data, whereas red curves correspond to 100%. We observe that, although the relative errors decrease with the addition of more data, the benefit is more pronounced when labeled data is added.

B Comparison with MCMC
In the main part of this article we showed that the true parameter input is encapsulated by the posterior. In this section we compare the computed approximate posterior with the reference posterior obtained by MCMC. In particular, for two randomly-chosen cases in the reaction-diffusion example, the true posterior was computed using the NUTS sampler from the Blackjax library (Lao and Louf, 2020).