Surrogate-based Bayesian Comparison of Computationally Expensive Models: Application to Microbially Induced Calcite Precipitation

Geochemical processes in subsurface reservoirs affected by microbial activity change the material properties of porous media. This is a complex biogeochemical process in subsurface reservoirs that currently involves strong conceptual uncertainty. This means that several modeling approaches describing the biogeochemical process are plausible, and modelers face the uncertainty of choosing the most appropriate one. Once observation data become available, a rigorous Bayesian model selection accompanied by a Bayesian model justifiability analysis could be employed to choose the most appropriate model, i.e. the one that best describes the underlying physical processes in the light of the available data. However, biogeochemical modeling is computationally very demanding because it conceptualizes different phases, biomass dynamics, geochemistry, precipitation and dissolution in porous media. Therefore, the Bayesian framework cannot be based directly on the full computational models, as this would require too many expensive model evaluations. To circumvent this problem, we suggest performing both Bayesian model selection and justifiability analysis after constructing surrogates for the competing biogeochemical models. Here, we use the arbitrary polynomial chaos expansion. We account for the approximation error in the Bayesian analysis by introducing novel correction factors for the resulting model weights. Thereby, we extend the Bayesian justifiability analysis and assess model similarities for computationally expensive models. We demonstrate the method on a representative scenario for microbially induced calcite precipitation in a porous medium. Our extension of the justifiability analysis provides a suitable approach for the comparison of computationally demanding models and gives insight into the necessary amount of data for reliable model performance.

Keywords: Microbially induced calcite precipitation · Bayesian model selection · Bayesian justifiability analysis · Arbitrary polynomial chaos expansion · Surrogate-based model selection and comparison

1 Introduction

Biogeochemical Processes in Subsurface Porous Media
Biogeochemical processes in porous media are geochemical processes affected by the activity of microbes [Lovley and Chapelle, 1995]. They profoundly impact ecosystems, occur ubiquitously in the subsurface, and are therefore interesting for engineering applications. Some examples of biogeochemical processes that engineers have tried to manipulate are: enhanced recovery of resources as in microbially enhanced oil recovery [e.g. Bachmann et al., 2014; McInerney et al., 2005; Huang et al., 2018], blocking of preferential flow paths by the accumulation of biomass or of minerals precipitated as a result of the microbial metabolism [e.g. Bottero et al., 2013; Suliman et al., 2006], or bioremediation of soils by microbial decomposition of organic pollutants [e.g. Megharaj et al., 2011; Head, 1998; Mulligan and Galvez-Cloutier, 2003].
However, it is challenging to describe these biogeochemical processes in full detail, because many subprocesses interact in a complex manner [Steefel and MacQuarrie, 1996]. Accordingly, they are not easy to control as desired: a full understanding of these processes is necessary to predict or even regulate the outcome. Thus, modeling is a crucial tool to predict the response of systems under certain conditions [Hunter et al., 1998]. Corresponding models are essential for investigating the coupled transport of fluids and reactive substances through porous media and the resulting chemical reactions in the pores [Steefel et al., 2005; MacQuarrie and Mayer, 2005; Xu et al., 2006].
Biogeochemical models are useful, for example, to design, monitor, and evaluate such applications, e.g. to mitigate leakages from a geological gas reservoir into overlying aquifers in advance [e.g. Cuthbert et al., 2013; Nassar et al., 2018; Cunningham et al., 2019; Minto et al., 2019; Landa-Marbán et al., 2020]. Our limited knowledge about the interaction of the processes that govern biogeochemical systems leads to several modeling approaches that differ, e.g., in their level of detail. The uncertainty of choosing between these modeling alternatives is known as conceptual uncertainty.

Conceptual Uncertainty
When modeling an environmental process, we have to make assumptions and simplifications because, usually, the real process is too complex to be represented in full detail. Consequently, one has to deal with various types of uncertainty. Besides input and parameter uncertainty, conceptual uncertainty (uncertainty of model choice) has to be taken into account. If we chose a single model and did not consider possible alternatives, we might strongly underestimate the overall prediction uncertainty because the space of potential models is not sufficiently covered [Enemark et al., 2019;Refsgaard et al., 2012;Rojas et al., 2008].
Many studies have identified conceptual uncertainty as a key source of uncertainty in modeling [e.g. Burnham and Anderson, 2002; Neuman, 2003; Højberg and Refsgaard, 2005; Rojas et al., 2008, 2010; Gupta et al., 2012; Troldborg et al., 2007; Refsgaard et al., 2012; Renard et al., 2010; Schöniger et al., 2015b; Enemark et al., 2019]. These studies suggest treating modeling concepts with different levels of detail and different assumptions as competing hypotheses. Using statistical techniques such as Bayesian model selection (BMS), we can evaluate which model is the most appropriate representation of the system [Raftery, 1995; Wasserman, 2000].
However, two challenges persist. First, it is important to note that there is no existing method that allows quantifying conceptual uncertainty on an absolute level [Nearing and Gupta, 2018; Höge et al., 2019]. Second, biogeochemical modeling, discussed briefly in Section 1.1, is computationally very demanding since it conceptualizes different processes in subsurface porous media. Thus, a direct application of the rigorous probabilistic machinery is not feasible because it requires a high number of model evaluations. In this study, we address the second challenge.

Surrogate Representation of the Underlying Physical Models
In order to ensure feasibility of the probabilistic BMS framework, we will construct surrogate models for each corresponding version of the biogeochemical model. The main goal of a surrogate model is to replicate the behavior of the underlying physical model from a limited set of runs without sacrificing much detail and accuracy. To construct a surrogate, the original model should be evaluated on those parameter sets, out of the various possibilities, that cover the parametric space as well as possible. Considering the very high computational costs of biogeochemical models, where one model evaluation can require days, we need to select an approach that captures the main features of the underlying physical models after a very small number of model evaluations. Following a recent benchmark comparison study by Köppel et al. [2019], we construct the surrogate model using the arbitrary polynomial chaos expansion technique (aPC), which is suitable for our purpose.
In short, the data-driven aPC approach can be seen as a machine learning approach that approximates the model output by its dependence on the model parameters via multivariate polynomials. The data-driven feature of aPC offers complete flexibility in the choice and representation of probability distributions. It requires no approximation of a density function, which usually causes additional uncertainties. Based on the original polynomial chaos expansion introduced by [Wiener, 1938], the aPC constructs surrogate models with the help of an orthonormal polynomial basis. Such a reduction of a full biogeochemical model into a surrogate model offers a path to rigorous stochastic analysis at strongly reduced computational costs.
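As a toy illustration of this data-driven idea, the sketch below builds an orthonormal polynomial basis directly from parameter samples via Gram-Schmidt under the empirical inner product, with no closed-form density required. The gamma-distributed sample set and the chosen degree are arbitrary assumptions for illustration, not taken from this study.

```python
import numpy as np

rng = np.random.default_rng(7)
samples = rng.gamma(shape=2.0, scale=1.0, size=200_000)  # arbitrary parameter samples

def orthonormal_basis(samples, d):
    # Gram-Schmidt on the monomials 1, w, ..., w^d under the empirical
    # inner product <f, g> = mean(f(w) * g(w)) over the sample set.
    V = np.vander(samples, d + 1, increasing=True)  # monomial evaluations
    Q = np.empty_like(V)
    for j in range(d + 1):
        q = V[:, j].copy()
        for i in range(j):
            q -= np.mean(q * Q[:, i]) * Q[:, i]      # remove projections
        Q[:, j] = q / np.sqrt(np.mean(q * q))        # normalize
    return Q

Q = orthonormal_basis(samples, d=3)
G = Q.T @ Q / len(samples)   # empirical Gram matrix, should be the identity
assert np.allclose(G, np.eye(4), atol=1e-6)
```

The orthonormality holds with respect to whatever distribution the samples follow, which is exactly the flexibility described above.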

Two-Stage Bayesian Model Selection Procedure
Bayesian model selection [e.g. Raftery, 1995;Wasserman, 2000] has been used in many fields of research to support the choice between competing models [e.g. Mohammadi et al., 2018;Wöhling et al., 2015;Schöniger et al., 2015a;Hooten and Hobbs, 2015;Parkinson et al., 2006;Cremers, 2002;Brunetti et al., 2020]. It ranks competing physical models based on their quality to represent the available measurement data. To be more specific, BMS employs the Bayesian model evidence (BME) as the score indicating the quality of the model against the available data.
Here, we will consider several models describing biogeochemical processes in subsurface porous media. They contain various assumptions that simplify the modeling procedure. The BME-based ranking follows the principle of parsimony [Schöniger et al., 2014], or rather "Occam's razor", which advises to "choose the simplest one between competing hypotheses" [Jefferys and Berger, 1992]. The work by Schöniger et al. [2015a] uses this property to find a justifiable level of complexity (variability of the model) for modeling a certain quantity of interest.
Following the framework introduced in Schöniger et al. [2015a], we will adopt a two-stage approach for model testing. In the first stage, the classical BMS procedure is used, in which models are tested against measurement data. This procedure is complemented by the second stage, the so-called model justifiability analysis. Here, competing models are tested against each other based on "synthetic true" data instead of measurement data. Based on this analysis, one can diagnose similarities between competing models and identify a suitable model that is "affordable" when only a realistic amount of measurement data is given. A joint interpretation of both stages provides insights that help to find the most appropriate model, which represents the observed system best at acceptable computational costs.

Goals and Paper Structure
The overall aim of this study is to set up a rigorous ranking of computationally expensive biogeochemical models by introducing a surrogate-based two-stage Bayesian model selection procedure. We extend the Bayesian justifiability analysis introduced by Schöniger et al. [2015a]. Our novel correction factor allows the use of surrogate models and thus makes this analysis suitable for computationally demanding models.
Section 2 introduces necessary details on Bayesian updating of the aPC expansion and extends the Bayesian model selection of computationally demanding models to the Bayesian model justifiability analysis introducing novel correction factors. Section 3 introduces the biogeochemical process of microbially induced calcite precipitation (MICP) and the corresponding model set. Section 4 performs Bayesian model selection among MICP models and assesses their similarity using the novel surrogate-based justifiability analysis. Section 5 summarizes the results and gives an outlook for further investigation.

Arbitrary Polynomial Chaos Expansion
We will consider computationally demanding models, for which a straightforward application of the Bayesian model selection procedure is infeasible. Therefore, we will construct so-called surrogate models with negligible computational costs that replicate the behavior of the original physical models via the arbitrary polynomial chaos expansion (aPC) approach. Surrogate models map the model parameters to the model output, capturing the main features of the underlying physical model. In what follows, we present the core idea of the construction of these aPC-based surrogate models.
Let ω = (ω_1, ..., ω_{N_p}) represent the N_p-dimensional vector of model parameters, whereby all parameters in ω are assumed to be independent. Let the model responses be given in the form M_k = f(x, t; ω), where M_k can be some differential equation, a coupled system of differential equations or just a simple function. Moreover, the model parameters can depend on a certain point in space x = (x_1, x_2, x_3) and time t. The model responses M_k can be approximated by a spectral projection of the responses onto an orthogonal polynomial basis:

M_k(x, t; ω) ≈ M̃_k(x, t; ω) = Σ_{s=1}^{D} c_s(x, t) Ψ_s(ω),   (1)

with the polynomials Ψ_s(ω) of the multivariate orthonormal polynomial basis. There are D polynomials needed for the expansion, whereby D is the number of expansion coefficients, dependent on the number of model parameters N_p and the chosen maximum polynomial degree d: D = (N_p + d)!/(N_p! d!) − 1. The coefficients c_s(x, t) depend on space and time since the original model output depends on space and time.
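The number of expansion terms grows quickly with N_p and d. A minimal sketch of the counting rule D = (N_p + d)!/(N_p! d!) − 1 and of enumerating the corresponding multivariate polynomial degrees (following the convention above, which excludes the constant term):

```python
from math import factorial
from itertools import product

def n_terms(n_p: int, d: int) -> int:
    # Number of expansion coefficients D = (N_p + d)! / (N_p! d!) - 1,
    # following the counting convention used in the text.
    return factorial(n_p + d) // (factorial(n_p) * factorial(d)) - 1

def multi_indices(n_p: int, d: int):
    # All multivariate polynomial degrees with total degree <= d,
    # excluding the constant term (consistent with D above).
    return [idx for idx in product(range(d + 1), repeat=n_p)
            if 0 < sum(idx) <= d]

# Example: two parameters, maximum degree 2 -> D = 4!/(2! 2!) - 1 = 5 terms
assert n_terms(2, 2) == 5
assert len(multi_indices(2, 2)) == n_terms(2, 2)
```

Since each coefficient requires at least one model run, this count directly dictates the minimum number of expensive model evaluations.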
To compute the coefficients c_s(x, t) of the polynomial chaos expansion in equation (1), we employ a non-intrusive stochastic collocation method. The non-intrusiveness of this method implies that the model M_k can be considered as a black box, so that no changes to the governing equations of the original model are needed. Using this method, a finite number D of model evaluations is sufficient to determine the coefficients. Using the D evaluations of model M_k on D so-called collocation points ω^{(i)}, i = 1, ..., D, the coefficients can be computed from the following system of equations:

Ψ c = M_k.   (3)

The D × D matrix Ψ contains the basis polynomials, evaluated at the different collocation points, and the vector c of size D × 1 contains the expansion coefficients. The outputs of the model M_k at the different collocation points are collected in the vector M_k of size D × 1. If one aims to compute the surrogate model of M_k for different points in time, it is sufficient to compute the matrix Ψ once for a fixed number of parameters and collocation points and a fixed expansion degree d, since the matrix is space and time independent, unlike the vectors c and M_k. Accordingly, the coefficients are computed from the model output at the collocation points for different points in space and time separately (Matlab code available in Oladyshkin [2020a]).
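The collocation step itself reduces to one linear solve. A minimal, self-contained sketch for a one-dimensional parameter, where a simple monomial basis and a cheap analytic function stand in for the orthonormal aPC basis and the expensive biogeochemical model:

```python
import numpy as np

# Non-intrusive collocation sketch: the original model is treated as a
# black box, evaluated at D collocation points, and the expansion
# coefficients follow from solving Psi @ c = M_k.

def original_model(omega):               # hypothetical "expensive" model
    return np.sin(omega) + 0.5 * omega**2

D = 3                                    # number of expansion terms
colloc = np.array([-0.9, 0.1, 0.8])      # D collocation points (toy choice)

Psi = np.vander(colloc, D, increasing=True)  # D x D basis matrix
M_k = original_model(colloc)                 # D black-box evaluations
c = np.linalg.solve(Psi, M_k)                # expansion coefficients

def surrogate(omega):
    # Evaluate the expansion sum_s c_s * omega**s
    return np.polyval(c[::-1], omega)

# The surrogate reproduces the original model exactly at collocation points.
assert np.allclose(surrogate(colloc), M_k)
```

For several time steps, only the right-hand side M_k changes, so Psi can be reused, mirroring the remark above.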
The solution of the system of equations (3) obviously depends on the choice of the collocation points ω^{(i)}. According to Villadsen and Michelsen [1978], the optimal collocation points are the roots of the univariate polynomials of degree d + 1 used for the construction of the multivariate polynomial basis. Hence, the resulting surrogate model represents the original model exactly at the collocation points, while some "polynomial interpolation" is applied between them, or rather an extrapolation outside of the range of the collocation points [Mohammadi et al., 2018].

Bayesian Updating of the aPC-Based Surrogate Representation
The procedure described in Section 2.1 can be seen as an initial step, whereby the surrogate representation of the original model makes use of the prior distribution of the model parameters and omits the available measurement data. Therefore, the constructed surrogate model M̃_k could be imprecise and may not cover well the region of the parameter space that is relevant in the light of the measurement data (i.e. the posterior region). Using a higher expansion degree to improve the surrogate model globally would increase the computational time excessively.
Therefore, to overcome this issue, we employ an iterative Bayesian updating process of the aPC representation (BaPC) that improves the accuracy of the surrogate by incorporating new collocation points at approximate locations of the maximum a posteriori parameter set [Oladyshkin et al., 2013a]. The idea is to evaluate the surrogate model M̃_k on a high number of parameter realizations, obtained from their prior distribution, and to weigh the points by their posterior probability. As the parameter realization with the highest posterior probability is assumed to lie in the parameter region of interest, the surrogate model should be refined there. According to the BaPC strategy, we evaluate the original model M_k(x, t; ω) at the suggested new collocation point ω corresponding to the maximum a posteriori parameter set and recalculate the expansion coefficients c(x, t) by solving equation (3). The increasing number of collocation points leads to an overdetermined system of equations for the determination of the coefficients, as described in Appendix A. In this way, we iteratively update the aPC representation in equation (1) by incorporating the points where the probability to capture the measurement data is higher. This process is repeated until the surrogate model captures the measurement data sufficiently well, although the number of iterations should be limited to keep the computational costs manageable (Matlab code available in Oladyshkin [2020b]).
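The BaPC loop can be sketched as follows. The one-parameter model, measurement, and likelihood are toy stand-ins chosen for illustration, and the least-squares refit handles the overdetermined system mentioned above:

```python
import numpy as np

# Schematic BaPC iteration: evaluate the surrogate on many prior samples,
# pick the approximate maximum a posteriori point, run the expensive model
# there, and refit the coefficients via least squares.

rng = np.random.default_rng(0)

def model(omega):                       # stand-in for the expensive model
    return omega**2

y_obs, sigma = 0.25, 0.05               # toy measurement and its error

def log_likelihood(pred):
    return -0.5 * ((pred - y_obs) / sigma) ** 2

D = 3
colloc = list(rng.uniform(-1, 1, D))    # initial collocation points

for iteration in range(5):
    # Refit coefficients (overdetermined once points were added)
    Psi = np.vander(np.array(colloc), D, increasing=True)
    c, *_ = np.linalg.lstsq(Psi, model(np.array(colloc)), rcond=None)

    # Cheap surrogate sweep over prior realizations, keep the MAP point
    prior = rng.uniform(-1, 1, 10_000)
    surr = np.vander(prior, D, increasing=True) @ c
    omega_map = prior[np.argmax(log_likelihood(surr))]
    colloc.append(omega_map)            # refine near the posterior mode

# After refinement the surrogate is accurate near the MAP point
# (trivially so here, since the toy model lies in the basis span).
assert abs(np.vander(np.array([omega_map]), D, increasing=True) @ c
           - model(omega_map)) < 1e-2
```

Each iteration costs exactly one additional run of the expensive model, which is the point of the strategy.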
The suggested BaPC framework has shown promising results for computationally very demanding models [e.g. Oladyshkin et al., 2013b;Mohammadi et al., 2018;Beckers et al., 2020] and further details are shown in Oladyshkin et al. [2013a]. Alternatively, other Bayesian strategies can be found in Oladyshkin et al. [2020].

Approximation Quality of aPC-based Surrogate Models
To assess the quality of a constructed surrogate model during the iterative Bayesian updating of an aPC expansion, we will estimate the approximation error in equation (1). Since the stochastic collocation belongs to the family of regression methods, only calculating the error at the collocation points would lead to biased results. Yet, computing the validation error via so-called testing parameter sets to assess the accuracy of the model, trained on the training collocation points, is computationally infeasible.
To remedy this problem, one can use leave-one-out cross validation (LOOCV) as described in Blatman and Sudret [2010] instead. The collocation points are divided P times into two subsets, assuming that the set of collocation points is of size P ≥ D + 1: for the calculation of the coefficients, the collocation points are omitted one after the other. After the coefficients have been determined with the help of the remaining collocation points, the resulting surrogate model is evaluated at the omitted collocation point. Then, the difference to M_k, evaluated at this point, is computed [Blatman and Sudret, 2010]. This is done for all collocation points, and finally the mean value over all squared errors is determined:

Err_LOO = (1/P) Σ_{i=1}^{P} ( M_k(ω^{(i)}) − M̃_{k\ω^{(i)}}(ω^{(i)}) )²,   (4)

where P is the current number of collocation points, M_k(ω^{(i)}) is the model evaluated at the omitted collocation point ω^{(i)}, and M̃_{k\ω^{(i)}}(ω^{(i)}) is the surrogate model constructed without the collocation point ω^{(i)}, evaluated at ω^{(i)}.
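A minimal sketch of this LOOCV procedure, again with a toy model and a monomial basis in place of the aPC surrogate:

```python
import numpy as np

# Leave-one-out cross validation: omit each collocation point in turn,
# refit the coefficients from the remaining points, and average the
# squared error at the omitted point.

def model(omega):                        # toy stand-in for the full model
    return np.exp(omega)

def fit(points, D):
    Psi = np.vander(points, D, increasing=True)
    c, *_ = np.linalg.lstsq(Psi, model(points), rcond=None)
    return c

def loo_error(points, D):
    errs = []
    for i in range(len(points)):
        rest = np.delete(points, i)      # leave point i out
        c = fit(rest, D)
        pred = np.vander(points[i:i + 1], D, increasing=True) @ c
        errs.append((model(points[i]) - pred[0]) ** 2)
    return np.mean(errs)                 # mean squared LOO error

points = np.linspace(-1.0, 1.0, 6)       # P = 6 >= D + 1
err = loo_error(points, D=3)
assert err >= 0.0
```

No additional runs of the expensive model are needed: only the already available collocation-point evaluations are reused.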

Bayesian Model Selection
Bayesian model selection allows ranking models based on their probability to be the data-generating process [e.g. Raftery, 1995; Wasserman, 2000; Höge et al., 2019]. For this ranking, prior model weights P(M_k) are updated to posterior model weights P(M_k|y_0) using Bayes' theorem:

P(M_k | y_0) = p(y_0 | M_k) P(M_k) / Σ_{l=1}^{N_m} p(y_0 | M_l) P(M_l),   (5)

with y_0 being the vector of measurements and P(M_k) the models' prior probability. The prior probability P(M_k) is a subjective estimate by the investigator or the modeler of which model is an exact representation of the data-generating process, without actually knowing the data yet [Raftery, 1995]. Uniformly distributed priors P(M_k) = 1/N_m for N_m competing models are a common choice. The term p(y_0|M_k) is the so-called Bayesian model evidence (BME). The BME value is also known as marginal likelihood, because it can be calculated by averaging (marginalizing) over the parameter space Ω_k of each model [Kass and Raftery, 1995; Schöniger et al., 2014]. The marginalization makes BME independent of the parameter choice, and hence it is a characteristic of the model M_k alone. Accordingly, BME is defined as

p(y_0 | M_k) = ∫_{Ω_k} p(y_0 | M_k, ω) p(ω | M_k) dω,   (6)

where p(ω|M_k) is the model-specific prior distribution of the model parameter vector ω. The likelihood function p(y_0|M_k, ω) quantifies how well the predictions y_k fit the measurement data y_0 and includes assumptions on the measurement error [Raftery, 1995]. Here, we will choose a Gaussian likelihood function with zero mean:

p(y_0 | M_k, ω) = (2π)^{−N_s/2} |R|^{−1/2} exp( −(1/2) (y_0 − y_k(ω))^T R^{−1} (y_0 − y_k(ω)) ),   (7)

where R is the covariance matrix of the measurement error of size N_s × N_s (with data set size N_s), and y_k(ω) is the prediction made by model M_k with the model parameter vector ω.
For most applications, there is no analytical solution of equation (6), and the corresponding integral can be estimated using a brute-force Monte Carlo approach. To perform the Monte Carlo integration, we create a sample set of N_MC realizations of the model parameter vector ω based on its prior distribution p(ω|M_k). With the corresponding likelihood function (7), we obtain the following numerical approximation of the BME value:

p(y_0 | M_k) ≈ (1/N_MC) Σ_{i=1}^{N_MC} p(y_0 | M_k, ω_i),   (8)

where ω_i is the i-th parameter realization for model M_k.
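The Monte Carlo estimate of BME in (8) with the Gaussian likelihood (7) can be sketched as follows; the model, prior, measurements, and error covariance below are illustrative assumptions only:

```python
import numpy as np

# Brute-force Monte Carlo BME: average the Gaussian likelihood over
# N_MC prior parameter realizations.

rng = np.random.default_rng(42)

def model(omega):                        # toy prediction vector (N_s = 2)
    return np.array([omega, omega**2])

y_obs = np.array([0.5, 0.3])             # toy measurement vector
R = np.diag([0.05**2, 0.05**2])          # measurement-error covariance
R_inv, R_det = np.linalg.inv(R), np.linalg.det(R)
N_s = len(y_obs)

def likelihood(pred):
    resid = y_obs - pred
    norm = (2 * np.pi) ** (-N_s / 2) * R_det ** -0.5
    return norm * np.exp(-0.5 * resid @ R_inv @ resid)

N_MC = 20_000
omega = rng.uniform(0.0, 1.0, N_MC)       # samples from the prior
bme = np.mean([likelihood(model(w)) for w in omega])
assert bme > 0.0
```

In the surrogate-based setting, the expensive call to `model` is replaced by the cheap aPC surrogate, which is what makes N_MC of this size affordable.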

aPC-Based Bayesian Model Selection
Since the surrogate representation M̃_k is only an approximation of the original model M_k, surrogate-based BME values could be misleading for the Bayesian model selection procedure. Therefore, conclusions drawn from BME values based on surrogates are only valid to the degree of the approximation quality of the surrogate model. Such falsified values can be avoided by adapting the calculation of the BME value, as proposed in Mohammadi et al. [2018]. We consider that the prediction of the surrogate model M̃_k contains an approximation error E_k, i.e. M_k = M̃_k + E_k. We consider this error to be independent of the measurement error (because the two errors have no interaction), so that the BME value in equation (6) can be rewritten as:

p(y_0 | M_k) = ∫_{Ω_k} p(y_0 | M̃_k, ω) p(M_k | M̃_k, ω) p(ω | M_k) dω,   (9)

where p(M_k|M̃_k, ω) is the likelihood function that indicates how well the original model prediction based on the model parameter realization ω matches the corresponding surrogate model prediction:

p(M_k | M̃_k, ω) = (2π)^{−N_s/2} |S|^{−1/2} exp( −(1/2) (y_k − ỹ_k)^T S^{−1} (y_k − ỹ_k) ),   (10)

with the predictions y_k of the original model and ỹ_k of the surrogate model, and the covariance matrix S of the approximation errors.
Following the derivation in Mohammadi et al. [2018], we obtain the corrected BME value for the original model, computed on the basis of the reduced model:

BME_OM = BME_SM · Weight_SM,   (11)

with BME_OM = p(y_0|M_k), BME_SM = p(y_0|M̃_k) and

Weight_SM = ∫_Ω p(M_k | M̃_k, ω) p(ω | y_0, M̃_k) dω.   (12)

Equation (11) shows clearly how the BME value of the original model (BME_OM) can be calculated from the BME value of the surrogate model (BME_SM), where the BME_SM value can be computed as described in the previous section, using the surrogate model M̃_k instead of the original model M_k.
The correction factor Weight_SM requires an integration over the whole parameter space Ω, and its computation via Monte Carlo integration is not feasible due to the high computational costs of the original model. Therefore, the correction factor is estimated at those collocation points ω* that were used to construct the surrogate model:

Weight_SM ≈ (1/P) Σ_{i=1}^{P} p(M_k | M̃_k, ω*^{(i)}),   (13)

where P is the number of collocation points.
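Numerically, the correction is a simple rescaling of the surrogate-based BME. In the sketch below, the predictions, the error variance, and the BME value are illustrative numbers only; note that Weight_SM is an averaged likelihood density and may therefore exceed one:

```python
import numpy as np

# Sketch of the surrogate-based BME correction: BME_OM = BME_SM * Weight_SM,
# where Weight_SM averages the fidelity likelihood p(M_k | ~M_k, omega*)
# over the P collocation points.

def fidelity_likelihood(y_orig, y_surr, s2=0.01**2):
    # How well the surrogate matches the original model at omega*
    r = y_orig - y_surr
    return (2 * np.pi * s2) ** -0.5 * np.exp(-0.5 * r**2 / s2)

# Original vs. surrogate predictions at the P = 3 collocation points
# (not identical, since the updated surrogate is a regression fit)
y_orig = np.array([1.00, 0.52, 0.31])
y_surr = np.array([0.99, 0.53, 0.30])

weight_sm = np.mean(fidelity_likelihood(y_orig, y_surr))
bme_sm = 2.4e-3                           # surrogate-based BME (toy value)
bme_om = bme_sm * weight_sm               # corrected BME for original model
assert bme_om > 0.0
```

Only the P original-model evaluations already spent on constructing the surrogate are reused, so the correction adds no extra expensive runs.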

Bayesian Model Justifiability Analysis
In order to not only compare the models against the measurement data, Schöniger et al. [2015a] suggested a so-called model justifiability analysis, in which the competing models are tested against each other in a synthetic setup that omits the measurement data. The results of the justifiability analysis can help to decide whether the apparently most appropriate model from the conventional BMS analysis is really the best model in the set, or whether this model is only optimal given the limited amount of available measurement data [Schöniger et al., 2015a]. Additionally, the justifiability analysis provides insights about similarities among the tested models.
To perform the justifiability analysis, we will generate the so-called model confusion matrix [Schöniger et al., 2015a] that is typically used in the field of statistical classification [e.g., Alpaydin, 2004]. Confusion matrices compare the actual and the predicted classification, visualizing whether an object is misclassified ("confused"). In that way, we can recognize whether a model is able to distinguish its own predictions from the ones of its competitors. To do so, we calculate the Bayesian model weights for all models adopting equation (5).
However, instead of using the measurement data y_0, each of the competing models generates a finite series of prior predictions that serve as realizations of the "synthetic truth". Thus, we generate N_MC synthetic data sets for each model based on samples of its prior parameter distribution. Then, each synthetic data set is compared to the competing models by first computing the likelihood function as described in equation (7), e.g. of a single realization i of model M_k given the data set j of model M_l. The BME value p(M_{l,j}|M_k) of model M_k given this single data set j of model M_l is then obtained as the mean of these likelihoods over all realizations of model M_k. The resulting model confusion matrix has the size N_m × N_m for N_m competing models.
To execute both steps of model testing ((1) BMS testing against measurements and (2) justifiability analysis testing models against each other) simultaneously, we add the measurement data to our model set, i.e. we add it as a new row and column to the confusion matrix.
A schematic illustration of its construction is given in Figure 1, whereby the model confusion matrix is extended by the standard BMS procedure.
The blue box in Figure 1 represents a standard BMS procedure, where model M_k is tested against the measurement data. This entry can be obtained from (6), using Monte Carlo integration for p(y_0|M_k) as in (8). The green box in Figure 1 reflects the likelihood of a single realization of model M_k given a single data set j of the reference model M_l, which currently serves as the synthetic truth. The orange box in Figure 1 shows the average likelihood (BME) of model M_k given this data set of the reference model M_l. This BME value is normalized by the sum of the BME values of all models given the same data set of the synthetic truth (red box), yielding a posterior model weight p(M_k|M_{l,j}) with respect to the reference model M_l. The bold boxes in Figure 1 illustrate these posterior weights averaged over all synthetic data sets of the reference model M_l. The bold boxes of one column thus contain the expected posterior weights (PW) of all models given that model M_l is true. One entry can be computed as follows:

PW(M_k | M_l) = (1/N_MC) Σ_{j=1}^{N_MC} p(M_{l,j} | M_k) P(M_k) / Σ_{m=1}^{N_m} p(M_{l,j} | M_m) P(M_m),   (16)

whereby the averaged BME value in (16) is not normalized for the sake of readability.
The resulting extended model confusion matrix consists only of these entries, i.e. the bold boxes and therefore has the size (N m + 1) × (N m + 1), for N m competing models and the measurement data.
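The construction of the (uncorrected) confusion-matrix entries can be sketched with two toy one-parameter models standing in for the competing biogeochemical models; all distributions, sizes, and the scalar data format are illustrative assumptions:

```python
import numpy as np

# Model confusion matrix sketch: each reference model generates synthetic
# data sets from its prior; every model's BME against each data set is
# normalized into posterior weights and averaged per reference model.

rng = np.random.default_rng(1)
models = [lambda w: w, lambda w: w**2]      # toy competing models
N_m, N_MC, N_prior, sigma = len(models), 500, 200, 0.1

def likelihoods(data, preds):
    # Gaussian likelihood of one synthetic data point for many predictions
    return np.exp(-0.5 * ((data - preds) / sigma) ** 2)

PW = np.zeros((N_m, N_m))                   # rows: tested, columns: reference
for l, ref in enumerate(models):            # reference model = synthetic truth
    for j in range(N_MC):
        data = ref(rng.uniform(0, 1)) + rng.normal(0, sigma)
        bmes = np.array([
            likelihoods(data, m(rng.uniform(0, 1, N_prior))).mean()
            for m in models                 # BME of each model, data set j
        ])
        PW[:, l] += bmes / bmes.sum()       # posterior weights, uniform prior
PW /= N_MC

# Each column of expected posterior weights sums to one.
assert np.allclose(PW.sum(axis=0), 1.0)
```

The diagonal of PW then shows how well each toy model recognizes its own data, mirroring the interpretation of the confusion matrix described above.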
The main diagonal entries reflect how well each model identifies itself as the data-generating process, given a certain data set size. With an infinite data set size, the diagonal entries should be equal to 1.00. However, for finite data sets, models might "confuse" their own predictions (misclassification) with those of competing models due to the following two reasons. (1) Two models are actually highly similar.
(2) One model has a high goodness-of-fit to the reference data, but also a high variability in its predictions. The BMS framework punishes this high variability with a lower model weight. Thus, a highly variable model with a good fit and a less variable model with a worse fit might receive similar model weights. When more synthetic data is used, the more variable model will receive a higher weight, as its variability becomes more justifiable, while the weight of the less variable model will decrease [Höge et al., 2018, 2019].
The off-diagonal entries of the model confusion matrix reflect the similarity between pairs of models. This can be useful when comparing possible simplifications to a detailed reference model [Schäfer Rodrigues Silva et al., 2020]. With the aid of the model confusion matrix, it is possible to identify the model that yields results that are most similar to the reference model, at reduced computational costs.

aPC-Based Bayesian Model Justifiability Analysis
We will combine the methodologies from Sections 2.5 and 2.6 into an aPC-based Bayesian model justifiability analysis, where models are mutually tested against each other. To do so, we consider two models, model M_k and model M_l. The comparison of two models implies that one model, M_l in this case, is assumed to be the data-generating process. Instead of computing the BME value p(M_l|M_k) for the original models, we have to calculate the BME value p(M̃_l|M̃_k) of the surrogate models. Similar to Section 2.5, we assume that each surrogate representation of each analyzed model contains an approximation error: M_k = M̃_k + E_k and M_l = M̃_l + E_l. Therefore, equation (11) can be rewritten as:

p(M_l | M_k) = p(M_l | M̃_k) · Weight_SM2.   (17)

In the next step, we focus on the term p(M_l | M̃_k); considering M_l = M̃_l + E_l leads us to

p(M_l | M̃_k) = ∫_Ω p(M_l | M̃_l, ω) p(M̃_l, ω | M̃_k) dω.   (18)

Multiplying and dividing the right-hand side of (18) by p(M̃_l | M̃_k) and applying Bayes' theorem yields

p(M_l | M̃_k) = p(M̃_l | M̃_k) ∫_Ω p(M_l | M̃_l, ω) p(ω | M̃_l, M̃_k) dω = p(M̃_l | M̃_k) · Weight_SM1.   (19)

When inserting (19) into (17), we obtain

p(M_l | M_k) = p(M̃_l | M̃_k) · Weight_SM1 · Weight_SM2,   (20)

or

BME_OMOM = BME_SMSM · Weight_SM1 · Weight_SM2,   (21)

with

BME_OMOM = p(M_l | M_k),  BME_SMSM = p(M̃_l | M̃_k),   (22)

whereby BME_OMOM corresponds to the BME value when comparing two original models and BME_SMSM to the BME value when comparing two surrogate models. The value of BME_SMSM can be computed in the same way as proposed in (6) via Monte Carlo integration as in (8), with the likelihood function defined in (7), using the prediction of model M̃_l evaluated on a certain model parameter vector ω instead of the measurement data y_0. The collocation points ω* can again be employed, similarly to Section 2.5, to compute the correction factors for both models:

Weight_SM1 ≈ (1/P) Σ_{i=1}^{P} p(M_l | M̃_l, ω*^{(i)}),  Weight_SM2 ≈ (1/P) Σ_{i=1}^{P} p(M_k | M̃_k, ω*^{(i)}).   (23)

Moreover, since the model confusion matrix in the Bayesian model justifiability framework also compares the original models, we have to account for the approximation of these models by the surrogates. As the weights Weight_SM1 and Weight_SM2 do not depend on a single parameter realization, the overall posterior weights of the model confusion matrix can be corrected in the same way as the BME values.
To this end, the posterior weights (PW) of the model confusion matrix from equation (16) need to be multiplied with the two correction factors Weight_SM1 and Weight_SM2 from (23):

PW_corrected = PW · Weight_SM1 · Weight_SM2,   (24)

where SM1 = M̃_l and SM2 = M̃_k.

Microbially Induced Calcite Precipitation
Microbially induced calcite precipitation (MICP) is a typical biogeochemical process. When conceptualizing MICP in porous media, various phases are involved: there are at least three solid phases (biofilm, calcite and unreactive solid material), water and possibly another fluid phase, e.g. gas. Additionally, at least calcium, inorganic carbon, and urea are considered as dissolved components in the water phase; the complete list of components can be found in Hommel et al. [2015]. MICP is a reactive transport process consisting of three main parts: (1) adhesion of biomass on surfaces, detachment of the biomass from the biofilm, as well as growth and decay of the biomass, (2) urea hydrolysis that alters the geochemistry, and (3) precipitation and dissolution of calcite.
A visualization of the MICP process is shown in Figure 2. S. pasteurii are bacteria that are able to produce the enzyme urease and, with its aid, to decompose urea into carbonic acid and ammonia.
In aqueous solution, the ammonia reacts with the contained H + ions. As a result, the pH value increases so that the carbonic acid decomposes into H + ions and carbonate ions, while the concentration of dissolved carbonate increases.
If calcium ions are provided, they react with the carbonate ions and calcite precipitates.
In summary, this leads to the following overall MICP reaction equation [Hommel et al., 2015]:

CO(NH₂)₂ + 2 H₂O + Ca²⁺ → 2 NH₄⁺ + CaCO₃ ↓

Experimental Setup
The analyzed MICP experiment is described in detail in Hommel et al. [2015] (there, see experiment "D1"). It uses a sand-filled column that is 61 cm high with a diameter of 2.54 cm. At the beginning of the experiment, bacteria are injected at the bottom of the column until a sufficient amount has accumulated and a biofilm is established. Then, the biofilm is fed once again (bacteria are injected again). From then on, calcium and urea are injected repeatedly every 24 hours, which allows the mineralization reactions to take place. That period is followed by another injection of biomass to revive the microorganisms (or rather to feed them once again) [Hommel et al., 2015], before the next injections of calcium and urea start over. The full model and experiment development after Cunningham et al. [2019] is shown in Figure 3.
The models predict the calcium concentration and the calcite content over space and time (Section 3.3). The predictions are compared to measurement data as well as among each other. In order to obtain comparable results, only those spatial and temporal points where measurement data is available are used when comparing the models among each other. These data points differ for calcium and calcite. For the calcite content, measurement data is only available at the end of the experiment, which is after 3 203 460 seconds (about 890 hours or 37 days). The calcium concentration is measured at 35 different points in time: calcium is injected at 6 "main points" in time, the so-called pulses, namely after 151.35, 218.85, 290.85, 626.85, 698.85 and 866.85 hours. At these points, the concentration is measured, and additionally after half an hour, one, two, three and four hours, except for one pulse, where no measurement is available after 3 hours. This results in 35 temporal points. For the calcite content, there are eight spatial measurement points located at 3.81, 11.43, 19.05, 26.67, 34.29, 41.91, 49.53 and 57.15 cm distance from the bottom. For the calcium concentration, there are only five spatial measurement points located at 10.16, 20.32, 30.48, 39.37 and 49.53 cm distance from the bottom. The measurement locations in the models are evenly distributed at a respective distance of half an inch (1.27 cm).
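Assuming the pulse and offset scheme above, the 35 calcium measurement times can be tabulated in a few lines; which pulse misses its 3-hour sample is an assumption made here for illustration:

```python
# Six injection pulses (hours) and per-pulse sampling offsets (hours), as
# described in the experiment. Which pulse misses its 3-hour sample is an
# assumption here (the second pulse is used for illustration).
pulses = [151.35, 218.85, 290.85, 626.85, 698.85, 866.85]
offsets = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]

times = sorted(p + o for p in pulses for o in offsets)
times.remove(pulses[1] + 3.0)  # the one missing 3-hour measurement

assert len(times) == 35  # 6 pulses x 6 samples - 1 missing
```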

Conceptual Models and Related Uncertainty
We analyze three models for MICP that describe biogeochemical processes in porous media, provided by Hommel et al. [2015, 2016]. For a detailed explanation of their equations and the numerical schemes used, we refer to those original publications.
An Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80 GHz machine with 40 cores was used for the model evaluations. The computational effort for the most detailed MICP model, referred to as the full complexity model, is extremely high, with a run time between 16 and 42 hours depending on the respective model parameter set; the exact cost varies with the chosen parameters, since the time stepping adapts during the simulation. Therefore, Hommel et al. [2015] suggest two simplifications of the full complexity model M FC based on certain physical assumptions:

- initial biofilm model (M IB ): the suspended biomass is ignored and the biofilm is already established at the beginning of the experiment.
- simple chemistry model (M SC ): all the urea injected into the system is assumed to precipitate as calcite. The precipitation occurs immediately, as described in the overall reaction equation (R1) [Hommel et al., 2016].

As described in Section 3.2, the experiment starts with a biomass injection and a waiting period until the biofilm is established. The initial biofilm model M IB omits this part of the simulation under the assumption that the biofilm is already established at the beginning of the experiment, so the attachment periods of the biomass are not simulated [Hommel et al., 2016]. The simple chemistry model M SC simplifies the reactions of urea: it assumes that all the urea put into the system reacts completely to calcite. Therefore, there is no need for computing the ureolysis rate and the precipitation rate, nor the expensive-to-calculate saturation state and the carbonate and calcium activities [Hommel et al., 2016]. The computational time of the initial biofilm model M IB still remains high and is only slightly lower than for the full complexity model on the same computational cluster. The strong assumptions in the simple chemistry model M SC allow one model run to finish after about 40 minutes on the same computational cluster.
Apart from decreasing the computational cost, model simplification reduces parametric uncertainty. A too detailed (too complex) model with many parameters and without enough calibration data, and therefore with high parametric uncertainty, results in a high predictive variance (i.e. uncertainty) of the model. Models should generally be "as simple as possible, as complex as necessary" (principle of parsimony) [Höge et al., 2018] to prevent overfitting [e.g. Babu, 2011; Lever et al., 2016]. The parameters considered in the following were previously identified as the most uncertain parameters of the MICP models in Hommel et al. [2015]:

- the coefficient for preferential attachment to biomass c a,1 [s⁻¹]
- the coefficient for attachment to arbitrary surfaces c a,2 [s⁻¹]
- the mass density of biofilm ρ f [kg/m³]
- the enzyme content of biomass k ub [kg/kg]
As the initial biofilm model M IB assumes that there are no attachment periods, it only depends on the model parameters ρ f and k ub . The full complexity model M FC and the simple chemistry model M SC both depend on all four model parameters. Following the physically possible range of the considered uncertain parameters, we assume that all model parameters are uniformly distributed in the intervals shown in Table 2.

Table 2: Prior intervals of the uncertain model parameters.

  parameter | interval
  c a,1     | [1 · 10⁻¹⁰ s⁻¹, 1 · 10⁻⁷ s⁻¹]
  c a,2     | [1 · 10⁻¹⁰ s⁻¹, 1 · 10⁻⁶ s⁻¹]
  ρ f       | [1 kg/m³, 15 kg/m³]
  k ub      | [1 · 10⁻⁵ kg/kg, 5 · 10⁻⁴ kg/kg]

We construct three surrogate models for the three competing MICP models described in Section 3.3, using an aPC expansion of order d = 2 according to the prior distributions presented in Table 2. For this purpose, each of the three original models is evaluated D = (N p + d)!/(N p ! d!) − 1 times according to Section 2.1. Since the D evaluations for the construction of the surrogate models are independent, these model runs were parallelized. Further, we refine each of the three surrogates using iterative Bayesian updating of the aPC representation according to Section 2.2. Here, we restrict the number of Bayesian updates to ten due to the high computational demand and previous experience (see e.g. [Beckers et al., 2020]), so that P end = D + 10 = (N p + d)!/(N p ! d!) − 1 + 10. During the Bayesian updating, we consider the standard deviation of measurement errors at each point in space (and time) equal to 20% of the associated measurement value, for both the calcite content and the calcium concentration.
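The number of training runs D follows directly from the number of terms in a total-degree aPC expansion; a small helper (a sketch mirroring the formula above, not the paper's code) confirms the counts for the MICP setting:

```python
from math import comb

def n_training_runs(n_params: int, degree: int) -> int:
    """Number of model evaluations D = (Np + d)! / (Np! d!) - 1 needed to fit
    a total-degree-d aPC expansion in Np parameters (constant term excluded).
    The binomial coefficient counts the multivariate monomials up to degree d."""
    return comb(n_params + degree, degree) - 1

# MICP setting: Np = 4 uncertain parameters, expansion order d = 2.
D = n_training_runs(4, 2)  # -> 14 evaluations per model
P_end = D + 10             # plus ten Bayesian updating steps -> 24
```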

aPC-Based Representation of MICP Models
Equation (4) provides errors of the surrogate models for every point in space and time, due to the structure of equation (1). As every point in space and time has its own surrogate model, there are 5 · 35 · 10 = 1750 LOOCV errors (5 spatial points, 35 temporal points, 10 updating steps) computed for calcium and 8 · 10 = 80 for calcite (8 spatial points, 10 updating steps) in the analyzed setup. The LOOCV error is computed after the initial construction of the surrogate models and during the iterative Bayesian updating. In order to visualize the errors, we average the respective values over space (and time) after every updating step. In order to compare the LOOCV errors of the surrogate models for calcium and calcite, relative errors must be considered, since the two quantities of interest (calcite content [%] and calcium concentration [mol/m³]) are of different orders of magnitude. For this purpose, they were normalized by the mean output value, as shown in Figure 4. The relative mean LOOCV errors before the first update are omitted from this figure for better visualization, since they are significantly higher than the errors after the updates. First of all, the figure shows that the surrogate error for calcite decreases more strongly than the error for calcium. It is also remarkable that the calcite error is of a similar order of magnitude for all models, meaning that all surrogate models are of comparable quality for the calcite content. For calcium, the error of the simple chemistry model M SC is significantly larger than that of the other two surrogate models. This can occur when Bayesian updating is used to improve the surrogate only in the region of the measurement data: the surrogate model is then close to the original one near the measurement data, but deviates strongly from the original model elsewhere, which results in a higher overall LOOCV error.
The larger error of this surrogate model is compensated later by the correction factor newly introduced in Section 2.5. Furthermore, the errors for calcite lie in the range [2 · 10⁻⁵, 6 · 10⁻⁵] after the last update, while those for calcium lie in the range [4 · 10⁻³, 4 · 10⁻¹]. Accordingly, the worst surrogate response for calcite is still better than the best one for calcium. This indicates that, with respect to the LOOCV error, the surrogate models for the calcite content are better as a whole than those for the calcium concentration.
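For a surrogate that is linear in its coefficients, such as a polynomial chaos expansion, the LOOCV error can be computed without refitting, using the closed-form leave-one-out residuals; the following is a generic sketch (not the paper's code):

```python
import numpy as np

def loocv_error(Phi, y):
    """Leave-one-out CV error of a least-squares surrogate y ~ Phi @ c.

    Phi : (n, p) design matrix of basis polynomials evaluated at n samples.
    y   : (n,) model outputs at those samples.
    Uses the closed-form LOO residuals e_i = r_i / (1 - h_ii), where h_ii are
    the leverages (diagonal of the hat matrix), so no model refits are needed.
    """
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    r = y - Phi @ c
    # Leverages: diagonal of Phi (Phi^T Phi)^-1 Phi^T.
    h = np.einsum('ij,ji->i', Phi, np.linalg.pinv(Phi.T @ Phi) @ Phi.T)
    loo = r / (1.0 - h)
    return np.mean(loo ** 2)
```

In the paper's setting, one such error would be computed per surrogate, i.e. per point in space and time and per updating step, and then averaged for visualization.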

aPC-Based Justifiability Analysis for MICP Models
We will perform the aPC-based Bayesian model selection incorporating the measurement data and the aPC-based Bayesian model justifiability analysis according to Sections 2.5 and 2.7, using the surrogate representations of the three analyzed MICP models obtained in Section 4.1. Following the justifiability analysis, we compute the model weights as stated in Section 2.6 and adjust them with the novel correction factors from Sections 2.5 and 2.7 in a second stage. In order to justify the underlying physical assumptions behind the MICP models, we assess the impact of the data set size on the BME values appearing in the Bayesian justifiability analysis. To do so, we start with only one spatial data point, then use half of the available data set, and finally include all spatial data points for calcium and calcite. This results in data set sizes N D,spatial ∈ {1, 3, 5} for calcium and N D,spatial ∈ {1, 4, 8} for calcite.

aPCE-Based BMS and Justifiability Analysis
In a first stage, the conventional BMS analysis against the measurement data is performed, with results illustrated in Figure 5. One can observe that the simple chemistry model M SC obtains the highest model weight (normalized BME value) for all data set sizes. A model wins the competition either because of its low complexity or because of its goodness-of-fit to the measurement data (or both) [Schöniger et al., 2015a]. These two aspects are further investigated in a second stage, the justifiability analysis. Figure 6 shows the corresponding model confusion matrices for both the calcium concentration and the calcite content predictions. Each entry corresponds to the weight of one model, i.e. the probability that model M k (rows) is the data-generating process of the predictions made by model M l (columns) according to Bayes' theorem. The main-diagonal entries of the model confusion matrix in Figure 6 represent the models' ability to identify their own predictions: the higher the value of a main-diagonal entry, the higher the probability that the model identifies itself as the data-generating process. The diagonal values increase when a bigger data set is used, agreeing well with the Bayesian justifiability analysis discussed in [Schöniger et al., 2015a]. The diagonal weight of the simplest model, the simple chemistry model M SC , is always the highest, independent of the data set size, which shows that the analysis identifies this model as data-generating even when the data set is large, despite the model's strong assumptions. For both calcium and calcite, the diagonal entries achieve the "absolute majority" of more than 0.50 in favor of justifiability (except for the initial biofilm model M IB for calcite) when the full data set is taken into account. This means that the data set size is sufficient to justify the modeling concepts behind the considered models.
But even for the full data set, the full complexity model M FC obtains a high weight when the initial biofilm model M IB generates the data, and vice versa. It follows that the initial biofilm model M IB and the full complexity model M FC confuse their predictions and are not confident in identifying their own predictions (the initial biofilm model M IB for calcite is not even able to identify itself). Only for the simple chemistry model M SC is the weight 1.00, so its simplicity is perfectly supported by the full data set. The measurement data (MD) obtains a model weight of 1.00 for the full data set as well, since it is clearly able to identify itself. The weights for the models with the measurement data as the data-generating process are strikingly low. In statistical terms, this means that all models are clearly rejected by the full data set. This fits the conclusion drawn in Hommel et al. [2015] that there is at least one relevant process not yet implemented in "sufficient detail" that would be necessary for better results.
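A brute-force sketch of the underlying computation, assuming an independent Gaussian likelihood as in equation (7): the BME is estimated as the average likelihood over prior samples, and the weights are normalized BME values. Function and variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def bme_monte_carlo(model_preds, y_obs, sigma):
    """Log-BME estimate: average Gaussian likelihood over prior samples.

    model_preds : (n_samples, n_data) model outputs at prior parameter draws.
    y_obs       : (n_data,) observations, or another model's prediction, as in
                  the justifiability analysis.
    sigma       : (n_data,) measurement-error standard deviations.
    """
    resid = (model_preds - y_obs) / sigma
    log_like = (-0.5 * np.sum(resid ** 2, axis=1)
                - 0.5 * len(y_obs) * np.log(2 * np.pi) - np.sum(np.log(sigma)))
    m = log_like.max()                      # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_like - m)))

def model_weights(log_bmes):
    """Normalized model weights from log-BME values (uniform model priors)."""
    b = np.exp(log_bmes - np.max(log_bmes))
    return b / b.sum()
```

Entries of the confusion matrix follow by feeding one model's prediction as `y_obs` for the others and normalizing the resulting BME values.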

How Much Data Do We Need?
The matrices on the left in Figure 6 show that considering only one spatial data point is not sufficient, since the diagonal entries for calcite and calcium are all less than 0.50, except for the measurement data of the calcium concentration. This means that there is no "absolute majority" in favor of justifiability for any model, and even the measurement data of the calcite content is not able to identify itself (which is unsurprising, since there is clearly variance between the measurements at different spatial data points). The matrices also show that the simplest model M SC obtains the highest weight of all three models when the data set size is small (principle of parsimony).
When using half of the data set, the simplest model M SC and the most complex model M FC receive an absolute majority for calcium, with model weights of 0.63 and 0.52, while this data set size does not suffice for self-identification of the initial biofilm model M IB . For the calcite content, the diagonal weight of M IB increases with increasing data set size, but it never exceeds 0.5. In contrast, the weight of M IB for the calcium concentration reaches the absolute majority with the full data set, which means that there the data set size is sufficient for self-identification and the physical model assumptions leading to the simplifications are justifiable.
Let us now have a closer look at the main-diagonal entries of the model confusion matrix ("self-identification weights") over an increasing data set size in Figure 7.
It shows that for the simplest model M SC and, clearly, for the measurement data, perfect justification (a model weight of 1.00) is achieved very quickly. For the initial biofilm model M IB and the full complexity model M FC , a larger data set is required to justify their complexity. Since the weights for the more complex models do not stagnate at some point, we do not expect that a much larger data set would be required to justify their complexity.

Fig. 7: Average model weights for the data-generating process of the two quantities of interest (calcite content and calcium concentration) of the three models and the measurement data (MD) over an increasing amount of used spatial data points N D,spatial .
When comparing both quantities of interest for the same data set size, the data-generating process for the calcite content is always identified with less confidence (i.e. obtains a lower weight) than for calcium.

How Similar are the Models?
Now we assess the similarities between the different models by looking at the off-diagonal entries in Figure 6. For a single data point, we can clearly see that the models "confuse" their predictions, as the off-diagonal weights are relatively high. When the initial biofilm model M IB or the full complexity model M FC is the data-generating process for the calcite content, the weights for the other models are even larger than the main-diagonal entry. With increasing data set size, the dissimilarities between the models become more significant, but only for the calcium concentration. In contrast, the model confusion remains for the calcite predictions, i.e. the current data set size does not yield a clearer distinction between these models. However, using the full data set, the model confusion decreases significantly; only the similarity between the initial biofilm model M IB and the full complexity model M FC remains clearly visible. For both calcite and calcium, M IB and M FC are similar, since each obtains a relatively high weight when the other one generated the data. Looking only at the calcite content shows that even when the initial biofilm model M IB is the data-generating process, the full complexity model M FC obtains a higher weight, which means that the model cannot be justified with this data set size [Schöniger et al., 2015a].

How Good Do the Models Fit the Data?
In a last step, we analyze the goodness-of-fit of the models to the measurement data. Figure 8 shows the root mean squared errors (RMSE) between the different model outputs and the measurement data, averaged over all model outputs evaluated at P different collocation points:

RMSE k = 1/P · Σ_{i=1}^{P} sqrt( 1/N s · Σ_{j=1}^{N s} ( y 0,j − M k,j (ω^(i)) )² ),

with y 0,j the entry at position j of the measurement vector of total length N s and M k,j (ω^(i)) the output of model M k at position j evaluated at collocation point ω^(i). The RMSE values for different predictions of the same model (evaluations at different collocation points) were averaged to obtain one representative value per model. For all models, the mean RMSE is almost identical; in both cases, it is smallest for the simple chemistry model M SC (the error for calcium is higher because its output magnitude is much larger than for calcite). With regard to the BMS analysis, this shows that the small BMS weights of the initial biofilm model M IB and the full complexity model M FC stem from an only slightly better goodness-of-fit, while these models are much more complex than the simple chemistry model M SC . Remember that a more complex model needs a significantly better goodness-of-fit to justify its complexity [Schöniger et al., 2015a] (and to achieve a weight similar to that of a simpler model). Furthermore, it is interesting that the weight of the initial biofilm model M IB is smaller than that of the full complexity model M FC for the same data set size, although the full complexity model M FC is slightly more complex. Therefore, the high computational effort of the initial biofilm model M IB is not justified.
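The averaging described above can be sketched in a few lines, assuming the model outputs for the P collocation points are stacked row-wise (illustrative names):

```python
import numpy as np

def mean_rmse(preds, y_obs):
    """Average RMSE over P collocation-point evaluations of one model.

    preds : (P, N_s) model outputs, one row per collocation point.
    y_obs : (N_s,) measurements at the N_s spatial/temporal positions.
    """
    # RMSE per evaluation, then average over the P evaluations.
    rmse_per_run = np.sqrt(np.mean((preds - y_obs) ** 2, axis=1))
    return rmse_per_run.mean()
```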

Conclusions
Combining the insights from the Bayesian model justifiability analysis and the goodness-of-fit analysis, we draw the following conclusions about the initial biofilm model M IB and the simple chemistry model M SC as simplifications of the full complexity model M FC . The full complexity model M FC provides moderate BME values in the BMS analysis and does not use its full potential according to the Bayesian model justifiability analysis. Additionally, M FC provides an unsatisfactory goodness-of-fit to the measurement data and cannot capture the underlying physical process reasonably well. The simple chemistry model M SC obtains a weight of 1.00 for both calcite and calcium in the BMS analysis and in the Bayesian model justifiability analysis (Figures 6 and 7) with the full data set. Therefore, the simple chemistry model M SC uses its full potential to represent the data and captures the response of the underlying physical system appropriately.

Summary and Conclusions
Bayesian model selection (BMS) can not only be used for ranking models based on their goodness-of-fit to measurement data and parsimony, but also for quantifying similarities among models. This work introduces a surrogate-based Bayesian model justifiability analysis for analyzing microbially induced calcite precipitation models in porous media. The suggested framework offers a rigorous pathway to address so-called conceptual uncertainty, i.e. the question of which model is best suited to describe the underlying physical system. The justifiability analysis compares the models among each other and with the available measurement data.
Applying the justifiability analysis in addition to the BMS analysis yields better insight into why a model wins the BMS ranking: either because it really fits the measurement data best, or only because the data set size is too small to identify a more complex model that actually fits better. Thus, the current best model is only best for the given, too limited data set size [Schöniger et al., 2015a].
The BMS and justifiability analyses were performed using surrogate models, built via an arbitrary polynomial chaos expansion (aPC) in order to keep the analyses feasible for computationally demanding biogeochemical models. The aPC accelerates the analysis, which requires a large number of model evaluations, by reducing the required number of evaluations of the original model. We apply iterative Bayesian updating of the surrogate models, improving their accuracy while incorporating measurement data. In order to account for the error that arises from comparing the surrogates instead of the original models, correction factors for the calculated weights were introduced. The correction factor proposed by Mohammadi et al. [2018], which corrects the comparison of a model with measurement data, was extended to a novel correction factor for the comparison between two models. It enables a reliable surrogate-based Bayesian model justifiability analysis.
Applying the introduced Bayesian justifiability analysis to three different models (simple chemistry model M SC , initial biofilm model M IB and full complexity model M FC ), we compare the models to measurement data and among each other. The comparison is based on the predictions of calcite content and calcium concentration at different data points in space and time. The justifiability analysis has shown that the simple chemistry model M SC and the full complexity model M FC for both calcite and calcium, and the initial biofilm model M IB only for calcium, identify themselves best in comparison to the other models when a certain data set size is used. The simple chemistry model M SC even achieves perfect justification with a weight of 1.00.
The analysis has also revealed that the data set size is too small for justification of the initial biofilm model M IB in terms of the calcite content, since its diagonal entries of the model confusion matrix are always smaller than 0.5. Further, it shows that the initial biofilm model M IB and the full complexity model M FC are similar in terms of both quantities of interest (calcium concentration and calcite content). Additionally, the conventional BMS analysis reveals the simple chemistry model M SC as the best model in the model set, because of its best trade-off between goodness-of-fit to the measurement data and a sufficiently small degree of complexity.
The proposed analysis extends the very general justifiability analysis by Schöniger et al. [2015a] and makes it applicable to computationally expensive models. The results for the surrogate models followed the intuitively expected preference for the simplest model when only little data is available. This makes the method well suited for application cases where this situation, little data combined with computationally expensive models, appears. Although the method poses an effective way of comparing computationally expensive models, their computational cost must not be disregarded: with increasing computational cost, the number of model evaluations affordable in a given period of time decreases, which leads to a less precise surrogate model and therefore less reliable results in the justifiability analysis.