The method of forced probabilities: a computation trick for Bayesian model evidence

Bayesian model selection objectively ranks competing models by computing Bayesian Model Evidence (BME) against test data. BME is the likelihood of data to occur under each model, averaged over uncertain parameters. Computing BME can be problematic: exact analytical solutions require strong assumptions; mathematical approximations (information criteria) are often strongly biased; assumption-free numerical methods (like Monte Carlo) are computationally impossible if the data set is large, for example like high-resolution snapshots from experimental movies. To use BME as ranking criterion in such cases, we develop the “Method of Forced Probabilities (MFP)”. MFP swaps the direction of evaluation: instead of comparing thousands of model runs on random model realizations with the observed movie snapshots, we force models to reproduce the data in each time step and record the individual probabilities of the model following these exact transitions. MFP is fast and accurate for models that fulfil the Markov property in time, paired with high-quality data sets that resolve all individual events. We demonstrate our approach on stochastic macro-invasion percolation models that simulate gas migration in porous media, and list additional examples of probable applications. The corresponding experimental movie was obtained from slow gas injection into water-saturated, homogeneous sand in a 25 x 25 x 1 cm acrylic glass tank. Despite the movie not always satisfying the high demands (resolving all individual events), we can apply MFP by suggesting a few workarounds. Results confirm that the proposed method can compute BME in previously unfeasible scenarios, facilitating a ranking among competing model versions for future model improvement.


Introduction
Many competing conceptual models can be used to represent real-world systems. These models differ in their underlying hypotheses, which need to be tested against realworld observation data for their accuracy in representing the featured real-world system. Bayesian model selection is a statistical method used for testing competing conceptual models against each other by ranking them based on Bayes' Theorem. Bayesian model selection involves computing Bayesian Model Evidence (BME), which is the likelihood of a model producing the observed data, given the prior distribution of its parameters.
Computing BME using analytical solutions is applicable only under strongly limiting assumptions, which generally prior. The latter corresponds to a well-specified prior and relatively uninformative data sets. This, in turn, means that practical applications can require extremely (up to prohibitively) large MC ensembles.
The size of the sampling ensemble for these methods is limited by the available computational resources. For high-dimensional problems (i.e. with many uncertain parameters), the so-called curse of dimensionality kicks in, requiring an exponential number of model evaluations [16,17]. Additionally, for highly accurate or informative data sets, the overlap between predictive distributions and observed data may be so small that MC methods may not result in a meaningful BME value (> 0) at all.
For a model-data system involving binary (yes/no) decision output, the likelihood function becomes a Diracdelta function, thus leading to likelihood values of zero for practically all sampled parameter values of the model. Thus, the BME value would tend to zero, and any model would be rejected as infinitely poor. This becomes a problem, especially for long-time sequences of repeated outputs. For example, in a lotto game, getting the first number right is not that difficult, but getting the exact sequence of six numbers in a row right is almost impossible.
For such model-data systems involving binary output, with highly discretized atomic-event-type data and Markov chain models, we propose a method to compute BME with a reasonably low computational effort. Observed states are called atomic events if each individual possible outcome can be enumerated and they are mutually exclusive and collectively exhaustive. Markov Chain models are stochastic models that fulfil the Markov Chain property, i.e. the probability distribution of model states in the next (time) step depends solely on the previous step, not on any prior state to that. We call our method of BME computation the Method of Forced Probabilities (MFP) due to its core idea: instead of evaluating millions of forward runs that may fit the data by random chance, the model is forced to follow the data during each time step. We record the individual probabilities of the model performing these exact transitions as if they were done without any constraints. Following a strict mathematical derivation, we compute BME as the product of these probabilities. By exploiting the Markov Chain property of the model with this procedure, we are able to compute BME in previously nearly impossible cases without resorting to any kind of approximations.
Model order reduction techniques offer an alternative approach for optimization, parameter sampling or Bayesian analysis of high-dimensional problems. Reduced-order models are a computationally cheap abstraction of the original, high-fidelity models [18]. Examples of such reducedorder modelling techniques include, but are not limited to, models obtained using projection-based model reduction method (e.g., polynomial chaos expansion [19], proper orthogonal decomposition [20]), response surface models (e.g., polynomials, kriging, radial basis functions, artificial neural networks, etc. [21]) and, lower-fidelity models (physically reliable simple abstractions of the system under study [21]). Although such reduced-order models assist in solving the computing time problem, they are only approximate. In contrast, our method (MFP) is exact. Also, our method tackles the challenge of evaluating BME rather than the computational efficiency issue of complex high-fidelity models. Further, our method can be used in combination with all reduced-order modelling approaches that maintain the Markov property. Other options include an abstraction of summary statistics from data (so-called approximate Bayesian computation [22]), manual-visible techniques (like moments matching [23]), or the use of plausible, non-Bayesian metrics [24].
In Section 2, we discuss the mathematical formulation of BME (Section 2.1), introduce our MFP approach for computing BME (Section 2.2), and illustrate it on a didactic example (Section 2.3). In Section 3, we introduce our test case for demonstration: we apply the method on a Stochastic Macro-Invasion Percolation (SIP) model that simulates multiphase flow in porous media (Section 3.1). The corresponding highly resolved data set was obtained from an experiment with slow gas injection into watersaturated, homogeneous sand in an acrylic glass cell (Section 3.2). We also design a synthetic data scenario for the proof-of-concept of our method (Section 3.3), and we list the implementation steps of the MFP for the SIP model under the different data scenarios (Section 3.4). Further, we add a list of general algorithmic steps of MFP in Section 3.5. In a previous study [24], we used the (Diffused) Jaccard coefficient to facilitate a quantitative comparison of an invasion percolation model to the experimental data set used in this study. This technique only works on image-type data and is not free from information losses. Therefore, our proposal MFP is the first method ever that facilitates a fully Bayesian assessment of the SIP model and is free of information loss. Section 4 discusses the results obtained from the synthetic (Section 4.1) and real-data scenarios (Section 4.2). Finally, we summarize the contributions of this study, draw conclusions, provide an outlook towards future work, and list a few examples of applications where our method can be applied in Section 5.

Bayesian model evidence and its evaluation via the proposed method of forced probabilities
We present the concept and mathematical formulation of Bayesian Model Evidence (BME) in Section 2.1. Then, we introduce the concept of our approach to computing BME (MFP) in Section 2.2. We illustrate our proposed method with the help of a didactic example in Section 2.3.

Bayesian model evidence
For N m competing models M k , k = 1...N m and observation data y 0 , the BME value BME k of any model M k can be evaluated as (Bayesian integral from [25]): where U k denotes the model's parameter space, u k represents a random parameter vector with prior distribution p(u k | M k ), and p (y 0 | u k , M k ) is the probability or likelihood of the parameter set u k of the model M k to have generated the observed data set y 0 . The integral I k over the entire parameter space is computationally expensive and can become infeasible (with no meaningful BME value) in the cases discussed in Section 1. A review of existing methods to determine BME can be found in [1]. To facilitate the introduction of our proposed method, MFP, we present here the approach of simple (or: brute-force) MC integration [26] of Eq. 1. The integrand is evaluated at randomly chosen points (u k,r ) of the parameter space U k , which are drawn from their prior distribution p(u k | M k ). The mean of the evaluated likelihoods (p (y 0 | u k , M k )) provides the approximate value of the integral (referred to asÎ k ): with N being the number of MC realizations (ensemble size). To re-stress the problem: in applications with many precise data, the summands in this equation are (close to) zero with a probability very close to one, such that convergence can be prohibitively slow.
To rank models against each other, one can directly compare their BME values (the larger, the better) or their negative logarithmic BME values (the smaller, the better). Alternatively, one computes so-called Bayes factors (BF) [25] for two models k1 and k2: with a scale for interpretation provided by, e.g., [27].

Method of forced probabilities (MFP): key idea
For our purposes, we redefine Eq. 1 as: Here, the parameter space U k is split into uncertain parameters θ k and random events ω k , p(y 0 | ω k , θ k , M k ) is the likelihood of the parameters ω k and θ k of model M k to have generated the data set y 0 , and p(ω k , θ k | M k ) is the prior probability density of these parameters. Uncertain parameters θ k comprise those parameters and inputs of the model with unknown or non-measurable values. Random events within the model ω k represent apparently stochastic system behaviour that cannot be explained deterministically (but only distribution-wise) by the model's equations, assumptions or mechanisms (see also Section 3.1).
Using the law of total probability [28], we split the double integral of Eq. 4 into an inner integral over random events and an outer integral over uncertain parameters: The key idea of the Method of Forced Probabilities is to replace the inner integral (over random events) with a single analytical solution and use an MC integration (Eq. 2) only for solving the outer integral over uncertain parameters (θ k ), for models obeying the Markov Chain property. This means that for random events ω k , as opposed to simulating thousands of forward model runs and waiting for a random match with the observed data, we instead record the individual probabilities p(ω k | θ k , M k ) of the model performing the exact transitions observed in the data at each time step. Using the Markov chain property, the product of these probabilities corresponds to p(y 0 | θ k , M k ) in Eq. 5: (6) where P (y 0 (t + 1) | y 0 (t), θ k , M k ) is the probability of transition in y 0 (in accordance with the data) from time step t to t + 1, and t max is the total number of time steps in the experimental data. The idea is to plug this exact analytical solution into Eq. 5 and use the MC method only for the uncertain parameters. If numerical scaling becomes an issue for Eq. 6, one can simply work in (negative) logarithmic scale: Further, even after using the logarithmic scale, numerical issues with the BME values can arise during averaging (after exponentiating Eq. 7) for the outer integral of Eq. 5 due to the scale and span of individual values. We address this by a numerical trick that involves subtracting a common BME value at the logarithmic scale, such that the exponent of Eq. 7 (Eq. 5) is closer to zero, see Appendix B.
One may argue that the act of multiplying individual likelihoods in order of appearance in a time sequence is close to the process done in data assimilation methods, where a time series of time slice-wise likelihoods and cumulative BME values can be spit out as a simple by-product. This analogy is most apparent when comparing to particle-filterlike schemes for data assimilation [29]. Moreover, just like our BME computation can be used for parameter selection / Bayesian update of parameters, this could also offer the path to parameter estimation in data-assimilation mode. Some data assimilation schemes perform a joint estimation of system states and uncertain parameters, typically called augmented state vector approaches (e.g., [30]) or parameter-space schemes (e.g., [31,32]). Without going into further detail, this opens a future pathway to apply our MFP method in (real-time) data assimilation for either state forecasting, parameter updating, or both at once.

MFP: implementation illustrated with a didactic example
As a toy model for demonstrating our method, let us consider a simple Markov Chain with two output states (0 and 1) and a fixed (instead of uncertain) parameter π k (e.g., a repeated coin flip experiment). Thus, the Bayesian integral in Eq. 4 simplifies to: i.e. θ k is fixed, and the outer integral disappears. Note here that U k only contains the random events. The transition probabilities of the model are defined as: Here, b is an output state at a particular flip, and a is an output state in the previous flip (see Fig. 1).
For a number of flips t max = 3 (i.e., t = 0, 1, 2, 3), the possible predictions by the model are shown in the  Fig. 2. Additionally, in this diagram we fix the initial condition at t = 0 to y(0) = 0. Let us assume that the true observation data sequence is 0110 (highlighted in red in Fig. 2).
In such a simple tree structure with equiprobable branching (π k = 0.5), it is obvious that the probability (BME) of observing the single true path with likelihood one is 1 number of paths . Now imagine if the sequences' length t max increases (a deeper tree) or the dimension of the state space is increased (more than two branches for each node), the complexity of the probability tree diagram will increase exponentially (see Appendix A for more details). For example, for a binary tree with t max = 100, we end up with 2 100 different paths. This would further diminish the BME value and increase the computational effort to completely sample all possible paths in direct MC-based approaches based on the conventional Eq. 2.
Most real-world applications involve a more complex structure, where the branches are not equiprobable or complete enumeration is not possible anymore. In such cases, an MC approach would be used to sample each random path in proportion to its probability, requiring an even more significant number of samples to represent all paths, including the ones with very low probability, statistically sufficiently well. Note that it is not enough to "hit" the one path that coincides with the observation, but for an accurate approximation of BME, we need an accurate representation of low probabilities just as well (zeros play an essential role in arithmetic averaging), see, e.g. [1].
In contrast, the MFP simply calculates BME as the probability of mimicking the observed state changes, i.e., a flip from 0 to 1 and then staying at 1 and finally flip again to 0: With π k = 0.5, this equals to the enumeration or MC solution of 1 8 . This means that we only need to calculate a finite product over a set of t max (here: three) values. Therefore, our method (MFP) scales linearly with t max and does not exponentially explode like full enumeration or MC methods.

Demonstration on a real case study
In this section, we demonstrate the applicability of our method on a more complex model with Markov Chain property: a Stochastic Macro-Invasion Percolation model (SIP), see Section 3.1. Invasion percolation (IP) models are discrete growth models, which repeatedly apply socalled rules that specify why which model block is invaded by the wetting or non-wetting fluid in the next step, see Section 3.1. Macro-Invasion Percolation (Macro-IP) models are an upscaled abstraction of pore-scale IP models. We have used a Macro-IP model with the experimental data used in this study in a previous study [24]. The SIP model is conceptually similar to the Macro-IP model of [24]. The difference between the SIP model and the Macro-IP model of [24] is an additional rule for stochastic selection from the Stochastic Selection and Invasion (SSI) model of [33]. Thus, to describe the SIP model, we describe the model formulation of the Macro-IP model [24] and the additional rule for stochastic selection in Section 3.1. We compare the SIP model to experimental binary-image data from gasinjection in homogeneous, water-saturated sand [34], see Section 3.2. For a test in the absence of all problems that real experimental data bring about, we also use synthetic model-generated data, see Section 3.3. Then, we discuss the implementation of MFP on the SIP model in Section 3.4 for both the synthetic and the real data sets. Before discussing the results of this case study, in Section 3.5, we list the general algorithmic steps of MFP.

Stochastic macro-invasion percolation model (SIP)
Originally derived from the Percolation theory of [35], Invasion Percolation (IP) models are often used for simulating multiphase flow in porous media (E.g., [23,33,[36][37][38][39][40][41][42][43][44][45][46]). Various versions of IP models have been used in literature, but all of them have a similar implementation structure (illustrated in Fig. 3): • The porous medium is conceptualized as a network of 2-dimensional (2D) or 3-dimensional (3D) blocks or nodes, with a given connectivity, by assigning threshold values to the blocks from a distribution (depending on the specific porous medium). • Initially, all the blocks are occupied by one (defending) fluid. Then the invading fluid is placed in one of the blocks, depending on its source or injection site (see Fig. 3a). • The neighbouring blocks of the invading fluid block are evaluated for their entry thresholds (pressure) and based on some rules (mostly minimum entry threshold, see below) one of the connecting blocks is filled by the invading fluid (see Fig. 3b & c). The filling process of one block can occur in a single step or in multiple steps.
Macro-IP models differ from traditional IP models in scale, i.e. the blocks in Macro-IP models represent a subnetwork of pores and throats in contrast to individual pores as in IP models [24]. We use a 2D representation of these models to mimic the quasi-2D setup of the experimental data (Section 3.2) and then to simulate gas invasion in homogeneous water-saturated sand.
In the Macro-IP model [24], at each time step, the gas invades one of its neighbouring water-filled blocks using The hatched light blue block shows the next block to be filled and, the red-rimmed neighbouring blocks are the ones evaluated for invasion at each step a rule that searches the block with the smallest invasion threshold T e calculated using: where P e is the local entry pressure in each block, which is the capillary pressure (P c ) required by gas to invade a water-occupied block. P w is the pressure of the water phase, calculated with the hydrostatic pressure assumption as: where ρ w is the water density, g is the acceleration due to gravity, and z is the height from the top of the acrylic glass cell. P e is calculated from the Brooks-Corey capillary pressure (P c ) -saturation (S) curve [47]: where S e is the effective wetting-phase saturation, S w is the wetting-phase saturation, S r is the residual wetting saturation, P c is the capillary pressure, P d is the macroscopic displacement pressure, and λ is the pore-size distribution index, the value of which typically ranges between 1 − 4 and can be up to 7 for very uniform sands. Please note that P e is a specific value of P c (a point on the P c − S curve), and that we provide the term: S w −S r 1−S r of the Eq. 12 only to correspond with the general form of the Brooks-Corey pressure-saturation relation prevalent in literature. The term S w −S r 1−S r has no further use in our model description.
At the scale of the model, the exact pore-scale arrangement is unknown. Thus, the capillary pressure P c is randomized per block using the Inverse transform sampling method, the details of which have been discussed in [24]: with U being a random number from a standard uniform distribution on the interval [0, 1].
Since the Macro-IP model from [24] is deterministic for any given value of θ and a frozen set of random P c values, we would have no random events ω within the model. That means computing BME would focus only on the outer parameter-related integral of Eq. 5. The inner integral would degenerate to a simple yes or no problem. Without addressing measurement errors or any other form of randomness between the model and data, the answer would be a straightforward rejection with BME = 0. Thus, to include this model in our BME comparison, a modification of the model to include random events ω is required. This is achieved by using the SIP model. SIP differs from the Macro-IP in the way that instead of gas selecting the block with the minimum invasion threshold (T e ) for invasion, it invades based on a slightly modified rule for stochastic selection from the Stochastical Selection and Invasion (SSI) model of [33]. The stochastic selection rule of the SSI model accounted for viscous effects (randomness brought into the system by the fluids) and was originally applied to dense non-aqueous phase liquid (DNAPL) migration. At the injection rate of the experimental data used in this study, viscous forces are negligible. However, viscous forces come into play at other injection regimes or even for different fluids. This stochastic selection rule has been modified to be applicable for our gas invasion in water-saturated sand [23], instead of the DNAPL invasion of the original work, and is explained below.
In the modified stochastic selection rule of the SSI model, the decision of gas invasion is still proportional to the T e values of the neighbouring blocks but is slightly modified using an additional parameter: c called the cell selection weighting factor [33]. In this rule, the list of T e values of the neighbouring blocks (n) of the current gas cluster are arranged in an ascending order T e,asc and the cumulative sum T e,cum is evaluated: Then, the first block (value of i) where the rule specified in Eq. 15 is found true is invaded by the gas: Here, R is a uniformly distributed random number between [0,1], and c, in the range of [0....∞], controls the strength of randomness in the stochastic selection rule. When c → ∞, the value of R c → 0 for almost all values of R. In this case, the block with the lowest T e value, first on the list of T e,asc , will be selected for invasion. This results in lightning-bolt-like gas fingers. The lower the c value, the higher the RHS of Eq. 15, which ensures that the higher T e [j ] are picked more often; hence we observe a gas finger that is not moving strictly upward, but resulting in a gas finger pattern with a wider spatial distribution. Example simulations for c = 5, 15, 100 will be provided in Section 3.3, see, e.g., Fig. 4. Once the gas invades a block, it is assigned a gas-saturation value of 1. Thus, the model has a binary (gas or no gas) type of image data as output.
Furthermore, in our SIP model, we also include the additional re-invasion rule of the Macro-IP model from [24] to incorporate fragmentation and mobilization events [23,41,48] observed at low gas flow rates. This re-invasion rule is based on the terminal thresholds (T t ): where P t is the terminal pressure calculated from the P e -to-P t ratio (α), which can be obtained from the characteristic drainage and imbibition curves of the corresponding porous medium, including capillary pressure hysteresis [49]: where α accounts for capillary hysteresis between drainage and imbibition [50]. Water re-invades the gas-occupied blocks if where the subscripts g and w stand for the gas-and water-occupied blocks, respectively. When the re-invasion of water occurs on the peripheral blocks of the gas cluster, mobilization of the gas cluster occurs. If the re-invasion of water disconnects the gas clusters, a fragmentation of the gas cluster occurs. Therefore, gas invasion can only occur at a block connected to the gas cluster containing the gas-injection port. Hence, the other gas clusters can have a re-arrangement of blocks, but no further growth. This mimics the trapping of the gas phase at this scale. The model parameters used in this study are given in Table 1. Out of the list of model parameters, P c per block calculated using P d (Eq. 13) is surely an uncertain parameter to enter into θ . We assume that the other parameters are known in this study.

Gas injection experimental data
We use the experimental data from a quasi-2D gas injection experiment in an acrylic glass cell of dimensions 250mm × 250mm × 10mm filled with homogeneous, water-saturated sand of 0.7 mm average grain size from the set of experiments in [34]. The gas is injected at a rate of 0.1ml/min (Experiment 0.1-A of [34]). At this injection rate, gas migrates along with fragmenting and coalescing events on their way. The discontinuous nature of the gas flow is further confirmed by continuously measuring the pressure at the injection point during the experiment [34]. The experimental setup, data collection, and processing are described in detail in [34]. We present a brief overview of the experiment we use in this study.
The experimental data is a time series of 2D binary images (around 10,000 images) obtained using a light transmission technique [53][54][55]. The images are obtained at the rate of 30 frames per second for a total of 330s. Optical density (OD) [55] values are used to detect gas in the system. Gas is considered to be present in a block above an OD threshold value of 0.02. An ideal data set for our method would be where each atomic step (individual invasion events or re-invasion events for each block) is separately visible in time. We pick this particular data set because of its high resolution in both space and time. However, the data obtained is not free from some challenges that we need to overcome to use MFP to evaluate the BME for the SIP model.
• Firstly, non-atomic events are observed in the data set even at this high temporal resolution. This means that, from one time step to the next step, multiple atomic events (e.g. invasions, re-invasions) are found to occur so that their exact sequence is not given uniquely. • Secondly, at some time steps, the experimental data shows re-invasion at a block not as expected by the model's deterministic re-invasion rule as specified in Eq. 18. • Thirdly, at some time steps, invasion of gas occurs at a block that does not appear to be connected to the cluster containing the original gas injection block, violating the SIP model's assumptions. This disconnection could result from the data's optical detection limits. • Fourthly, in some time steps, the number of gas pixels also decreases from the previous time step. This violates the mass conservation principle that the model (in the absence of a variable gas density) simplifies to a volume balance.
With the current configuration of the SIP model, using MFP would thus lead to zero-probability events because of the aforementioned observations (second to fourth) in the data set. This would lead to, within the scope of the present work, meaningless BME computations. This is not an artefact of MFP but would also occur in all other methods to compute BME. The MFP is able to map it to individual zeroprobability events, while other BME computation methods would merely return an overall zero value for the entire inner integral of Eq. 5.
That is why we first test our method on synthetic data, as will be discussed in the next section. We then slowly introduce the above-mentioned irregularities in our synthetic data set, and we also introduce some workarounds (discussed later) to be able to use MFP despite the nonideal data set and the non-ideal model assumptions. The first problem leads to an extension of MFP towards nonatomic data events, while the second problem leads to an augmented probabilistic interpretation of zero-probability events. After that, we use the longest sub-sequence of the real-experimental data with non-decreasing n invasions ≥ n re−invasions number of invaded gas blocks (7 steps between image number 239 and 246), which excludes the third and fourth problem. Within that sub-sequence, the workarounds can be implemented without computational difficulties.

Synthetic data
We begin with testing our method on the SIP model with synthetic data. By using synthetic data, we first test our method under ideal conditions. Next, we introduce irregularities in the data set in a controlled manner. To that purpose, we run the SIP model, with c = 15 on a particular invasion threshold field (T e,syn ) with no re-invasion events (i.e. the rule given by Eq. 18 is removed) and use the results instead of a real data set. Thus, our synthetic data set consists of a sequence of atomic events and no measurement errors. It is now guaranteed that the models can follow the data with a non-zero probability. Figure 4(a) shows this synthetic truth.
As a next step, we add non-atomicity to the synthetic data set, which is also observed in the real data set (see Section 3.2). To make non-atomic synthetic data, we regularly omit time steps, such that the SIP model would need n ev = 2, 3, 6 iterations to get from one state to the next. This means we keep only every second, third, and sixth state from our atomic data set.

Implementation of the MFP on the SIP model
In this section, we discuss the setup and implementation of the MFP on the SIP model for both a synthetic data set and the sub-sequence from a real data set. First, we describe the common implementation setup for both types of data sets. Then, in Sections 3.4.1 and 3.4.2 we will highlight the difference in scenario setups for the corresponding data set.
We choose three cell selection weighting factors to correspond to one rather random (c = 5), one more deterministic (c = 100) and one model version in between (c = 15) of the SIP model. This results in three model versions. The choice of these three different cell selection weighting factors can be thought of as representative sand pack experiments with different force-dominated regimes: viscous (c = 5) or capillary (c = 100) [34]. The invasion threshold field makes up the uncertain parameters θ k over which we have to marginalize the inner integral (random gas-invasion decisions of SIP) through the outer integral of Eq. 5 to obtain BME. Figure 4 visualizes the randomness of the SIP model with 10 model runs for each of the model versions (c = 5 or 15 or 100). In contrast to the didactic example of Section 2.3 (Fig. 2), for SIP model we consider at each "node" of the decision tree a random decision of gas migration. These random decisions each have multiple, situation-specific possibilities for invasion and re-invasion (illustrated in Fig. 5) instead of binary decisions.  Fig. 2); The blue block marks the injection block. The red-filled blocks mark the currently invaded blocks. In the next step, any block on the interface (red-rimmed blocks) might be invaded, and any one of the red-filled blocks or none might be re-invaded Also, unlike the didactic example in Section 2.3, the probabilities for each forced time-step in SIP will not be a constant π k , or 1 − π k , but they will depend on multiple factors, namely: (1) the cell selection weighting factor (c), (2) the invasion threshold (T e ) field that depends on the randomized P c fields and, (3) the current shape of the cluster of gas-invaded blocks at the current time-step. We would like to recall, from Section 3.1, the cumulative sum T e,cum (Eq. 14) and its connection to the uniformly distributed random variable R in Eq. 15. The model chooses to invade the block with the index i (of the ascending order structure) and not any other neighbouring block if and only if Eq. 15 is fulfilled for i and not for i − 1, i.e., Rearranging the terms in the equation above gives us two bounds, and R must be between The interval width between these bounds is the probability of this exact invasion at the block i. Note here that, for the block with index i = 1, the lower bound remains undefined by Eq. 14 and is set to zero.
When we investigate non-atomic steps in the data, we cannot simply evaluate the transition kernel of our model implied by Eq. 20 but must think about a workaround. One can view it like an excerpt of a complete probability tree diagram as shown in Fig. 2, where n ev atomic events occur. The difficulty lies in not knowing the states and their ordering in between. We know the start and the end and can only guess the sequence of atomic events that happened in between. This means that any permutation of the events could be a suitable choice.
A reasonable treatment is to consider all the permutations, i.e. compute the BME over all these possible permutations. Only permutations that do not lead to a path the model can traverse by its underlying rules (see Section 3.1) and give by definition a probability of zero can be excluded. Moreover, further, to not favour any specific one of the remaining permutations, it is reasonable (and statistically Fig. 6 Schematic for the workaround for non-atomic data steps. In the experiment path y 0,a , the step from a = 2 to a = 3 consists of 5 atomic events. Each blue path corresponds to one out of n paths of atomic events leading to y 0,3 , with n being the number of possible permutations of the order of the atomic events. (Here, in the schematic it is n = 5! = 120) correct) to average their BME (see Fig. 6). We call this workaround a mini-Monte Carlo (mini-MC) approach.
If we assume that the number n ev of the non-atomic events is bounded by a constant m throughout the whole experiment, we increase computational effort by a factor of m!, but we preserve the linear complexity of our method in t max as mentioned in Section 2.3. It is reasonable to assume m t max , and thus we only employ an exhaustive search on a small scale and do not affect the overall effort significantly.

Synthetic scenarios
This section specifies scenario setups to treat the synthetic data set from Section 3.3. We split up our evaluations into three synthetic data scenarios as follows.

Scenario 1:
In this scenario, we plug in the true invasion threshold field from Section 3.3 (i.e. the field T e,syn used to generate the synthetic data set) for all 3 model versions (visualization in Fig. 4(b)) and evaluate BME with our method on the atomic synthetic data and the non-atomic synthetic data (i.e, with 2, or 3, or 6 − step jumps). This scenario represents gas-injection experiment repetitions in the same sand pack without any disturbances to the setup (ideally). Scenario 2: In this scenario, we draw an ensemble of 1000 invasion threshold fields by adding small, random noise to the true invasion threshold field (T e,syn ). Then, we plug each of these fields into the 3 model versions (visualization in Fig. 4(c)) for both the atomic and non-atomic synthetic data (with 2, or 3 − step jumps). This scenario represents gasinjection experiment repetitions in the same sand pack with smoothed-out local heterogeneities or disturbances, e.g. due to grain re-arrangement during the injection of gas. Scenario 3: This scenario involves an ensemble of 1000 independent random invasion threshold fields, each of which is plugged into the 3 model versions (visualization in Fig. 4 (d)) for both the atomic and non-atomic synthetic data (with 2,or 3 − step jumps). This scenario represents gas-injection experiment repetitions, where the sand is repacked after each experiment.

Real data scenario
When using the real data sequence as mentioned in Section 3.2, we use a setup similar to Scenario 3 of Section 3.4.1 with 7000 random invasion threshold fields.
The difference being that we now use the SIP model with the ability for re-invasions (recall that Section 3.4.1 uses SIP model without Eq. 18), so that they can better resemble the real data set from Section 3.2. An immediate evaluation of these models leads to BME=0 for almost all invasion threshold fields; because of its deterministic re-invasion decision (rule specified in Eq. 18), the model wants to re-invade a wrong block and is punished with complete Bayesian rejection. Theoretically, the BME value of 0 is correct, but it has no practical significance. The focus of this study is primarily method development and not model development. Therefore, we probabilistically change the model. We assign a 90% probability to the model's decision to re-invade the block obtained from the rule specified in Eq. 18 or not-reinvade any block. The remaining probability of 10% is uniformly distributed among the other blocks of the gas cluster for the re-invasion of water. That means, any block of the current gas cluster can be re-invaded with a probability of at least 0.1 n gas,cluster , see Fig. 7. Note, that n gas,cluster only accounts for the blocks in the respective cluster. Also, we have to treat the injection cluster differently as we have one less choice since the injection block cannot be re-invaded.
We also need to adjust our workaround for the non-atomic data (Fig. 6) because we now have a re-invasion rule Fig. 7 Illustration of the modification to tackle the second challenge in data from Section 3.2, where water re-invasion in gas-occupied blocks occur not according to the model's choice (guided by Eq. 18): The blue block marks the injection block. The red blocks are gas-occupied. In this example, top gas cluster has a probability of 0.1 ngas,top = 0.1 5 and the injection cluster has a probability of 0.1 n gas,injection = 0.1 6 , for a re-invasion of water in the respective cluster in the models. To do that, we combine the different orderings of re-invasions with the orderings of invasions from before and leave the rest of the nonatomic modification unchanged. Note, that an atomic time-step may also have no re-invasion at all, since n invasions ≥ n re−invasions . The total number of orderings is then n orderings = (n invasions !) 2 /(n invasions − n re−invasions )!. For example, a combination of non-atomic events with 5 invasions and 2 re-invasions leads to 2400 different orderings, which happens to be the maximum number for the sub-sequence mentioned in Section 3.2

MFP: list of algorithmic steps
Before we discuss the results from our case study, we summarize the general algorithmic steps of the MFP. These steps are the same for all models obeying the Markov Chain property combined with exact data (knowledge of each atomic event).
(1) List all possible events in the data, both reproducible and non-reproducible, by the model. For example, in the case of our demonstration case study, the data's non-atomic events fall under the model nonreproducible events category. (2) State the formula for the probabilities of events being executed by the model. These could be individual, fixed values, evaluations of a probability distribution function or a combination of both. In our case study, it is stated by Eq. 20.
(3) In the original model code, code a new update rule to force the next model state, similar to a restart capability of a code. (4) Propagate and accumulate probabilities through all time steps, i.e. a simple multiplication. At this stage, a possible code break-off criterion can also be included to identify and flag zero-probability events.
The implementation of our code is mostly non-intrusive because no re-writing of the code is necessary. However, step (3) requires good restart abilities of the model code with forced model states per time step. Also, the simplest way to achieve step (2) is to add a line to the original code that outputs the probability of the forced event.

Results and discussion
This section discusses the results obtained from our analysis. Table 2 contains the BME values on a negative logarithmic scale (the smaller these values are, the better the model) obtained using MFP on the SIP model for both our synthetic-data case as well as our real-data case.

Results from the synthetic data case
In both Scenario 1 and 2, the model version with c = 15 has the best BME values (see bold font in Synthetic Scenario 1 and 2 of Table 2). For Scenario 1, this is what we expect because this model version and threshold field were used Table 2 Table containing the BME values obtained in the three synthetic scenarios and the real scenario on a negative logarithm scale, the ensemble sizes n MC , number of atomic events n ev occurring within a non-atomic step and, computed Bayes factors BF k2 k1 Note, here the model versions (k1, k2), are denoted by their respective c values. The best performing model is highlighted with bold font − ln BME value to generate the synthetic data. For Scenario 2, the threshold fields were close to the synthetic data setup; therefore, the correct model version still had the best BME value. Also, according to expectations, all the model versions had significantly worse BME values for Scenario 3, where entirely random entry threshold fields were used. However, the ranking also changed, and the more random model version (c = 5) emerged as the best model in Scenario 3.

Why does the model ranking change for Scenario 3?
Let us first look at the two extreme model versions to understand why the ranking changes. The model version with c = 100 is almost deterministic in its choice of a gas pathway, which is different for each invasion threshold field. This is why this model version can get good BME values (small − ln BME) if and only if the invasion threshold field closely matches the true field (T e,syn ), which is highly unlikely when we use entirely random invasion threshold fields. If this is put colloquially, the few good predictions of the c = 100 model version do not make up for the many bad ones. The more random (c = 5) model version is not as deterministic in its choice of the gas pathway as c = 100 is, and so, it is largely unimpaired by the choice of the invasion threshold field. This is why the random model version (c = 5) achieves mediocre values for any invasion threshold field. Thus, in the scenario where the invasion threshold field is highly uncertain, it has an advantage that helps it emerge as the best model version in Scenario 3. The model version with c = 15 is not identified as the best model when we increase the uncertainty in the threshold field. This indicates that the entry threshold field is a highly sensitive and important parameter for the SIP model to function correctly.

Effect of non-atomic synthetic data
The introduction of non-atomicity in the synthetic data does not change the ranking of the models in any scenario (see, − lnBME values for n ev values other than 1 for Synthetic Scenarios (1, 2, and 3) in Table 2) but makes it slightly less decisive in comparison to n ev = 1 for all the synthetic scenarios of Table 2. This coincides with the synthetic data set becoming, in a sense, weaker or less informative if parts of it are unknown in ordering. This is visible in Table 2 as, despite the general rise of − ln BME values with increasing n ev , their differences become slightly smaller. Looking at the Bayes factors between the competing models makes it easier to see this effect: they generally decrease with increasing n ev . There are only a few exceptions to this observation. For example, the Bayes Factors BF 15 5 between the models in Scenario 2 increases with the increased nonatomicity in the synthetic data. However, looking at the orders of magnitude of the values in comparison, it can be safely concluded that this does not affect or change the level of decisiveness.

Results from real data scenario
Initially, we evaluate the model versions on the complete real data set. This helps us gather information on the magnitude of the effect of the challenges in the real data for implementation of MFP, as discussed in Section 3.2. We find that non-atomic events with a very high number of events n ev are pretty common in the data set, which leads to very high computation time for the mini-MC workaround explained in Section 3.4, thus making a BME evaluation infeasible even with MFP.
The events of wrong block re-invasion (the second problem in data discussed in Section 3.2) in the data set are plenty, but we are able to tackle them with our workaround mentioned in Section 3.4.2. In the later time-steps of the data, block invasion in non-gas injection clusters (Third problem in data discussed in Section 3.2) or events with decreasing numbers of invaded gas blocks (Fourth problem in data discussed in Section 3.2) are predominant. However, we have no fix to this problem in the real data set.
Thus, we decide to look for a sub-sequence of time steps in the data that aligns with the model's assumptions and has a reasonably small number of non-atomic events (for reasonable computation time of mini-MC runs, see Fig. 6). The resulting sequence of time steps in the data is the one mentioned in Section 3.2. For that sequence, − lnBME is between 269 and 360 for the probability of correctly predicting seven steps, which is a small probability already. This is not a fault of our method MFP. We hope that the models to which our readers may decide to apply MFP match their corresponding data set better than in our case.
Regarding the ranking of the model versions, we see a similar pattern as in the synthetic Scenario 3. The best model version is the one with c = 5, followed by c = 15 and then c = 100 (see Row: Real from Table 2). The uncertainty in the invasion threshold fields is handled better by a random (c=5) model than by the more deterministic models. Therefore, more information about the invasion threshold fields is necessary for these models to accurately predict the gas path under the experimental data's conditions and scale.

Conclusions and outlook
In conclusion, our method MFP makes it possible to calculate BME for Markov-Chain type models and discrete atomic data in previously impossible cases. The method works well, is non-intrusive to the model and has a linear computational cost. In our case study, the method was demonstrated only on a relatively small sequence of real data. This is because the large distance between the model outputs and the real data leads to many zero-probability events. So, for more conclusive results, better models or more-informative data are required, i.e. data with no or few non-atomic events and an improved SIP model.
When we use MFP to evaluate the BME for imperfect models or data or both, resulting in practically futile BME values (BME = 0), we can adapt our approach and use MFP to detect events leading to such values in the model and the data by flagging them. This exercise helps determine the structural errors in the model, or mismatch between the model concepts and the observations. From our implementation of MFP on the SIP model and gas-injection experimental data used in this study, we can conclude that both the model and the experimental data have a scope for improvement. The rules in the SIP model could be updated by looking at specific types of events, e.g. the ones that get the model rejected or result in poor performance (like the deterministic re-invasion events in the current version). The experimental data technique processing could be updated to have more discrete and atomic data steps. However, experimental data or model improvement is beyond the scope of the present research work.
Our method, in its current stage of development, requires that the data be noise-free. Further research is needed to apply MFP to noisy data (e.g. with statistical assumptions on the distribution of black/white detection errors). A straightforward idea would be to perturb the available data with several realizations of randomly generated noise and then handle each realization with our method. However, this multiplies computational costs by a substantial factor to host these repetitions.
Our method enables a fully Bayesian assessment of the SIP model for the first time. Besides the SIP model and gas-injection experimental data, this method can be applied to gas (fluid) migration in fractured-porous media under the conditions of Markov-style model formulation and complete observations (e.g. in thin slices or with high-resolution 3D micro-tomography). It can also be applied to systems involving experiments and models at the microscopic scale, where individual pores are resolved by appropriate monitoring techniques (e.g., [56]).
Apart from multiphase flow in porous media problems, this method can be used in applications such as counting processes (e.g. as in traffic), discrete computerized systems (e.g. network traffic), probabilistic Markov-style modelbased river water quality monitoring [57], tracer experiments / Lagrangian movement (e.g. fluorescent microparticles to monitor turbulent flow [58]), stochastic models for discrete, dynamic systems and complete observation (e.g. chemical reaction modelling), statistics-based data-driven soil-plantatmosphere modelling [59] and micro-seismic modelling [60] to name a few.

Appendix A: Monte Carlo simulations grow exponentially in t max
We will prove here that the required size of MC simulations grows exponentially in t max . Based on the tree structure in the didactic example from Section 2.3, we can see that there are N = 2 t max equiprobable branches. Therefore, the probability of the correct branch (the one with non-zero likelihood) is exactly: which apparently is the situation-specific definition of BME. When approximating BME via MC sampling with i = 1 . . . n independent random realizations, then each realization i has a constant and independent probability of finding or not finding the correct branch. This situation is described exactly by the Binomial distribution. The Binomial distribution is a discrete probability distribution of the number of success events k out of n independent trials: P (X = k) = n k p k (1 − p) (n−k) (22) where p is the probability of success and 1 − p is the probability of not obtaining success. Here, we define n as number of MC trials, X is the (random under repetition of the entire MC simulation) number of times the MC finds the correct branch (i.e., the one with Likelihood = 1). For a given execution of MC with given n, we will find k times the correct branch and n − k times any branch with Likelihood = 0. Also, in the context of our example, the probability parameter p of the Binomial distribution is equal to BME = P true . From the MC results, one would estimate: Just for reassurance, the asymptotic MC result for BME at n → ∞ converges to the exact solution: But can we estimate the MC error in this approximation, e.g., expressed as the coefficient of variation (CV). For the Binomial distribution, we know that: Variance of X: Var[X] = n · p · (1 − p)

Mean of X:
E[X] = n · p As the conversion from k to the estimate of p is simply a division by n, we apply the rules of linearized uncertainty quantification to see that: Variance of X n : Var X n = p·(1−p) n Mean of X n : E X n = p Using this in the definition of the coefficient of variation: For small values of p as in the given example with p = 2 −t max , we can replace 1 − p ≈ 1, and hence: CV of X n for small p: Thus, the number of MC runs required for a desired accuracy (expressed as a desired value of the CV) is: This shows that the number n of required MC samples increases, for a given precision requirement, exponentially in t max . The base 2 of the exponent originates from the tree structure, where each node expands into two further branches. In real applications, where the evolution of the model over time has more than two possibilities, the base will simply increase, so the exponential growth will be even stronger.

Data Availability
The experimental data set used in this study is made available by [61] at Scholars Portal Dataverse: https://doi.org/10.5683/ SP2/RQKOCN.

Code Availability
The modelling data and codes used in this study are available as a data set [62] in the DaRUS dataverse for Stochastic Simulation and Safety Research for Hydrosystems (LS3):https://doi. org/10.18419/darus-2815.

Declarations
Ethics approval and consent to participate 'Not applicable'

Conflict of Interests
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.