A History Matching Study for the FluidFlower Benchmark Project

In this study, we conduct a comprehensive history matching study for the FluidFlower benchmark model. This benchmark was prepared and organized by the University of Ber-gen, the University of Stuttgart, and Massachusetts Institute of Technology, for promoting understanding of the complex physics of geological carbon storage (GCS) through in-house experiments and numerical simulations. This paper synthesizes the experiences of history matching the benchmark data encountered by the Delft-DARTS and CSIRO participants. History matching is first performed based on a low-dimensional-zonated structured model using a simple Poisson-like solver. The permeability of six facies and two faults is inferred in this stage to match the digitized concentration data. The history matching is then further enhanced to richer spatial and physical models to capture the spatial variation of permeability and buoyancy effects, using an unstructured grid. Efficient adjoint methods are used to evaluate the gradient used in the optimization of data misfits or equivalent Bayesian log-likelihoods. With efficient optimization methods available for both maximum a posteriori model inference and Randomized Maximum Likelihood methods for model uncertainty, we perform history matching using both binary and continuous concentration observations. The results show that the tracer plumes in the enriched model match the experimental plumes more accurately compared with the results from the parsimonious-zonated model. The history matching results based on the concentration observations provide more similar plume shapes compared with the case based on the binary observations. The permeability difference between the model before and after history matching reveals that the tracer plume zone and the high permeable zone are the regions of high sensitivity in terms of data misfit between the model


Introduction
Escalating greenhouse gas concentrations in the atmosphere and their effect on climate have become an urgent and global concern.Extensive CO 2 emissions are a pervasive aspect of modern industrialised society and its dependence on fossil fuels.Large reductions in CO 2 emissions are mandated by the most recent IPCC reports, and virtually all the conceivable pathways require engineered removal of new carbon pollution sources (IPCC 2022;IEA 2020a).Carbon capture and storage (CSS) has been proven to be one of the most promising solutions to this environmental issue.Typically, CCS can reduce 85-90% CO 2 emissions from large point emission sources (Leung et al. 2014), for example, power plants, cement kiln plants, etc.It is also very likely that carbon-removal technologiesdirect air capture or bioenergy with CCS-will be required because emissions of CO 2 cannot otherwise be reduced in time to avoid dangerous global warming (IEA 2020b).The IEA models require a rapid scale-up of CCS, from 40 Mt p.a. now to a Gt p.a. by 2030 (IEA 2020c).
There are several ways to store CO 2 into the geological formations (Chiaramonte et al. 2011;White et al. 2003;Lyu et al. 2021b;House et al. 2006;Schaef et al. 2010).Of these possibilities, CO 2 storage in saline aquifers is considered to have the largest potential for the storage of CO 2 .This is because saline aquifers can be widely found in both onshore and offshore areas.The usual four CO 2 trapping mechanisms, with varying characteristic time scales, are expected to apply: structural, residual, solubility, and mineral trapping, in increasing order of time scale.
To understand and quantify the flow dynamics of CO 2 storage process, numerical simu- lation is essential for modelling the process based on the governing equations and equations of state.However, the numerical simulation of CO 2 storage is challenging because of the complexity of the fluid thermodynamics and the unfolding of multiscale physics.
To assist in these challenges, the FluidFlower benchmark project (Nordbotten et al. 2022) has been constructed to accelerate progress in the field of CO 2 storage modelling in geological settings.This project has closely inspired the 11th Society of Petroleum Engineers Comparative Solution Project (SPE 2023).The central component of the benchmark is an experimental rig emulating a typical geological scenario for CO 2 injection, constructed by the University of Bergen.Several international research groups were then invited to perform numerical simulation of this benchmark, where the dominant processes are associated with multiphase flows, capillarity, dissolution, and convective mixing.The benchmark experiments furnish tracer images and pressure observations to serve as calibration data for the spatial model of petrophysical properties.Details of these experiments are described at length in Nordbotten et al. (2022).The efforts in history-matching following in this paper were motivated by the natural and reasonable belief that this would lead to the most accurate forecasts of the CO 2 injection.Hence, this study focuses on the history matching of the tracer and pressure tests of this benchmark, but we consider also the issue of forecasting uncertainty under CO 2 injection.
Inversion of geophysical data very often reduces to an optimization problem where some misfit measure between the observations and model predictions is minimized.Depending on whether model gradients are available, optimization methods are then classified into gradient-free and gradient-based methods.For problems with challenging nonconvexity or multimodality in the misfit function, gradient methods may be of limited use, as they are vulnerable to trapping in inferior local minima.In such cases gradient-free global methods like simulated annealing (Kirkpatrick et al. 1983), genetic algorithms (Holland 1984), or particle swarm algorithm (Eberhart and Kennedy 1995) are popular.Their efficiency is always vastly inferior to gradient methods, if applied to a problem that has a single or dominating global optimum.For such problems, where gradients can be obtained, the most efficient methods are usually variants of the Gauss-Newton method if the second-order (Hessian) information can be obtained or well approximated, or varieties of conjugate-gradient or sparse quasi-Newton methods (e.g.l-BFGS) where only the gradient is available, as is typical in high dimensional model settings (Nocedal and Wright 2006).Even in non-convex settings, gradient-based methods are very often the techniques of choice, if combined with graduation or continuation strategies for smoothing out the nonconvexity in the early phases of the optimisation.This paper uses exclusively gradient based techniques, which approach Newtonlike (quadratic) efficiency when the model dimensionality is very low and the Hessian is well approximated via use of the explicit Jacobian, as per the Marquardt method, but reduce to linear-only efficiency when only the adjoint-based gradient is used in the high-dimension cases.
When the model dimensionality is high and the forward physics is generated by a set of PDEs that are differentiable in the model parameters, which is typical of geo-energy systems, the most efficient way to compute gradients is the adjoint method.It allows for the efficient calculation of the gradient by reusing the solution of the governing equations, since the solution of the adjoint problem is usually very closely related to the forward problem, often by a time-reversal procedure.In the field of geo-energy reservoir engineering, adjoint methods have been applied to perform history matching (Chen et al. 1974;Oliver et al. 2008), petroleum recovery processes (Mehos and Ramirez 1989;Fathi and Ramirez 1984;Ramirez et al. 1984), thermal recovery processes (Wei et al. 1993) and so on.Later, with the advent of "smart well" and "smart field" concepts, adjoint-based optimization and optimal control ideas have been widely studied and applied in the field of reservoir engineering and reservoir management (Brouwer and Jansen 2004;Sarma et al. 2005Sarma et al. , 2006;;Volkov and Voskov 2016).
In this study, we will utilize the open-source Delft Advanced Research Terra Simulator (open-DARTS) framework (Voskov et al. 2023) designed for modelling energy transition applications (Khait and Voskov 2018b;Wang et al. 2020;Lyu and Voskov 2023).This implementation features an efficient forward reservoir simulation but also inverse modelling capability based on the adjoint gradient method.The adjoint implementation for multicomponent, multi-phase system in petroleum-related reservoir simulation problems are described in Tian et al. (2021), and both the forward and inverse modelling capabilities have been extended into generic energy transition applications (Tian and Voskov 2023).This extension allows for the incorporation of various effects, such as energy transfer, gravity, diffusion, and chemical reaction, through corresponding operators.
This paper presents a two-stage approach for history matching tracer test image data, using first a simple Poisson-like solver in a very parsimonious zonated model, then escalating to an adjoint-based framework for a much richer model with ≈ 72000 transmissibility parameters.The simple zonated model uses structured grids, but we move to unstructured grids for the high dimensional model.The adjoint method uses operator-based linearization (OBL) for the development of the misfit gradient, and its efficiency makes possible the modelling of heterogeneity and computation of uncertainties in the history matching process.To estimate the uncertainty of the model inference, we use the standard Randomized Maximum Likelihood (RML) method (Oliver et al. 2008), and a spatial model of heterogeneity in lognormal permeability is added to the geological model.
The structure of this paper is as follows.In Sect.2, we give a short description of the FluidFlower benchmark setup.In Sect.3, history matching is performed using a simple zonated model using buoyancy-free Poisson-like physics.To further improve the history matching results and account for permeability uncertainty and buoyancy effects, enhanced history matching using a two-stage approach with richer physics is described in Sect. 4. Section 5 discusses the results of this enhanced method, including uncertainty estimates and forecasting behaviour.Discussion and conclusions then follow in the usual way.

FluidFlower Benchmark Description
The FluidFlower experimental rig (Nordbotten et al. 2022) comprises an engineered heterogeneous sand pack assembled within a thin (25 mm) vertical "filled" Hele-Shaw cell, about 2.8 m wide by 1.3 m high.The layered heterogeneous structure is filled using sands filtered into different grain sizes: the sand is placed in a layered fashion between the front glass panel and a sealed back panel with perforations for ports that can be used as injectors, producers, or pressure gauge locations.A sketch of the experimental rig with colored layers is shown in Fig. 1.
Experimentally, the sand facies have been sieved into groups labelled ESF, C, D, E, F, G, and there are 3 "fault" regions manufactured in the model.
Permeability and porosity measurements for the different types of sand have been performed in independent sand-pack experiments (Nordbotten et al. 2022).We expect that the in-situ values of sand layer parameters may vary due to non-uniform sand distribution and boundary effects.The tracer tests have generated tracer data in the form of timestamped camera images.These images are digitized and mapped on to grid representations Fig. 1 The sketch of the experimental rig geometry following FluidFlower benchmark description as needed.Pressure data are measured at gauges attached to ports in the rig, but the gauges are some distance from the ports, on narrow feed plumbing.
Since the principle objective of the FluidFlower comparison project is to predict the flow of CO 2 in a controlled study, and because this flow will depend strongly on the flow properties of the media to leading order, it is sensible to use all experimental information to estimate permeability and porosity.Apart from the sand-pack experiments, the richest source of such information is the history matching of the tracer experiments.Section 3 describes a simple and fast way to invert the tracer data for leading-order estimates, and this is augmented to richer models and full-physics simulation in Sect. 4.

History Matching Based on Zonated Model and Simple Poisson-Like Solver
In this section, we describe a fast, parsimonious, but approximate baseline history match using a simple single-phase flow model.The closely similar densities of the tracers, together with the extremely rapid pressure communication in the system, are exploited to write down a fast approximate forward model which can be exploited for inversion.

Governing Forward Model
For the tracer tests, flow is presumed to be single phase, isotropic, the density-independent of tracer, and governed by the averaged thickness (2D model) diffusion equation where p is the pressure, h the cell thickness, the porosity, k w the endpoint permeability of the media for water, w the water viscosity, c t the total (rock plus water) compressibility, and Q the volume rate of water injection at the ports as a function of time.The porosity and thickness h are taken as "known"-from sample measurement and thickness measurement bilinear interpolations.Given the experimental care in the sieving process and sandpack porosity measurements, and the relative insensitivity of unconsolidated porosity values, it was deemed unnecessary to model porosity variation with extra parameters.The sample measurements for (endpoint) permeability, for the labelled facies, are notated k w,l in the below, for facies l.
The digitized and rasterized model of the facies is shown in Fig. 2, with labels 1, … 9 for the facies {1=ESF, 2=C, 3=D, 4=E, 5=F, 6=G, 7=Fault1, 8=Fault2, 9=Fault3}.Fault 2 (facies 8) is designed to be impermeable, and no inferences for its properties are performed in the below.The digitization process uses a linear mapping between intensity and tracer concentration, accounting for labelled porosity and local thickness, and calibrated using known injection volumes and mass conservation.The computed values map closely to the interval [0, 1], but are clipped before use in inversion.
From the digitized and interpreted images of facies labels F i ∈ L = {1, 2, ..7, 9} , in voxel i, for facies {ESF, C, D, E, F, G, Fault1, Fault3}, the absolute permeabilities are modelled by k w,i = u F i k F i in gridblock i, where the model is u = {u l } , l ∈ L .The supplied "core" sample absolute permeabilities k l are taken as per Table 1.
The model parameters u l , l ∈ L are expected to be O(1) correction factors, and the res- caling is designed to make the inversion as well-scaled as practically reasonable.The , for the experiment box size L, is very short, typically t ≈ 5 ms using scaling = .45, w = 10 −3 Pa.s, c t ≈ 0.7310 −9 Pa −1 , L = 1 m , k sc = 6.9874 × 10 −10 m 2 .All pressures are rescaled to dimensionless pressure using a scaling P sc = Q sc w ∕(k sc h 0 ) , with additional chosen constants h 0 = 0.025 m (typical cell width), Q sc = 6.25 × 10 −7 m 3 ∕ s (the water injection rate) yielding scaling pressure P sc = 35.778Pa.
For this reason, the forward modelling has approximated the response as steady state on the time scale of the experiment, with the pressure corresponding to the steady-state solution of Eq. (1) at each instant of time.During the experiment, the rates Q are sustained as constant over 30-min interval chunks, so the overwhelming majority of the data are collected with the system equilibrated in terms of pressure.In the experimental data, we see the measured pressures respond virtually instantly to the applied Q when it changes.The implication of the assumption of constant density is that the fluid configuration remains fixed as long as there is no water injection.The triple-injection tracer experiment was split into 3 successive injections/tranches, each of three 30 min constant-rate injections with rest intervals: The upshot of this is that for the first 6 h of the tracer experiment, about 3 h are injecting, and 3 h waiting.After Day 1, there is a ≈ 16 h wait till the third tranche of tests on Day 2, during which buoyancy rise in the tracer becomes evident in the comparison of the images at the end of tranche 2 and the start of tranche 3. Since the gravity effects are slow, we expect this effect in the first two tranches to be relatively weak, but the effect is clear in the 3rd tranche images, especially around the 9_3 port injection.The Poisson-like model neglects this buoyancy effect, but for small buoyant drift we expect the residual error at the tracer front to end up symmetrically distributed around the prediction of the constant-density model, so we believe the approximation is reasonable, especially for profiles formed around newly-injected ports.Roughly speaking, we expect the approximation to increase the variance of the predictive model to the leading order, but not the bias.Similarly, dispersion is treated as a minor effect, and probably confounded with 3D effects and heterogeneity.During injection, the Peclet number is estimated from the tracer velocity and grain size to be Pe ≈ 50 , so advection should be more dominant, but dispersion will smear the front and the sharp modelled tracer front from the Poisson model should fit the mildly dispersed experimental front at a middle value corresponding to an average velocity.
Thus, on the basis that time-stepping the diffusion equation was unnecessary, we solve the steady-state Poisson equation with Q at two different places (ports 9_3, 17_7), corresponding to the experiment performed.The resulting fixed pressure fields and velocities are then steady over the period where the injection rate is held steady, and are sufficient to compute the advection of the tracers over this interval and the observed pressure fields.The model was based on a 5 mm grid ( N x = 568 , N y = 300 ), with upper boundary condi- tion set as fixed (atmospheric) pressure in the water column, and no-flow Neumann conditions on the left, right, and lower edges of the model.A typical pressure solution for injection at port 17_7 is shown in Fig. 3.
To model the tracer movement, tracers were advected along streamlines using the fixed velocities computed from the Poisson solves, continuing for the correct time corresponding to each tracer color.The tracer advection was implemented using the upwinding scheme described in Vreugdenhil and Koren (1993).The time-stepping in the tracer computation was adjusted so the number of time steps divided the total tracer time exactly, which has the merit that any computed quantities from the tracer image are a smooth differentiable function of the parameters in the PDE.

History Matching
For inversion, the data available consisted of the injection port and monitor port pressures during the injections, and colour tracer images at the end of the injection periods.The cross-port pressures were conspicuously noisy and clearly close to the noise floor of the instruments.Significant drift was evident in these measurements, and even the stable values showed the curious property of the amplitude not diminishing in a (Port17_7) 5.19 − 5.49 pm(clear) 6.20 − 6.50 pm(clear) 7.20 − 7.50 pm(clear) (Day1)  (Port9_3) 8.52 − 9.22 pm(clear) 9.52 − 10.22 pm(clear) 10.52 − 11.22 pm(clear) (Day1)  (Port9_3) 2.48 − 3.18 pm(blue) 3.48 − 4.18 pm(blue) 4.48 − 5.18 pm(blue) (Day2) consistent way with the distance from the injection port.This has implications in the inversion if this data is weighted very heavily.The injection pressures were strong signals but measured some way away (20 cm or more) from the actual injection face and subject to unknown frictional and other losses in the feed plumbing.This makes them not very useful for inversion.By contrast, the tracer images were very clearly interpretable, rich in spatial content, and not obviously contaminated by an experimental artifact of any significant kind.
The cell is initially filled with a blue tracer.Injections were modelled at Q = 2250 ml/h ( 6.25 × 10 −7 m 3 ∕ s) for 3 ×30 min at port 17_7 (with clear water), then 3 × 30 min at port 9_3 (more clear water), then 3 × 30 min at 9_3 with blue tracer.The forward modelling operation, which implements standard tracer advection under an upwinding scheme, is expressed below as a function f t (u) which generates concentration profiles, which are very close to unity inside the swept region, and fall rapidly to zero at the tracer front.The tracer advection is stepped forward for precisely the number of time steps needed for the injection, and numerical integrations of the total tracer mass over the modelling grid at the end of the simulation agree very closely with the mass known to be injected from Q in the tracer source.Under the assumptions of the single-phase PDE and the fast equilibration time, the experimental 30 min wait time between injections does not need to be modelled, as nothing happens in the Poisson model if the sources are switched off since the velocities are then instantly zero and no advection occurs.The modelled tracer positions at the end of each 30 min injection period are compared to digital image experimental data for inversion.
The inversion of this data was couched as a Bayesian inverse problem with a likelihood P(d obs |u) formed as a joint probability using pressure and tracer data d obs = {P obs , c tracer } .The model was taken to be multiplier modifiers of the permeabil- ity parameters, per facies, and applied in a "paint-by-numbers" fashion over the labelled facies model.The thickness and porosity data were considered to be sufficiently precise and experimentally stable to be fixed for the purposes of model prediction and inversion.The Bayesian framework was completed with the provision of a weak prior for the modifier parameters, of Gaussian form P(u l ) ∼ N(1, 2 ) with = 5 for each parameter.The associated prior covariance is C p = diag { 2 } .The model point estimate at the global maximum of the posterior probability is referred to as the MAP (maximum a posteriori) inversion.
The inversion is performed using a Levenberg-Marquardt routine (Nocedal and Wright 2006;Madsen et al. 2004), which requires the Jacobian J = { f∕ u} of the for- ward response with respect to the unknown model parameters.Since the model dimensionality was very low and the forward model speed very fast (measured in seconds), this was computed using simple forward differences.
The negative log posterior E(u) ∼ − log(P(d obs |u)P(u)) used in the optimization step was written as a standard l 2 misfit energy where the cross-port pressure misfit, accumulated over only stable average measurements at ports p5.3_1, p5.7_1, p9.3_1, p15.5_1, p17.7_1, p17.11_1 is and the trace image mismatch is written as The weights p , t are adjusted so the tracer data is dominant in the likelihood as this data is much more abundant and artifact-free.The prior Bayesian term amounts to and has a very benign influence on the inversion, except that the likelihood term from the tracer image is expected to be sensitive to permeability ratios only, i.e. has a null space associated with a global scalar multiplier.If no pressure data are used, the weak prior will have the effect of producing a MAP point as the nearest point in the likelihood null space to the prior mean point u = 1 , i.e. the core measurements.In dimensionless units, the pres- sure data P obs are O(1) numbers, but rather noisy, so setting p = 1 seems appropriate.The tracer data c are processed from the digital images to have concentration values ranging over 0 < y < 1 .Since the associated l 2 norm has a very large number of voxels, t is scaled such that the tracer misfit energy is E tracer (u) = 1000∕2 for a model that produces no tracer concentration ( f t (u) = 0 ), i.e. the information content is equivalent to 1000 measurements.In practice since the volume in which experimental and forward-modelled concentrations differ is only a small fraction of the image, the misfit energy from this term ends up being O(10), perhaps equivalent to putting 10-fold the emphasis on the tracer images as the crosswell pressure data.
One possible approach was to assert that the inconsistencies in the crosswell pressure data were too problematic to warrant their inclusion, and that the inversion should be performed on the basis of the 3-injection tracer data alone.It was also considered reasonable to merge the parameters for regions 5 and 6, since region 6 is at the edge of the modelling region and will have a more fragile permeability inference.
The corresponding parameter inferences are as per each parameter formed from the inverse Hessian matrix at the final optimum, where The final inversion forward model images and associated data snapshots are depicted in Fig. 4.
A more optimistic approach was to include the crosswell data, but down-weighted in the sense previously described, and allow independence in zone 5 and 6 scaled permeabilities.The result of this inversion is shown in Table 2 (Model B).One sees that the parameters differ  1 3 from the previous inversion but within the estimated standard deviation associated with the estimated "statistical power" of the data embedded in the likelihood weightings.Since permeability is usually a sensitive parameter, the differences between these inverted values and the core estimates could be due to many factors, including significant experimental variability in the core experiments, modelling error, inversion sensitivity, or other reasons.

History Matching of FluidFlower Benchmark Using Two-Stage Approach
The preceding model fits show clear systematic structure in the tracer misfit residuals due to the simplistic physics, and neglect of zone-internal heterogeneity.This can only be addressed by escalating the forward physics to a model that includes buoyancy, but also with adjoint capabilities to manage the increased dimensionality that will inevitably follow upon an increased spatial resolution in the model.The capabilities of the open-DARTS code meet these needs very well, but it is helpful to rehearse the basic conservation and flow laws that are implemented in this code.

Governing Equations for Forward Simulation
The mass conservation equations describe a flow dynamic system bounded in the domain with volume Ω and surface Γ .The conservation equation can be written as: where M c is the accumulation term for the c th component ( c = 1, … , n c , index of the mass components [e.g., water, CO 2 ]), F c is the flux term of the c th component, n is the unit nor- mal direction pointing outward to the domain boundary, and Q c is the source/sink term of the c th component.
For the accumulation term M c for a given component c, it can be written as: where is porosity, s j is the saturation of phase j, j is the density [kmol∕m 3 ] of phase j, and x cj is the molar fraction of component c in the jth phase.
The mass flux F c for a given component c is written as: Here v is the velocity and follows Darcy's law including gravity effects for the given phase j: (2) where K is the permeability tensor [mD] , k rj is the relative permeability of phase j, j is the viscosity of phase j [mPa ⋅ s] , p j is the pressure of phase j [bar], j =  j ḡ is the specific weight [N∕m 3 ] (where ḡ is the gravity acceleration), z is the depth vector [m].The source/ sink term Q c mainly includes terms related to chemical reactions, which are absent in this study.
Based on the finite volume method using two-point flux approximation, the discretized form of Eq. ( 2) for the ith reservoir gridblock can be written as: Equation ( 6) shows the residual form of discretized Eq. (2).We introduce V i is the volume of the ith gridblock and i is the state variables at the current time step.In addition, i(k−1) is the state variables at the previous time step and Δt is the time step.The term a l is the contact area of the interface l between neighbouring elements.
The resulting highly nonlinear system of equations is linearized using the Operator-Based Linearization approach (Voskov 2017;Khait and Voskov 2018a;Lyu et al. 2021a), and a modified Newton-Raphson method is employed at each nonlinear iteration.For modelling of tracer experiments, we used the same CO 2 -brine physics used in the benchmark with the concentration of CO 2 in the injected stream below the solubility limit and a linear dependency of density on CO 2 concentration.

Initial Model Tuning
Based on interpreted high-resolution images of the rig, we built an unstructured mesh grid model, shown in Fig. 5.
Following the experimental description of unconsolidated layering in the rig, the mesh grid model is divided into multiple zones/layers filled with nine different types of sand in total.Different layers are assigned to different permeability and porosity values.The permeability anisotropy is modelled using a vertical/horizontal anisotropy ratio, per zone.This anisotropy ratio models the laminations clearly visible in high-resolution images, mostly related to sorting of fine-grain sands.Within the same layer, the same type of sand is maintained, and the petrophysical values of the grid cells are kept constant.( 6) Fig. 5 The unstructured mesh grid of FluidFlower model The petrophysical properties of the layers in Fig. 1 including permeability are initially estimated from the sand-pack experiments.These measurements will differ from the in-situ properties due to experimental variations in sand deposition/assembly, loading, and boundary effects.For the fitting procedure, we keep the porosity unchanged, while the permeability and the anisotropy of the layers are taken as the free parameters.A significant decision was taken to also introduce the density of the clear water as an additional parameter.The optimization was again based on Gauss-Newton methods in low dimensions from finite differences of efficient forward simulations.Specifically, the objective function is defined as: where E is the objective function, u are the model variables, G is the model response, and d obs is the observation data.
Instead of using the original tracer concentration of the experimental images as the observations, we process the experimental tracer images into several binary images, corresponding to the end of each 3 × 30 minute tracer injection group; see Fig. 6.The first row of this figure demonstrates the experimental images of the tracer plumes at the 3rd, 6th, and 9th "macro" time steps.The second row shows the associated digitized binary map of each time step.The value of the red colour is set as 1, while the blue colour is 0. A threshold of t = 7 × 10 −5 tracer concentration (mole fraction and unitless) is used to binarise the response G(u) to either 0 or 1.Although the objective function defined by the binarised model response and observations is mathematically non-differentiable, empirically, the numerical implementation is robust to this discontinuity.These binary maps delineate the boundaries of the tracer plumes and will be utilized as the observations in the initial stage of model tuning.This simple image processing technique will be later compared with the more accurate image recognition shown in Fig. 4.
The initial model parameters are obtained from Table 2 (Model A, except anisotropy).They are the permeability of layers C, D, E, F, Fault1, Fault3, G, and the anisotropy factor of the layer ESF in the z-direction.These eight parameters are updated in the history matching iterations.The optimal values of these eight parameters are shown in Table 3.Note that Table 3 also includes placeholder information for the water layer W. ( 7) Fig. 6 The experimental images of tracer plumes and the associated digitized binary maps Figure 7 shows the corresponding forward model of the last time step (i.e. the 9th time step in Fig. 6) based on the tuned petrophysical properties of Table 3.In Fig. 7, the left figure is the tracer concentration map of the model response.The middle figure is the binary plot based on a given threshold of the tracer concentration (i.e.red colour if concentration >  t ; otherwise, blue).The right figure is the experimental result of the tracer test at the last time step.It is evident that the position of the simulated tracer plume (the left and middle figure) is systematically shifted from the experimental results.Assuming the forward physics is adequate, and that the optimization has found a global minimum, this indicates that the spatial model needs to be enriched in order to better match the experimental results.In the next section, we enlarge the model with additional parameters to address the underfitting problem.Spatial variation of the petrophysical properties will also be introduced as an attempt to model the variation caused by the manual sand deposition process in the experimental rig.Fig. 7 The tracer test results at the last time step

The Generation of the Prior Ensemble
In the previous sections, nine types of sand are "painted" into the model to represent the different layers and faults.This forms homogeneous petrophysical properties within a single layer.However, since the sand is manually filled into the rig, it is difficult to maintain homogeneous properties within the same layer.Moreover, compaction is observed in the course of pre-injection flushing, environmental temperature fluctuations, water injection, etc, which is certain to be non-uniform, and the tracer profiles exhibit visual features probably caused by internal heterogeneity.We concluded that it is necessary to include spatial heterogeneity into the model.
To this end, a high dimensional permeability field K is introduced to the model, equipped with a prior distribution log(K) ∼ N(log(K t ), C M ) , where K t is the spatially mapped tuned permeability from Table 3, and C M = diag { 2 } , with = 0.02 .Samples may be drawn from this model in many standard ways, such as sequential Gaussian simulation, as depicted in Fig. 8.

History Matching Framework Using Randomized Maximum Likelihood (RML)
The enriched model is now equipped with some regularization apparatus, which behaves in a very similar way to explicit Bayesian prior declarations.Further, the high dimensionality of this model ( n = 72262 parameters) means that dense Jacobian methods are no longer possi- ble, and efficient gradient methods must be sought.The new objective function reads where R is the regularization term.Posterior samples of this model, associated with the Gibbs distribution P(u|d obs ) ∼ exp(−E(u)∕2) , are generated using the Randomized Maxi- mum Likelihood (RML) technique, briefly summarised in the Appendix.
In the gradient-based optimization method, we take all n transmissibilities of the grid cell interfaces as the model variables to do the history matching.In this study, we use the adjoint method in the DARTS framework to evaluate the gradient.The idea of the adjoint method is to introduce a Lagrange multiplier T to combine the objective function E with the governing equation g of the forward simulation.The new augmented objective function Ē can be written as: where is the state variable.The term T is the transposed form of the Lagrange multiplier.Following Tian et al. (2021) and Tian and Voskov (2022), the discretized adjoint method can then be written as: where K is the total number of the simulation time steps, j k is the misfit between the model response and the observation data at the given simulation time step k.Therefore E( , u) can also be defined as: The Lagrange multiplier T can be solved from Eqs. ( 11) and ( 10) recursively in a backward manner.Finally, the gradient  Ē u can be obtained from Eq. ( 12).

History Matching Formulation and Results
The history matching approach described below commences with binarized tracer images.Particular reformulations of the likelihood function proved to be efficacious in handling the data in this form, involving the hinge loss.We sketch this below, before surveying the inversion results generated using this measure.

Hinge Loss Function
As per the approach in Sect.4.2, instead of using the original tracer plumes of the images as the observations, we digitized the original images into binary maps, e.g. the second row of Fig. 6.Based on the information of the binary maps, we replaced the misfit term of Eq. ( 8) (i.e. the first term at the right-hand side in this equation) with the hinge loss function.The updated objective function reads: where H(u) is the hinge loss function.If the given cell is located at the red region in Fig. 6, the hinge loss function is defined as: (10) Similarly, when the given cell is located at the blue region in Fig. 6, the hinge loss function is defined as:

Regularization and Randomized Maximum Likelihood
The regularization term R(u) in Eq. ( 8) is defined as: where R is the scaling coefficient, and u ref is the reference of the model variables (i.e.transmissibility at the grid cell interfaces).The covariance matrix C M is modelled here as a diagonal, since the experimental rig is artificially filled, with few of the usual spatial correlations associated with deposition or consolidation.
In the Randomized Maximum Likelihood (RML) sampling method, u ref is a realization from the prior.The details of the generation of these priors are described in Sect.4.3.Using the digitized binary observations in the RML method, we implemented multiple history matchings using sampled reference models u ref .Figure 9 demonstrates two samples of the history matching results of the tracer concentration at the last time step.These two samples correspond to the realizations of P10 and P90 of the accumulated mobile CO 2 in Box B, which will be demonstrated in Fig. 11.Their associated changes of the permeability after history matching are shown in Fig. 10.
Compared with the tracer concentration result in Fig. 7, the results in Fig. 9 match the experimental plume observations more accurately, especially for the top right plume.The total misfit error of the inferred model ensemble is reduced by around 39%. Figure 10 illustrates two RML samples of the permeability difference between the model before and after history matching.These two samples correspond to the realizations from Fig. 9.It is clear from Fig. 10 that the permeabilities in the tracer plume zone and the high permeable zone have more permeability adjustment compared with the rest of the region; large parts of the permeability of the sands are barely adjusted in Fig. 10 because the tracer does not reach them.Specifically, mathematically speaking, regions where the tracer does not reach exhibit zero gradient with respect to that particular zone, implying that no useful information can be gathered from those regions in the course of calibration.Thus, this calibration result is likely not transferable to another injection scenario.In this sense, we believe conducting more tracer tests at various injection locations would help capture sensitivity across the entire domain.However, in reality, multiple field experiments or geophysical tests can be very expensive or sometimes infeasible, making it difficult to cover the entire domain.Therefore, the primary concern should be whether the calibrated model can predict results to an acceptable degree.While a more detailed characterization would result in more accurate predictions, we must strike a balance between the accuracy of our predictions and the cost of calibration.In this FluidFlower project, the aim was to simulate a practical CO 2 storage project, in which a tracer test has been conducted, albeit with limited tracer test scenario and data collection.Our subsequent analysis suggests that our predictive model is highly effective in forecasting the CO 2 plume, as illustrated in our co-authored publication "The FluidFlower International Benchmark Study: Process, Modelling Results, and Comparison to Experimental Data." ( 16) Utilizing the inferred model ensemble, we performed numerical simulations using 100 realizations of CO 2 injection, all with identical CO 2 injection rates and thermody- namic conditions.The ensemble outcomes of the CO 2 plumes at the last time step (after 120 h of CO 2 injection), as illustrated in Fig. 11, illustrates the variability associated    uniform thickness for the experimental rig, potentially introducing inaccuracies in estimating the model's total volume.

Regularization and RML Based on the Concentration Interpreted from Images
An alternative strategy of history matching was also conducted based on the concentration data interpreted from experimental images, as per the Tracer 1, 2, and 3 data in Fig. 4. In this case, we use the original objective function Eq. ( 8) to infer the model, instead of applying the hinge loss to the binary observations in Eq. ( 14).Based on the same realization ensemble u ref and RML method, the history matching results of the tracer concentration and the changes of permeability are shown in Figs. 13 and 14.
The error of the inferred model ensemble is reduced by around 16%.The misfits associated with the parsimonious layer models are not discussed here.Figure 13 has more similar plume shapes to the experimental images compared with the results in Fig. 9. Figure 14 shows more intensive adjustments of permeability by the optimizer in comparison to Fig. 10.This disparity can be attributed to the digitised concentration observations derived from the image, which provide additional "concentration gradient" We also conducted the simulations of CO 2 injection based on the ensemble of 100 realizations of inferred models.The obtained results of the 10th and 90th percentiles of the entire ensemble are presented in Fig. 15.The CO 2 mass within Box A and B, as observed across the 100 realizations, is shown in Fig. 16.The curves in Figs.16 and  12 show that the CO 2 mass within Box A remains relatively insensitive to variations in model calibration.This is primarily due to its substantial coverage of the CO 2 injection formation, where CO 2 concentrations significantly outweigh those in regions exhibiting fingering patterns characterized by lower CO 2 concentration.In essence, the concen- tration within the injected formation exerts a dominant influence on the mass of CO 2 , rendering the mass of CO 2 within Box A rather indifferent to adjustments in the model calibration.While there is an improvement in the history matching results when compared to those of Fig. 9, the prediction outcomes of CO 2 injection do not exhibit signifi- cant enhancements.This observation suggests that the model predictions are not excessively sensitive to the selection of observations in the inversion.The ensemble results appear to be a reasonable quantification of the uncertainty associated with the fingering phenomenon.

Discussion
In this study, we have performed history matching of the FluidFlower benchmark using a two-stage approach.In the first stage of history matching, a zonated model and a simple Poisson-like solver were utilized to infer the petrophysical parameters for each layer.The inferred model parameters are used in the second stage of history matching, which achieved notable improvements through the use of multiple high-dimensional spatial model realizations and adjoint methods.The misfits for both binary and concentration observations were reduced by 39% and 16%, respectively.
Naturally, the computational time for the simple-physics zonated model is much lower compared to the high-dimensional spatial model.Running a single-zonated model only takes approximately 5 seconds due to the assumption of steady-state pressure, and the Marquardt method converges in O(10) iterations; for comparison, the high-dimensional spatial model forward model takes around 70 seconds on CPU, and the optimization typically around 100 times of iterations.
However, the simple zonated steady-state model does not account for the geological uncertainty within the layers associated with the dynamic changes in tracer behavior.The high-dimensional spatial model is better suited to quantifying this uncertainty, and more realistically models the complexity of real geological reservoirs and fluid dynamics.In the simulations of CO 2 injection using the inferred ensemble of high-dimensional spatial models, variability in the fingering phenomenon emerges from permeability uncertainty in the posterior.The percentile plot of CO 2 plumes of the entire ensemble show that the majority of uncertainty is concentrated in regions where fingering occurs.
In comparing the CO 2 injection simulations based on the model ensemble calibrated to binary and concentration observations, we observed an improvement in the history matching results when using the concentration observation.Somewhat paradoxically, the flow prediction outcomes of CO 2 injection did not exhibit significant enhancements.These findings indicate that the model predictions are not excessively sensitive to the selection of observations during the model inference process, at least within the context of the FluidFlower project.The FluidFlower benchmark employs a relatively modest and small-scale experimental configuration.Nevertheless, in the context of real field-scale CO 2 storage projects, it is important to assess the transferability of the findings from this FluidFlower study.The geological and physical conditions in field-scale projects may differ substantially from the experimental setup used in our study.A thorough program of numerical simulations, as detailed in the simulation study for the FluidFlower benchmark (Wapperom et al. 2023), will highlight the importance of physical conditions when extending the FluidFlower methodology to actual CO 2 storage projects.A crucial issue is the substantial difference in physical properties such as density and viscosity between surface and typical reservoir conditions.Surface conditions introduce very strong buoyancy and marked nonlinearity into the system, thus posing computational challenges for both linear and nonlinear solvers.One symptom of this is the inability of the CPR preconditioner to operate effectively for the FluidFlower problem.In more representative reservoir conditions, smaller variations in physical properties result in comparatively fewer computational difficulties.
The methodology of this paper may be applied for practical scenarios resembling the shallow FluidFlower experimental conditions.The use of an unstructured mesh for discretization provides high flexibility in representing diverse features encountered in practical field-scale settings.Further, it is realistic to expect that real geological scenarios introduce increased uncertainty, such as channels, faults, and other complexities.A comprehensive approach to uncertainty quantification via simulation, based on a sizable ensemble of both geological scenarios and realizations, is a necessity.For a wider range of models and scenarios, the history matching procedure will naturally be more computationally intensive.These challenges underscore the importance of an efficient approach to history matching in real field-scale applications.To mitigate the computational burden in cases where available computational resources are more limited, one may consider implementing proxy methods or dimensionality reduction approaches to optimize computational efficiency.

Conclusion
A comprehensive history matching for the FluidFlower experimental results has been described in this study.Several stages of model tuning and inference have been implemented to improve the history matching results and estimate a spatial uncertainty range in permeability from the FluidFlower tracer tests.The pressure data were of limited use in this experiment due to noise and unknown frictional losses, but since the salient issues were unrepresentative of field measurements, we do not believe this reduces the importance of pressure data in the context of field-scale CO 2 injection.For example, in the case of the Ketzin pilot site's onshore CO 2 storage project, pressure and temperature are con- tinuously monitored by sensors over the entire project lifetime (Martens et al. 2013).Furthermore, the mining authority has mandated an operating pressure threshold for this pilot project.Their monitoring data reveals a positive correlation between the injection rate and reservoir pressure.
Compared with the pressure data collected in this experiment, the tracer images were of high quality and a rigorous test of model choices and fitting.The history matching was first conducted utilizing a fast single-phase physics model without buoyancy effects, using a simple Poisson-like solver.A structured grid and digitized concentration data were utilized in this stage to infer the permeability of six facies types and two faults.The numerical framework of this approach is a simple finite difference code with upwind tracer advection, using Marquardt methods for gradient-based optimisation.
To improve the model fit and address the issues around buoyancy and heterogeneity, enhancements in two stages were added, first for physical modelling, and later, significant model inflation to capture spatial heterogeneity.For the physics enhancements, the forward physics was switched to an unstructured grid with a facility for modelling buoyancy.Eight petrophysical parameters (including permeability and anisotropy) for model variables and simplified binary tracer images were used.Inversions were performed using the Gauss-Newton method with explicit Jacobians obtained by finite differences.This lowdimensional inversion was then taken as the reference model of the next stage of history matching using the adjoint-based gradient method as the workhorse in a Randomized Maximum Likelihood (RML) approach, with a richer and high-dimensional spatial model to account for heterogeneity.The coarse-scale morphology of the tracer images was remarkably well predicted by the low dimensional models, but fine detail could only be well reproduced by significant inflation of the spatial model space.
The high-dimensional spatial model is naturally better suited to representing typical geological heterogeneities, and has enough freedom to fit the finer spatial detail in the tracer data.We found that forward predictions of CO 2 distribution and fingering patterns are sensitive to this heterogeneity, but the uncertainties of these forecasts were not sensitive to whether the tracer data were transformed in the history matching process.The forward prediction of CO 2 mass within a specified area (e.g., Box A and B) demonstrates differing outcomes when accounting for the influence of buoyancy, despite the CO 2 plume contour exhibiting minimal sensitivity to the buoyancy effect.
We have demonstrated the cost and value of an escalating range of history matching models on the FluidFlower CO 2 storage benchmark model.The simple parsimonious models are fast and efficient but do not fit all the details of the experimental data due to simplifications in the physics and lack of resolution in the spatial models.The heterogeneous models are a successful demonstration of an adjoint method derived from operator-based linearization of multiphase physics in a very high dimensional model, and persuasively capture effects related to buoyancy and heterogeneity.In the predictive regime, the RML ensemble method produces a very useful uncertainty forecast of CO 2 injection behaviour.This is particularly evident in the fingering regions, which are strongly influenced by permeability fluctuations.
where C −1 D is the inverse of the covariance matrix of the combined measurement and forward modelling error process, ∼ N(0, C D ) is a "bootstrap" sample from that noise distri- bution, and u ref ∼ N( ū, C M ) is a sample from the geostatistical prior.Note that in this study, the experimental measurement error data is not available, so the approach has been simplified by setting = 0 ; most of the model uncertainty is controlled by the prior rather than experimental noise in this sort of problem.Finally, we get the modified objective function shown in Eq. ( 8) (omitting the coefficient 1 2 ).A single sample of the posterior PDF can be obtained by solving Eq. (A1).

Fig. 8
Fig. 8 A sample from the prior distribution for K with spatial lognormal variation

Fig. 9
Fig. 9 Figure a and b represent two samples of the history matching results based on the binary observations.Figure c and d are their associated threshold data.Figure e is the binary map of experimental results.These figures demonstrate tracer test results at the last time step

Fig. 10
Fig. 10 Two samples of the changes of the permeability distribution after the history matching based on the binary observations Fig. 11 The forward predictions of CO 2 plumes at the last time step (after 120 h of CO 2 injection).Figure a and b show the CO 2 plumes of the 10th and 90th percentiles of the entire ensemble of 100 inferred models calibrated to binary observations, respectively.Figure c shows the forward predictions of plumes based on the inferred zonated model.The solid and dashed boxes represent Box A and B, respectively

Fig. 12
Fig. 12 The mobile gaseous CO 2 of Box A (left figure) and Box B (right figure) in Box B based on the inferred models calibrated to binary observations.The blue and grey curves represent the results based on the inferred zonated model and the inferred model ensemble, respectively

Fig. 13
Fig. 13 Figure a and b represent two samples of the history matching results based on the tracer concentration observations.Figure c and d are their associated threshold data.Figure e is the binary map of the experimental results.These figures depict tracer test results at the last time step

Fig. 14
Fig. 14 Two samples of the changes of the permeability distribution after the history matching based on the tracer concentration observations

Fig. 16
Fig. 16 The mobile gaseous CO 2 of Box A (left figure) and Box B (right figure) based on the inferred models calibrated to concentration observations.The blue and grey curves represent the results based on the zonated model and the inferred model ensemble, respectively

Table 1
Nordbotten et al. (2022)and fixed porosities for modellingThe experimental values are taken from sand-pack experiments documented inNordbotten et al. (2022)

Table 2 (
Model A).The table shows the dimensionless scaled model inference and corresponding actual unitized values.The final column is the dimensionless uncertainty estimate 2 l

Table 2
Inversion results from 3-tracer inversionModel A: regions 5 and 6 merged to a common parameter, no crosswell data.Model B: model with facies 5,6 independent, including crosswell data