A multiscale method for data assimilation

In data assimilation problems, various types of data are naturally linked to different spatial resolutions (e.g., seismic and electromagnetic data), and these scales are usually not coincident to the subsurface simulation model scale. Alternatives like upscaling/downscaling of the data and/or the simulation model can be used, but with potential loss of important information. Such alternatives introduce additional uncertainties which are not in the nature of the problem description, but the result of the post processing of the data or the geo-model. To address this issue, a novel multiscale (MS) data assimilation method is introduced. The overall idea of the method is to keep uncertain parameters and observed data at their original representation scale, avoiding upscaling/downscaling of any quantity. The method relies on a recently developed mathematical framework to compute adjoint gradients via a MS strategy in an algebraic framework. The fine-scale uncertain parameters are directly updated and the MS grid is constructed in a resolution that meets the observed data resolution. This formulation therefore enables a consistent assimilation of data represented at a coarser scale than the simulation model. The misfit objective function is constructed to keep the MS nature of the problem. The regularization term is represented at the simulation model (fine) scale, whereas the data misfit term is represented at the observed data (coarse) scale. The computational aspects of the method are investigated in a simple synthetic model, including an elaborate uncertainty quantification step, and compared to upscaling/downscaling strategies. 
The experiment shows that the MS strategy provides several potential advantages compared to more traditional scale conciliation strategies: (1) expensive operations are only performed at the coarse scale; (2) the matched uncertain parameter distribution is closer to the “truth”; (3) faster convergence behavior occurs due to faster gradient computation; and (4) better uncertainty quantification results are obtained. The proof-of-concept example considered in this paper sheds new light on how one can reduce uncertainty within fine-scale geo-model parameters with coarse-scale data, without the necessity of upscaling/downscaling either the data or the geo-model. The developments demonstrate how to consistently formulate such a gradient-based MS data assimilation strategy in an algebraic framework, which allows for implementation in available computational platforms.

More often than not, the observed data and the forward model are described at different spatial scales. In fact, it is an open question at which scale data should best be assimilated: the simulation model scale or the observed data scale [10].
The relevance of addressing the multiscale nature of inverse problems is observed in the recent literature on the topic. A collection of articles about multiscale forward modeling strategies and multiscale challenges associated with inverse problems can be found in [25]. In [36], the authors propose a multiscale data assimilation scheme based on the decomposition of the objective function so that the error covariance can be estimated for distinct spatial scales. A multiscale parameter field discretization designed to reduce the dimensionality of the inverse problem via an adaptive grid refinement is presented in [55]. The impact of scale dissimilarity, in terms of observation information content and parameter space size, on ensemble collapse in ensemble-based methods [1] has been addressed by [16,17,40] via upscaling/homogenization techniques [7,15]. These authors also benefit from coarse-scale simulations to improve the computational efficiency of the inverse problem. In [18], a multiscale method is proposed which accounts for microscale features by assuming that they can be represented by a low-dimensional parameterization. Nonetheless, the aforementioned works are based on the assumption that the fine-scale uncertain parameters can be homogenized and represented at a coarser scale.
On the other hand, data assimilation strategies based on multiscale (MS) simulation [28,32] have also been developed. MS methods are efficient simulation strategies capable of solving the flow problem on a coarser grid, while accurately representing fine-scale heterogeneities. An adjoint-based multiscale finite volume method for the computation of sensitivities was presented in [20] and later extended to time-dependent [19] single-phase flow in porous media. More recently, a general framework for the computation of multiscale gradients has been introduced in [42], with an extension to multiphase flows [41]. The latter two are based on a general framework for derivative computation, whose algebraic nature does not rely on any assumption regarding the nature of the parameters, observations, or objective function type. Also, in [3], a multiscale inversion technique is presented based on a Markov chain Monte Carlo method that also relies on the generalized multiscale finite element method [8].
Despite this body of work in the inverse modeling literature, when one is interested in assimilating spatially distributed data, an implicit assumption is usually made that the observed data are described at the same scale as the parameters. In fact, assuming one is not interested in changing the scale of the model parameter description, some treatment must be employed to change the scale of the observed data or of the forward model response. The literature indicates that upscaling/downscaling of the observed data to the forward model scale, as a preprocessing step with respect to the data assimilation process, is the strategy most employed in practice [23,24,35]. In the present work, we are particularly interested in addressing the spatial scale dissimilarity between observations and the discretized forward model. In many applications, the spatially distributed data cannot be observed (due to limitations in the acquisition process, e.g., in terms of resolution) at the parameter resolution that is necessary to accurately describe important physical phenomena. Therefore, here we consider observed data acquired at a resolution coarser than the parameter resolution, and hence represented on a grid that is coarser than the one employed in the forward simulation. Note that sub-grid observations, at a resolution finer than the forward model discretization, are outside the scope of this work.
Another important aspect to be considered in data assimilation and uncertainty quantification (UQ) studies is the fact that they rely on computationally demanding algorithms. Different techniques such as Monte Carlo (MC) methods [37], the Ensemble Kalman Filter (EnKF) and its derivations [1,11,14], and randomized maximum likelihood (RML) [45] have been developed to perform these studies. A comparison between the different techniques is provided by [12]. Regardless of the technique, a common feature they share is the necessity of performing many forward model runs in order to reasonably sample the posterior probability distribution of the reservoir uncertain parameters. As already mentioned, upscaling [7,15] can build faster, reasonably accurate forward models that can speed up the sampling process. However, to accurately represent some physical phenomena, e.g., mixing, diffusion, fluid fronts, or compositional capillary effects, fine-scale resolution is of utmost importance. Hence, the ability to keep the high-fidelity description of geological parameters is fundamental for an adequate reservoir characterization. Partial-differential-equation (PDE)-constrained optimization techniques can be employed in the solution of the inverse problem. In this case, it is well known that gradient-based algorithms are the most efficient ones, especially when combined with efficient gradient computation, and that gradients obtained with the adjoint method [2,31,33,47] are the most efficient and accurate ones.
The objective of this work is to develop and demonstrate an inverse modeling method that, at the same time, (1) is computationally efficient, (2) addresses the scale dissimilarity issue with minimum loss of information, and (3) is capable of updating the highest-fidelity model description. To this end, we exploit multiscale (MS) simulation strategies in order to (1) speed up the forward simulation, while preserving fine-scale geological features, (2) efficiently compute gradient information, and (3) seamlessly conciliate model and observed data scales. For a comprehensive review of the recent developments associated with MS methods applied to reservoir simulation, see [36].
The remainder of this paper is organized as follows. Firstly, a brief overview of how data assimilation is approached from a Bayesian perspective is presented. Next, we state our target forward model, consisting of incompressible single-phase flow in heterogeneous porous media. Also, we revisit the MS solution of the flow equation in a purely algebraic presentation. Thereafter, we discuss the data assimilation problem setup, focusing on the challenges of assimilating spatially distributed data. More specifically, we discuss alternatives on how to conciliate data and model scales. Then, we introduce our multiscale data assimilation strategy, consisting of, basically, an MS objective function and an MS gradient computation strategy. The method employed here is largely based on the MS gradient computation strategy discussed by [42]. We compare the different data conciliation methods and our newly introduced method based on a synthetic 2D case. We focus our experiments on both the maximum a posteriori (MAP) estimate and UQ via the RML method. A discussion of the results and the challenges that the method can encounter is presented next. Finally, a summary of the developments and results, as well as future research perspectives, is presented.

Problem statement
Let N_d denote the number of space dimensions. Let Ω ⊂ R^{N_d} be the problem domain with boundary ∂Ω. Let s ∈ R^{N_d} be an arbitrary space position and n ∈ R^{N_d} a unit normal vector to ∂Ω. Our analysis focuses on phenomena governed by an elliptic PDE, denoted by g, of the form

g(x, θ) := −∇ · (θ ∇x) − q = 0 in Ω, (1)

subject to suitable boundary conditions on ∂Ω, where x = x(s) is the variable of interest, q = q(θ) is the sink/source term, and θ is the heterogeneous uncertain coefficient which we aim to estimate via the assimilation of real system observations. We assume θ has no separation of scales; hence, homogenization techniques would lead to unavoidable approximations of the effective property. Let y = h(x, θ) be the observable model responses; then the inverse problem can provide an estimate for θ given a description of the real observation errors ε. We assume that d, the real system data, can only be observed at a resolution that is coarser than or equal to the resolution at which θ is described.

Inverse problem as a PDE-constrained optimization
We base our developments on a Bayesian framework. Let N_y be the number of observable responses, N_θ the number of model parameters, and N_x the number of primary (state) variables. According to Bayes' theorem, the posterior probability density function (PDF) can be computed as

f(θ|d) ∝ f(d|θ) f(θ),

where θ ∈ R^{N_θ} is the vector of model parameters and d ∈ R^{N_y} is the vector of observable responses. If the a priori PDF of the uncertain parameters, f(θ), and the measurement errors from the observations are assumed Gaussian, it can be shown that the conditional a posteriori distribution is given by [50]

f(θ|d) ∝ exp(−O(θ)),

where the objective function O ∈ R is given by

O(θ) = (1/2)(θ − θ_prior)ᵀ C_θ⁻¹ (θ − θ_prior) + (1/2)(y − d)ᵀ C_D⁻¹ (y − d). (6)

In the above equations, y ∈ R^{N_y} is the vector of model responses (outputs), x ∈ R^{N_x} is the state vector, θ_prior ∈ R^{N_θ} is the prior mean, C_θ ∈ R^{N_θ×N_θ} is the parameter covariance matrix, and C_D ∈ R^{N_y×N_y} is the covariance matrix of the measurement errors. The solution of Eq. 6 can be stated as a PDE-constrained optimization problem as [46]

min_θ O(θ) subject to g(x, θ) = 0, θ_min ≤ θ ≤ θ_max, (7)

where g : R^{N_x} × R^{N_θ} → R^{N_x} represents the set of forward model equations and θ_min ∈ R^{N_θ}, θ_max ∈ R^{N_θ} are, respectively, the parameter lower and upper bound vectors. The efficient solution of Eq. 7, resulting from the discretization of Eq. 1, requires gradient-based methods [44] combined with efficient gradient computation methods.
For this purpose, applying the chain rule to Eq. 6 gives

∇_θ O = C_θ⁻¹ (θ − θ_prior) + Gᵀ C_D⁻¹ (y − d), (8)

where G = dy/dθ ∈ R^{N_y×N_θ} is the so-called sensitivity matrix, representing the sensitivity of the responses w.r.t. the parameters. Efficient gradient methods are analytical; more specifically, in inverse problems where the number of parameters is greater than the number of output functionals, the adjoint method is the most accurate and efficient one [46]. The efficient computation of the multiplication of Gᵀ by an arbitrary vector (as in Eq. 8) via the adjoint method is discussed in [47].
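To make Eqs. 6 and 8 concrete, the sketch below evaluates the objective function and its gradient for a hypothetical linear forward model y = Gθ (so the sensitivity matrix is a constant matrix); the dimensions, covariances, and the model itself are illustrative assumptions, not quantities from the paper. The analytical gradient is verified against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_theta, n_y = 8, 3

# Hypothetical linear forward model y = G @ theta (stand-in for the PDE model)
G = rng.normal(size=(n_y, n_theta))
theta_prior = rng.normal(size=n_theta)
d = rng.normal(size=n_y)
C_theta = 0.5 * np.eye(n_theta)          # prior parameter covariance (placeholder)
C_D = 0.1 * np.eye(n_y)                  # measurement-error covariance (placeholder)

def objective(theta):
    """O(theta) = 1/2 (theta - prior)^T C_theta^-1 (theta - prior)
                + 1/2 (y - d)^T C_D^-1 (y - d),  with y = G @ theta (Eq. 6)."""
    r_theta = theta - theta_prior
    r_d = G @ theta - d
    return 0.5 * r_theta @ np.linalg.solve(C_theta, r_theta) \
         + 0.5 * r_d @ np.linalg.solve(C_D, r_d)

def gradient(theta):
    """Chain rule (Eq. 8): grad = C_theta^-1 (theta - prior) + G^T C_D^-1 (y - d)."""
    return np.linalg.solve(C_theta, theta - theta_prior) \
         + G.T @ np.linalg.solve(C_D, G @ theta - d)

# Finite-difference check of the analytical gradient
theta0 = rng.normal(size=n_theta)
eps = 1e-6
fd = np.array([(objective(theta0 + eps * e) - objective(theta0 - eps * e)) / (2 * eps)
               for e in np.eye(n_theta)])
assert np.allclose(fd, gradient(theta0), rtol=1e-5, atol=1e-6)
```

For the real problem, G is never formed explicitly; only its action (or its transpose's action) on vectors is needed, which is what the adjoint method provides.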

Randomized Maximum Likelihood (RML)
RML [45] is an approximate sampling method for UQ, which obtains the j-th sample of the posterior PDF by solving Eq. 7 for a given sample θ_uc,j from the normal distribution N(θ_prior, C_θ) and a given sample d_uc,j from N(d, C_D). Therefore, Eq. 6 can be re-written as

O_j(θ) = (1/2)(θ − θ_uc,j)ᵀ C_θ⁻¹ (θ − θ_uc,j) + (1/2)(y − d_uc,j)ᵀ C_D⁻¹ (y − d_uc,j). (9)

Hence, a minimization problem has to be solved for every posterior PDF sample one wants to estimate. This is only feasible with efficient gradient computation methods, as described in the previous section.
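The RML sampling step can be sketched as follows; the prior mean, data, and covariances are placeholder values, and each generated pair (θ_uc,j, d_uc,j) would seed one minimization of Eq. 7 with the objective of Eq. 9.

```python
import numpy as np

rng = np.random.default_rng(1)
n_theta, n_y = 6, 4

theta_prior = np.zeros(n_theta)          # placeholder prior mean
d = np.ones(n_y)                         # placeholder observed data
C_theta = 0.25 * np.eye(n_theta)
C_D = 0.01 * np.eye(n_y)

def rml_samples(n_samples):
    """Draw the perturbed prior/data pairs used by RML:
    theta_uc,j ~ N(theta_prior, C_theta) and d_uc,j ~ N(d, C_D).
    Each pair defines one minimization problem (Eq. 9)."""
    L_theta = np.linalg.cholesky(C_theta)
    L_d = np.linalg.cholesky(C_D)
    for _ in range(n_samples):
        theta_uc = theta_prior + L_theta @ rng.normal(size=n_theta)
        d_uc = d + L_d @ rng.normal(size=n_y)
        yield theta_uc, d_uc

pairs = list(rml_samples(200))
thetas = np.array([t for t, _ in pairs])
# The sample mean of the perturbed priors should be close to theta_prior
assert np.allclose(thetas.mean(axis=0), theta_prior, atol=0.2)
```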

The forward model
The set of discretized equations that describes the forward simulation at the fine scale can be algebraically expressed as [42]

g_F(x, θ) = 0, (10)

where g_F : R^{N_F} × R^{N_θ} → R^{N_F} represents the set of algebraic forward model equations resulting from the numerical discretization of Eq. 1 over a fine grid G_F containing N_F fine-scale cells, x ∈ R^{N_F} is the state vector, and the subscript F refers to "fine scale." Equation 10 implicitly assumes a dependency of the state vector x on the parameters θ, i.e., x = x(θ). Once the model state is determined, the observable responses of the forward model are computed. The forward model responses may depend not only on the model state but also on the parameters themselves, and can be expressed as

y = h_F(x, θ), (12)

where h_F : R^{N_F} × R^{N_θ} → R^{N_y} represents the output equations [30]. It is assumed that g_F can be described as

g_F(x, θ) = A(θ) x − q(θ) = 0, (13)

where A = A(θ) ∈ R^{N_F×N_F} represents the elliptic discrete operator and q = q(θ) ∈ R^{N_F} is a vector of source terms and boundary conditions.

Multiscale simulation
A multiscale (MS) solution strategy can be algebraically devised [56,59] by firstly computing a coarse-scale solution from

(R A P) x̂ = R q, (14)

after which an approximate fine-scale solution is formed as

x ≈ x′ = P x̂. (15)

Here, x̂ ∈ R^{N_C} is the coarse-scale solution (N_C ≪ N_F) and x′ ∈ R^{N_F} is the approximated fine-scale solution. The prolongation operator P = P(θ) is an N_F × N_C matrix that maps (interpolates) the coarse-scale solution to the fine scale, where N_C is the number of coarse grid blocks. The restriction operator R = R(θ) is defined as an N_C × N_F matrix which maps the fine scale to the coarse scale.
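A minimal numerical illustration of Eqs. 14 and 15 follows, using a 1D heterogeneous diffusion operator and, for brevity, a piecewise-constant prolongation as a stand-in for true MSFV basis functions (an illustrative assumption; the structure of the two-level solve is the point here).

```python
import numpy as np

n_c, ratio = 5, 4
n_f = n_c * ratio                        # fine cells per coarse block = ratio

# Fine-scale 1D diffusion operator A(theta) with a heterogeneous coefficient
rng = np.random.default_rng(2)
theta = np.exp(rng.normal(size=n_f + 1))            # interface transmissibilities
A = np.zeros((n_f, n_f))
for i in range(n_f):
    A[i, i] = theta[i] + theta[i + 1]
    if i > 0:
        A[i, i - 1] = -theta[i]
    if i < n_f - 1:
        A[i, i + 1] = -theta[i + 1]
q = rng.normal(size=n_f)

# Restriction R: finite-volume sum over the fine cells in each coarse block.
# Prolongation P: piecewise-constant interpolation (placeholder for MSFV bases).
R = np.zeros((n_c, n_f))
for c in range(n_c):
    R[c, c * ratio:(c + 1) * ratio] = 1.0
P = R.T.copy()

x_hat = np.linalg.solve(R @ A @ P, R @ q)           # coarse system (Eq. 14)
x_ms = P @ x_hat                                    # prolonged approximation (Eq. 15)
x_fine = np.linalg.solve(A, q)                      # reference fine-scale solution

# Only an n_c x n_c system was solved, and the restricted residual vanishes:
assert np.allclose(R @ (A @ x_ms - q), 0.0)
```

With MSFV basis functions in P, the prolonged solution also captures fine-scale variation inside each coarse block, which the constant interpolation above cannot.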
In multiscale methods, the scaling operators are constructed based on locally supported basis functions. Different strategies to build MS basis functions are available in the literature [8,28,32,43]. In this work, we employ the multiscale finite volume (MSFV) method [32]. However, we emphasize, as will be clear from the formulation, that the framework allows the employment of different MS methods, as long as they can be expressed in terms of R and P. Next, we discuss the MSFV basis function construction.

Construction of scaling operators via the MSFV method
The MSFV discretization relies on two overlapping coarse grids, namely the primal and the dual coarse grid, which are superimposed on a given fine grid. The grids are illustrated in Fig. 1. The primal coarse grid contains N_C control volumes Ω̄_i, i ∈ {1, …, N_C}, and the dual coarse grid contains N_D local domains Ω̃_j, j ∈ {1, …, N_D}.
The MSFV basis functions are constructed based on local solutions of the elliptic governing Eq. 1 on every dual domain Ω̃_j, with no right-hand side and subject to special (reduced-order) boundary conditions [32,56]:

∇ · (θ ∇ϕ_i^j) = 0 in Ω̃_j, with ϕ_i^j(x_k) = δ_ik at the vertices, (16)

where ϕ_i^j is the basis function associated with the vertex i in Ω̃_j, the governing operator is reduced to its projection along ∂Ω̃_j, δ_ik is the Kronecker delta, and k ∈ {1, …, 2^{N_d}} denotes the vertices in Ω̃_j (N_d is the spatial dimension of the problem: 1, 2, or 3). Figure 1 shows the multiscale finite volume grids and illustrates the interfacial connections between cells used in the wirebasket ordering.
Assuming a finite volume discretization of Eq. 1, the basis functions, and hence the prolongation operator, can be constructed directly from the given fine-grid linear system matrix as [60]

P = 𝒫 [ A_II⁻¹ A_IE Ã_EE⁻¹ A_EV ; −Ã_EE⁻¹ A_EV ; I_VV ], (17)

after A in Eq. 13 is re-ordered in a wirebasket ordering [54] as

A_wb = [ A_II A_IE 0 ; A_EI A_EE A_EV ; 0 A_VE A_VV ], (18)

where 𝒫 ∈ R^{N_F×N_F} is a permutation matrix that reorders from wirebasket to natural ordering; A_II, A_IE, A_EI, A_EE, A_EV, A_VE, and A_VV are, respectively, the sub-matrices of A corresponding to the interior-interior, interior-edge, edge-interior, edge-edge, edge-vertex, vertex-edge, and vertex-vertex cell connections; and N_I, N_E, and N_V are, respectively, the total numbers of interior, edge, and vertex cells in the fine grid. The interfacial connections are illustrated in Fig. 1. Correspondingly, g_I, g_E, g_V, x_I, x_E, x_V, and q_I, q_E, q_V denote, respectively, the model equations, state, and source term sub-vectors corresponding to the interior, edge, and vertex cells.
Note that the construction of P requires setting A_VE = 0, A_EI = 0, and A_VV = I_VV, and modifying the corresponding entries of A_EE accordingly, resulting in Ã_EE; this is equivalent to the localization assumptions required to build the basis functions, as stated in Eq. 16 [60].
If the FV method is used in the fine-scale system discretization, the restriction operator can be defined as the sum of the equations of all the fine cells contained in each coarse cell, i.e., [60]

(R)_cf = 1 if fine cell f lies inside coarse cell Ω̄_c, and 0 otherwise, (19)

hence, in combination with Eq. 17, establishing the multiscale finite volume (MSFV) method. Alternatively, a Galerkin restriction operator could be used by making R = Pᵀ, which, in combination with Eq. 17, establishes the MS finite element (MSFE) method. While the MSFV method is conservative by construction, the MSFE method provides monotone solutions.

Computational efficiency
An analytical computational efficiency estimate of the MSFV method for the solution of single-phase flow is discussed in [32] and briefly revisited here. For CPU studies based on the MSFV pressure equation solution of 3D heterogeneous domains, see [56]. Let N_R be the coarsening ratio employed to construct the MS coarse grid, N_L the number of local problems that must be solved per coarse grid vertex (4 in 2D and 8 in 3D problems), and b a constant associated with the specific linear solver employed (the solver cost is assumed to scale with the number of unknowns raised to the power b). Assuming that the solver employed for the MS system is as efficient as the one employed to solve the fine-scale system, it can be shown via an asymptotic operation count that the cost ratio O_MS/O_FS between the MS and fine-scale (FS) elliptic PDE solutions can be expressed as a function of N_R, N_L, b, and N_F [42]. To illustrate the MS simulation efficiency potential, let us assume b = 1.3, N_R = 10, N_L = 4, and N_F = 10^8. That gives O_MS/O_FS = 0.082; hence, the MS solution costs only about 8% of a fine-scale solution in this scenario.
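The following back-of-the-envelope sketch reproduces the flavor of this estimate. The exact expression in [42] may differ in its bookkeeping; the formula below (one coarse solve plus N_L local basis-function solves of roughly (N_R + 1)^{N_d} unknowns per coarse cell, each at cost n^b) is an assumption that lands near, but not necessarily exactly at, the quoted 0.082.

```python
def ms_cost_ratio(n_f, n_r, n_l, b, n_d=2):
    """Rough MS/fine cost-ratio estimate (a sketch of the asymptotic argument,
    NOT the exact expression from [42]): a solve of n unknowns costs ~ n**b."""
    n_c = n_f / n_r**n_d                              # number of coarse cells
    coarse_solve = n_c**b                             # one coarse-system solve
    basis_solves = n_c * n_l * ((n_r + 1)**n_d)**b    # local basis-function solves
    fine_solve = n_f**b                               # reference fine-scale solve
    return (coarse_solve + basis_solves) / fine_solve

ratio = ms_cost_ratio(n_f=1e8, n_r=10, n_l=4, b=1.3)
# For these inputs the estimate lands near the ~8% figure quoted in the text
assert 0.05 < ratio < 0.12
```

Note that the basis-function cost dominates the coarse solve here, which is why MS methods amortize basis functions across many solves whenever possible.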

Adjoint gradient computation
The maximum a posteriori probability (MAP, [46]) estimate of the uncertain parameters is obtained by solving the optimization problem stated in Eq. 7, with the objective function (OF) given by Eq. 6 (and using Eq. 12), whose gradient is given by

∇_θ O = C_θ⁻¹ (θ − θ_prior) + Gᵀ m, (21)

where

m = C_D⁻¹ (y − d) (22)

and m ∈ R^{N_y}. Following an implicit differentiation strategy [33,47], the sensitivity matrix G can be obtained from the total derivative of Eq. 12 with respect to θ as follows [42]:

G = dy/dθ = (∂h_F/∂x)(dx/dθ) + ∂h_F/∂θ, (23)

with dx/dθ obtained from the total derivative of Eq. 10,

(∂g_F/∂x)(dx/dθ) = −∂g_F/∂θ, (24)

where ∂g_F/∂x, ∂g_F/∂θ, ∂h_F/∂x, and ∂h_F/∂θ are the partial derivative matrices obtained by differentiating Eqs. 10 and 12 with respect to x and θ.
The product Gᵀm can be computed at a cost proportional to one backward simulation, regardless of the number of model parameters, via the adjoint method, by combining Eqs. 23 and 24, as discussed in [47]. By defining the adjoint vector λ as the solution of

(∂g_F/∂x)ᵀ λ = (∂h_F/∂x)ᵀ m, (25)

it follows that

Gᵀ m = (∂h_F/∂θ)ᵀ m − (∂g_F/∂θ)ᵀ λ. (26)
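The cost advantage of the adjoint route can be illustrated with a hypothetical linear model g(x, θ) = Ax − Bθ = 0 and output y = Hx (all matrices are placeholders): forming G column by column requires one linear solve per parameter, whereas the adjoint product Gᵀm requires a single transposed solve.

```python
import numpy as np

rng = np.random.default_rng(8)
n_x, n_theta, n_y = 10, 50, 4

# Hypothetical linear model: g(x, theta) = A x - B theta = 0,  y = H x
A = rng.normal(size=(n_x, n_x)) + n_x * np.eye(n_x)   # dg/dx (kept well conditioned)
B = rng.normal(size=(n_x, n_theta))                   # equals -dg/dtheta
H = rng.normal(size=(n_y, n_x))                       # dh/dx (here dh/dtheta = 0)
m = rng.normal(size=n_y)

# Direct route: form the full sensitivity matrix G = H (dx/dtheta); this costs
# one linear solve per parameter (n_theta right-hand sides).
dx_dtheta = np.linalg.solve(A, B)                     # from A (dx/dtheta) = B
G = H @ dx_dtheta
direct = G.T @ m

# Adjoint route: ONE solve with A^T, independent of n_theta (Eqs. 25-26 pattern)
lam = np.linalg.solve(A.T, H.T @ m)                   # (dg/dx)^T lam = (dh/dx)^T m
adjoint = B.T @ lam                                   # G^T m = -(dg/dtheta)^T lam

assert np.allclose(direct, adjoint)
```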

Conciliation of spatially distributed data and forward model scales
In the data assimilation of spatially distributed observations, Eq. 6 assumes that the observations d and the model responses h are described at the same scale. Due to resolution issues and acquisition limitations, however, this is often not the case: observations are frequently unavailable at the scale of the model responses. Therefore, if no MS simulation is available, either d must be downscaled to the simulation scale or h must be upscaled to the observation scale.
The downscaling of observed data can be expressed as

d = D d̂, (27)

where D is an N_F × N_C downscaling operator, d̂ ∈ R^{N_C} is the coarse-scale observation, and d ∈ R^{N_F} is the interpolated observation at the fine scale. Additionally, one must be able to describe the data covariance matrix Ĉ_D, originally at the (coarse) observation scale, at the fine scale. This can be achieved by setting

C̄_D = D Ĉ_D Dᵀ, (28)

where C̄_D is the covariance matrix represented at the fine scale. It is simple to show that Eq. 28 holds because of the linearity of the expectation operator given the Gaussian assumptions. From Eq. 27, the expectation of d is given by

E[d] = D E[d̂]. (29)

The covariance of d can then be computed as (Emerick, A. A., personal communication, March 23, 2018)

C̄_D = E[(d − E[d])(d − E[d])ᵀ] = D E[(d̂ − E[d̂])(d̂ − E[d̂])ᵀ] Dᵀ = D Ĉ_D Dᵀ. (30)

Alternatively, one could upscale the model responses as

ȳ = U y, (31)

where U is an N_C × N_F upscaling operator, and solve Eq. 6 directly against the coarse-scale observations d̂. One advantage over the downscaling strategy is that Ĉ_D is kept at its original scale. We highlight that we only consider strategies that change the observed data / response scale, and do not consider strategies that change the original uncertain parameter description scale. This is because we aim to update the most accurate description of the model parameters, so that important fine-scale features (crucial to describe the physical phenomena) are not lost.
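A small sketch of Eqs. 27-30 follows, using a piecewise-constant downscaling operator D (an illustrative choice) and a Monte Carlo check that downscaled samples indeed have covariance D Ĉ_D Dᵀ; all sizes and the coarse covariance are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n_c, ratio = 3, 2
n_f = n_c * ratio

# Piecewise-constant downscaling operator D: each coarse value is copied to
# the fine cells it contains (Eq. 27: d = D @ d_hat)
D = np.zeros((n_f, n_c))
for c in range(n_c):
    D[c * ratio:(c + 1) * ratio, c] = 1.0

# Placeholder coarse-scale data covariance (symmetric positive definite)
M = rng.normal(size=(n_c, n_c))
C_D_hat = M @ M.T + n_c * np.eye(n_c)

# Eq. 28: the fine-scale covariance of d = D @ d_hat is D C_D_hat D^T
C_D_fine = D @ C_D_hat @ D.T

# Monte Carlo check: sample d_hat ~ N(0, C_D_hat), downscale, estimate covariance
L = np.linalg.cholesky(C_D_hat)
samples = (D @ (L @ rng.normal(size=(n_c, 20000)))).T
empirical = np.cov(samples, rowvar=False)
assert np.allclose(empirical, C_D_fine, atol=0.5)
```

Note that C̄_D built this way is rank-deficient (rank N_C at most), which is one of the consistency issues that the downscaling strategy introduces.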

Multiscale data assimilation
An MS solution strategy provides a coarse-scale solution that can, in principle, be represented at any resolution coarser than the fine-scale resolution. In data assimilation studies where the spatially distributed data resolution is known and is coarser than the model resolution, the MS grid can be chosen to be at the same resolution as the assimilation grid. This allows spatially distributed model responses to be computed at the same scale as the observed data; thus, instead of manipulating the data and/or the uncertain parameters, we can accurately compute responses at the observed data scale. Next, we devise a multiscale data assimilation procedure based on this feature.
Therefore, a multiscale objective function is introduced by re-writing Eq. 6 as

O_MS(θ) = (1/2)(θ − θ_prior)ᵀ C_θ⁻¹ (θ − θ_prior) + (1/2)(h̄ − d̂)ᵀ Ĉ_D⁻¹ (h̄ − d̂), (32)

where h̄ = h̄(x̂, θ) is the response at the (coarse) observation scale and x̂ is the coarse state variable, computed from Eq. 14.
Hence, the misfit term is computed at the coarse scale (the scale at which data are assimilated), while the regularization term is described at the fine scale (the scale at which the model parameters are described).
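This split between a fine-scale regularization term and a coarse-scale misfit term can be sketched as follows; the coarse response h̄ is mimicked here by simple block averages, a stand-in assumption for the coarse state of Eq. 14, and all values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
n_f, n_c = 16, 4

theta = rng.normal(size=n_f)             # fine-scale parameters (placeholder)
theta_prior = np.zeros(n_f)
C_theta = np.eye(n_f)
d_hat = rng.normal(size=n_c)             # observations at the coarse scale
C_D_hat = 0.01 * np.eye(n_c)             # coarse-scale data covariance

def h_bar(theta):
    """Hypothetical coarse-scale response: block averages of theta stand in
    for the coarse state x_hat computed by Eq. 14."""
    return theta.reshape(n_c, -1).mean(axis=1)

def O_ms(theta):
    """MS objective (Eq. 32): regularization at the fine scale,
    misfit at the coarse (observation) scale."""
    r_theta = theta - theta_prior                    # fine-scale residual
    r_d = h_bar(theta) - d_hat                       # coarse-scale residual
    return 0.5 * r_theta @ np.linalg.solve(C_theta, r_theta) \
         + 0.5 * r_d @ np.linalg.solve(C_D_hat, r_d)

val = O_ms(theta)
assert val >= 0.0
```

The key point is that the two quadratic forms live in different spaces (R^{N_F} and R^{N_C}), so no quantity ever needs to be rescaled.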

Multiscale gradient computation
As discussed in [42], the state vector can be described as a combination of both sets of primary variables at the fine and coarse scales, i.e.,

x̆ = [x̂ᵀ x′ᵀ]ᵀ, (33)

and, similarly, the model equations can be represented as a combination of the equations at both scales, i.e.,

ğ(x̆, θ) = [ (R A P) x̂ − R q ; x′ − P x̂ ] = 0. (34)

The definition of the state vector as in Eq. 33 is a key aspect of this development: it allows the description of the state not only at the fine scale, but also at the coarse scale. The simulator responses y obtained from the multiscale method are represented as

y = h̆(x̆, θ), (35)

so the sensitivity matrix G can be computed in a multiscale fashion as

G = (∂h̆/∂x̆)(dx̆/dθ) + ∂h̆/∂θ, (36)

with dx̆/dθ obtained from the total derivative of Eq. 34,

(∂ğ/∂x̆)(dx̆/dθ) = −∂ğ/∂θ, where ∂ğ/∂x̆ = [ R A P  0 ; −P  I ]. (37, 38)

The product Gᵀm can be computed at a cost proportional to one coarse-scale backward simulation, regardless of the number of model parameters, via the MS adjoint method presented in [42]. Defining the MS adjoint vector λ̆ as the solution of

(∂ğ/∂x̆)ᵀ λ̆ = (∂h̆/∂x̆)ᵀ m, (39)

whose only non-trivial block is a coarse-scale linear system with matrix (R A P)ᵀ, and rearranging the terms, it follows that

Gᵀ m = (∂h̆/∂θ)ᵀ m − (∂ğ/∂θ)ᵀ λ̆, (40)

where the partial derivative matrices ∂ğ/∂θ and ∂h̆/∂θ are obtained by differentiating Eqs. 34 and 35 w.r.t. θ; in particular, ∂ğ/∂θ contains products involving ∂R/∂θ, ∂A/∂θ, ∂q/∂θ, and ∂P/∂θ.

Scaling operators partial derivative computation
The partial derivative computation of MSFV basis functions was originally discussed in [20] and recast in an algebraic, general mathematical framework expressed in terms of P in [42]; an efficient algorithm that computes the product β (∂P/∂θ) x̂ in a backward fashion was also introduced there. Recall from Eq. 13 that g_F = A(θ) x − q(θ), whose partial derivative w.r.t. θ reads

∂g_F/∂θ = (∂A/∂θ) x − ∂q/∂θ. (45)

The partial derivative of Eq. 17 w.r.t. θ is a function of the sub-blocks of ∂A/∂θ (Eq. 46). Substituting Eq. 45 in Eq. 46, it follows that ∂P/∂θ can be expressed in terms of A and ∂g_F/∂θ (Eq. 47). Hence, just as the MSFV prolongation operator can be fully determined directly from the fine-scale system matrix, the partial derivative of the prolongation operator w.r.t. the model parameters can be fully determined from both A and the partial derivative matrix ∂g/∂θ. This allows a more straightforward implementation of the method in existing computational frameworks: in the same way that the MS forward method can be integrated into existing simulation platforms if access to the system matrix and the grid topology is available, our method can be implemented in existing data assimilation frameworks if access to the grid topology, the system matrix, and the partial derivative matrices is provided. Note, however, that as in any adjoint derivative computation framework, the computation of the partial derivative matrices for different parameters is usually a challenge. Derivative computation techniques like automatic differentiation [5] can be employed as a flexible solution to this problem. Also note that, even though Proposition 1 indicates the important ability of determining ∂P/∂θ from ∂g/∂θ, it does not provide enough information about how to efficiently compute this partial derivative. It is discussed in [42] how to efficiently compute the left/right multiplication of ∂P/∂θ in the context of, respectively, the direct and adjoint methods.
Algorithm 4 in that paper presents an efficient way to compute the product β (∂P/∂θ) x̂, at a cost proportional to the number of coarse cells and independent of the number of parameters, suitable for use in combination with Eq. 40 for its efficient solution.
Because we aim to compute the gradient of a scalar function (see Eq. 6), the computational cost is proportional to solving one so-called backward simulation. In our case, this is proportional to the solution of the transposed linear system of equations in Eq. 39, whose system matrix (R A P)ᵀ has a size proportional to the number of coarse grid cells, plus the solution of β (∂P/∂θ) x̂, the cost of which is equal to that of solving the basis functions. See [42] for details. Hence, the computational cost ratio of the proposed method is also given by the estimate presented in the "Computational efficiency" section.

Test case

The uncertainty around the absolute permeability distribution is represented by an ensemble of different permeability realizations. The ensemble is generated via the decomposition of a reference permeability "image" using principal component analysis (PCA) parameterization [29]. Figure 3 illustrates four different permeability realizations from the ensemble of 1000 equiprobable permeability realizations. In order to focus on the MS aspect of the data assimilation process, we assume that pressures can be approximately extracted from a time-lapse seismic survey [4,34,38,39,51]. However, it is important to note that this is not a limitation of the method. If one is interested in performing the data assimilation in a different domain, say, the impedance domain [10], the additional complexity involved is the appropriate incorporation of the seismic forward model equations in the forward model set of equations [9] and, consequently, the computation of the appropriate partial derivative information necessary to compute Eq. 21.
The "true" observed data, p true , is obtained from a twin experiment, where a MS simulation is run using a permeability field randomly chosen from an ensemble of equiprobable model realizations. The coarse observed data is the coarse-scale pressurex computed with Eq. 14, while where z is sampled from N (μ = 0, σ 2 = 1) and √ C D is computed from a Cholesky decomposition. More details on the procedure can be found in [46]. The resulting noisy observed coarse and fine pressure fields are illustrated, respectively, in Fig. 4 d and c.
We consider the observation grid (Fig. 4b) where the observed data is represented to be three times coarser than the model grid (Fig. 4a) in the x and y directions. Hence, it has 7 × 7 grid blocks with grid block size 99.9 × 99.9 × 2 m.
The covariance matrix C_θ is computed from the ensemble of realizations as

C_θ = (1/(N_e − 1)) (Θ − θ̄ eᵀ)(Θ − θ̄ eᵀ)ᵀ,

where Θ is the N_F × N_e matrix whose j-th column is given by the ensemble member θ_j, j ∈ {1, …, N_e}, θ̄ is the ensemble mean, and e = [1, …, 1]ᵀ is a vector of ones of size N_e × 1. The prior is taken to be the ensemble mean, θ_prior = θ̄.

Fig. 4 Schematic representation of the model (fine) grid (a) and the observation (coarse) grid (b). Also, the (noisy) pressure data distribution observed at the fine grid (hypothetical complete observations) (c) and at the actual observation grid resolution (d)

In the fine-scale data assimilation strategy, an adjoint model [47] is used to calculate the gradient of the OF given by Eq. 6. In the multiscale strategy, we employ the MS adjoint gradient computation depicted in Algorithm 4 of [42]. Because the spatially distributed observed data at the coarse scale is the primary variable itself, in Eq. 32 we have ∂h̄/∂x̂ = Ī, where Ī is the N_C × N_C identity matrix, when pressure is observed at the coarse scale, and ∂h_F/∂x = I when pressure is observed at the fine scale. Also, because in this case the relationship between the primary variables and the outputs is not a function of the parameters, it follows that ∂h̄/∂θ = 0 and ∂h/∂θ = 0.
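The ensemble covariance estimate can be sketched as follows (a random placeholder ensemble is used; the cross-check against numpy's estimator confirms the formula).

```python
import numpy as np

rng = np.random.default_rng(5)
n_f, n_e = 20, 1000

Theta = rng.normal(size=(n_f, n_e))       # ensemble matrix, one member per column
theta_bar = Theta.mean(axis=1)            # ensemble mean
e = np.ones((n_e, 1))                     # vector of ones

# C_theta = 1/(N_e - 1) * (Theta - theta_bar e^T)(Theta - theta_bar e^T)^T
Delta = Theta - theta_bar[:, None] @ e.T
C_theta = (Delta @ Delta.T) / (n_e - 1)

# Cross-check against numpy's covariance estimator (rows = variables, ddof = 1)
assert np.allclose(C_theta, np.cov(Theta))
```

For N_e < N_F (the usual case in practice), the resulting C_θ is rank-deficient, which is why a PCA parameterization of the ensemble, as used in this test case, is convenient.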
We utilize the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm as presented in [44], as it is among the most efficient algorithms for the solution of Eq. 7 [22].

Observed data downscaling
Two different approaches for dealing with the scale dissimilarity via downscaling are considered here. In the first one, we downscale the observations measured at the coarse scale by setting

d = Rᵀ d̂, (56)

where R is the MSFV restriction operator. This strategy can be viewed as a constant interpolation of the coarse-scale observations onto the fine-scale model grid.
In the second strategy, we build a multiscale prolongation operator P_prior = P(θ_prior), whose columns are comprised of local multiscale basis functions [32], and prolong (interpolate) the coarse-scale information by setting

d = P_prior d̂. (57)

Note that P_prior is static and can be viewed as an MS downscaling operator.
In both of the aforementioned strategies, Eq. 6 is evaluated with the downscaled data d, and a conventional gradient-based optimization to solve Eq. 7 is run at the model (fine) scale.

Model response upscaling
Two upscaling strategies are considered. In the first one, a simple arithmetic average is applied by setting

ȳ = U y, (58)

where

(U)_cf = 1/N_CF if fine cell f ∈ Ω̄_c, and 0 otherwise. (59)

In Eq. 59, N_CF is the number of fine-grid cells within a given coarse cell c, Ω̄_c is the c-th primal coarse-grid domain, and f is the fine-grid cell index.
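A minimal sketch of the arithmetic-average upscaling operator of Eq. 59 follows, assuming a 1D correspondence between coarse cells and contiguous blocks of fine cells (the 2D case only changes the index bookkeeping).

```python
import numpy as np

n_c, ratio = 4, 3
n_f = n_c * ratio

# Arithmetic-average upscaling operator U (Eq. 59): each coarse value is the
# mean of the fine-grid cells inside that coarse cell
U = np.zeros((n_c, n_f))
for c in range(n_c):
    U[c, c * ratio:(c + 1) * ratio] = 1.0 / ratio

y_fine = np.arange(n_f, dtype=float)      # hypothetical fine-scale responses
y_bar = U @ y_fine                        # Eq. 58

# First coarse value averages fine cells 0, 1, 2
assert np.isclose(y_bar[0], (0 + 1 + 2) / 3)
assert np.allclose(U.sum(axis=1), 1.0)    # each row is a set of averaging weights
```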
In the second upscaling strategy, we again build a prolongation operator P_prior based on θ_prior and upscale the simulated response by setting

ȳ = P_prior† y, (60)

where the † symbol denotes the Moore-Penrose pseudo-inverse. Here, we construct P† from its truncated singular value decomposition (TSVD) [49],

P = Σ_p Λ_p Δ_pᵀ, P† = Δ_p Λ_p⁻¹ Σ_pᵀ, (61)

where Σ ∈ R^{N_F×N_F} and Δ ∈ R^{N_C×N_C} are orthonormal matrices, Λ ∈ R^{N_F×N_C} is a diagonal matrix containing the singular values of P, and the subscript p indicates the first p columns of the matrices, corresponding to the p non-zero singular values.
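The TSVD pseudo-inverse of Eq. 61 can be sketched with numpy; the prolongation operator below is a piecewise-constant placeholder rather than an actual MSFV operator.

```python
import numpy as np

n_f, n_c = 12, 3

# Stand-in full-column-rank prolongation operator (piecewise-constant bases)
P = np.kron(np.eye(n_c), np.ones((n_f // n_c, 1)))

# Pseudo-inverse via the truncated SVD (Eq. 61 pattern):
# P = Sigma Lambda Delta^T  ->  P+ = Delta_p Lambda_p^-1 Sigma_p^T
Sigma, lam, Delta_T = np.linalg.svd(P, full_matrices=False)
p = int(np.sum(lam > 1e-12 * lam[0]))             # keep the p non-zero singular values
P_pinv = Delta_T[:p].T @ np.diag(1.0 / lam[:p]) @ Sigma[:, :p].T

assert np.allclose(P_pinv, np.linalg.pinv(P))     # matches numpy's pseudo-inverse
assert np.allclose(P_pinv @ P, np.eye(n_c))       # left inverse (full column rank)
```

In practice the truncation threshold matters when P is nearly rank-deficient; here the operator has exactly N_C non-zero singular values.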

Maximum a posteriori probability estimate
In this section, we assess the performance of our newly introduced method via the estimation of the maximum a posteriori probability (MAP), in comparison to the upscaling/downscaling strategies discussed in Section 6.1. Six different data assimilation strategies are considered, namely:
1. fine-scale data assimilation with constant prolongation downscaling of the observed pressure (56);
2. fine-scale data assimilation with prior MS prolongation downscaling of the observed pressure (57);
3. fine-scale data assimilation with arithmetic average upscaling of the simulated pressure (59);
4. fine-scale data assimilation with pseudo-inverse MS prolongation upscaling of the simulated pressure (60);
5. the multiscale data assimilation strategy introduced in this work;
6. fine-scale data assimilation with complete observations available at the model (fine) scale.
The latter, a hypothetical situation, is considered the reference case, as if enough resolution were available to resolve the observed property at the (fine) model scale. Also, note that MS operators are used in strategies 1, 2, and 4. For comparison purposes, we consider the objective function normalized by the number of data points N_d. Furthermore, according to [46,50], an acceptable data match is achieved when

|O_N − 1/2| ≤ 5 / √(2 N_d),

where O_N is the normalized objective function value. It is important to highlight that strategies 1-4 have a computational cost similar to that of strategy 6, the hypothetical situation considering complete observations. This is because, regardless of the upscaling/downscaling of the quantities (i.e., observations or responses), which also adds extra computations, the gradient computation is given by Eq. 26, i.e., the solution of a transposed linear system whose size is proportional to the number of fine-grid cells.
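The normalized-misfit acceptance test can be sketched as follows. The expectation of 1/2 and the five-standard-deviation tolerance follow common practice in the cited references, but the exact constants used here are an assumption, not a quote from the paper:

```python
import numpy as np

def normalized_data_misfit(d_sim, d_obs, sigma):
    """Normalized least-squares data misfit O_N = O / N_d (data term only)."""
    n_d = d_obs.size
    residual = (d_sim - d_obs) / sigma
    return 0.5 * float(residual @ residual) / n_d

def match_is_acceptable(o_n, n_d, k=5.0):
    # At the true model, O_N has expectation 1/2 and standard deviation
    # 1/sqrt(2 N_d); accept when O_N lies within k standard deviations of 1/2.
    return abs(o_n - 0.5) <= k / np.sqrt(2.0 * n_d)
```

For small N_d the tolerance band is wide; as N_d grows, O_N must approach 1/2 ever more closely to be accepted.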
Firstly, we present a qualitative discussion based on the MAP-conditioned permeability fields and the final matched pressure fields, compared to the respective "true" permeability and pressure fields. The results for the fine-scale, complete-observation data assimilation exercise are illustrated in Fig. 5, followed by the results from the downscaling and upscaling data conciliation strategies, represented, respectively, by Figs. 6 and 7. Lastly, the results of the data assimilation using our MS data assimilation strategy are illustrated in Fig. 8.

[Fig. 6 caption: Data assimilation results for fine-scale data assimilation with downscaling of the observed data. First row: a "true," b initial, c matched using constant interpolation (R^T), and d matched using the MS prolongation operator (P) pressure fields. Second row: e "true," f prior, g conditioned using R^T, and h conditioned using P permeability fields. The color maps follow the color bar found in Fig. 5.]
From a qualitative point of view, the matched responses from all data assimilation strategies, except those obtained by the constant interpolation downscaling (Fig. 6c) and the arithmetic average upscaling (Fig. 7c), are fairly similar to the observed data. The pressure matches are in accordance both with the fine-scale pressure match (Fig. 5c) and with the "true" pressure field. However, the simpler upscaling/downscaling strategies result in somewhat poorer matches around the injection well, where higher pressure responses are computed. This is due to the simpler interpolation employed in the pressure upscaling/downscaling, which results in a constant pressure for the fine-grid cells within the coarse-grid block where the injection well is located.
Also, from the point of view of the conditioned permeability fields, all assimilation strategies were capable of recovering the main features. Furthermore, not much difference is noted when comparing the upscaling to the downscaling matched permeability fields. To better assess the quality of the parameter matches, we investigate the permeability distributions from the different matching exercises: the density functions of the matched permeability fields are plotted in Fig. 9. Note that, even though the initial permeability distribution is considerably far from the true model (due to the rather homogeneous prior used in the MAP), the complete-observation, fine-scale strategy is capable of reproducing the reference permeability density function. More importantly, the MS data assimilation, with coarse-scale observations only, can also provide a permeability field whose density function is consistent with the "true" permeability density function. The permeability fields obtained by the other strategies are likewise consistent with the "true" permeability distribution.
We also analyze the optimization convergence behavior shown in Fig. 10 for a quantitative assessment of the match. The fine-scale reference data assimilation reaches a normalized OF value very close to the ideal value of 0.5, while all other data assimilation strategies reach relatively higher values, with the simpler constant interpolation and arithmetic average scaling strategies reaching slightly higher OF values. It is important to note that the optimization behavior is remarkably similar for all upscaling/downscaling strategies, as well as for the MS data assimilation strategy presented here. As discussed in [42], even though in this exercise the convergence behavior of our method is similar to that of the others, the computational cost of its gradient computation is proportional to the coarse-grid dimensions, while in all other methods the cost is proportional to the fine-grid dimensions.

Uncertainty quantification
An RML procedure is run for 100 randomly chosen permeability realizations from the 1000-member ensemble (Fig. 3), for each data conciliation strategy. To estimate the conditioned permeability distribution, an L-BFGS optimization is run for each chosen member. The results for this exercise are shown in Fig. 11.
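A minimal RML loop under a toy linear forward model (a stand-in for the actual reservoir simulator) can be sketched with SciPy's L-BFGS implementation. All sizes, noise levels, and the forward operator G below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Toy linear forward model g(theta) = G theta (stand-in for the pressure simulator).
n_theta, n_d = 8, 4
G = rng.standard_normal((n_d, n_theta))
theta_true = rng.standard_normal(n_theta)
sigma_d, sigma_th = 0.1, 1.0
d_obs = G @ theta_true + sigma_d * rng.standard_normal(n_d)

def rml_objective(theta, theta_pr, d_pert):
    """RML objective: regularization around a prior sample plus perturbed data misfit."""
    r_d = (G @ theta - d_pert) / sigma_d
    r_m = (theta - theta_pr) / sigma_th
    return 0.5 * (r_d @ r_d + r_m @ r_m)

# One RML sample per prior realization: perturb both the prior and the data,
# then minimize with L-BFGS (gradient approximated numerically in this sketch).
posterior_samples = []
for _ in range(5):
    theta_pr = theta_true + sigma_th * rng.standard_normal(n_theta)
    d_pert = d_obs + sigma_d * rng.standard_normal(n_d)
    res = minimize(rml_objective, theta_pr, args=(theta_pr, d_pert),
                   method="L-BFGS-B")
    posterior_samples.append(res.x)
```

In the paper's setting, the numerical gradient would of course be replaced by the adjoint (MS or fine-scale) gradient, and the loop would run over the 100 selected ensemble members.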
It can be observed that the permeability marginal PDFs conditioned to the pressure data obtained by the MS data assimilation here introduced (Fig. 11e) are closer to the reference fine-scale conditioned PDFs (Fig. 11f). Additionally, by observing the spread of the conditioned PDFs obtained from the RML employing the upscaling/downscaling strategies, one can note that the MS strategy is also capable of somewhat better representing the uncertainty.

Discussion
Firstly, even though all data assimilation strategies were capable of achieving similar MAP estimates, one should note that the synthetic case used in the experiments has low permeability contrasts. Moreover, the five-spot configuration is very simple and the well spacing is relatively dense compared to the characteristic size of the heterogeneities. Nonetheless, given the good results observed with our MS data assimilation strategy, we believe that the performance of the method in more challenging scenarios is worth investigating. A systematic study of the effects of the underlying geological complexity on the MS assimilation procedure is necessary. The MS ability to preserve fine-scale features is expected to allow for a more detailed description of the fine-scale uncertain parameters. Additionally, we emphasize the importance of the proper representation of the measurement errors at different scales. One must take into account the data redundancy in the case of downscaling the observed data to the model scale.

[Fig. 9 caption: Conditioned permeability marginal PDFs for the different data assimilation strategies.]
One could consider a third, fine-scale-only, MS-based approach, based on the reconstruction of the prolongation operator at every optimizer iteration γ, so that changes in the permeability during the optimization process are also captured by the basis function update. Hence, one could write

d = P_γ d̄,

where P_γ is the prolongation operator reconstructed at optimization iteration γ. This can only be achieved at the expense of reconstructing the basis functions at every iteration γ. We performed studies (not reported here) in which we neglected the partial derivative of P w.r.t. θ but did update P; the results were similar to those obtained when P is not updated and is based only on the prior (57), as reported here. Moreover, [42] discusses how to efficiently compute ∂P/∂θ. This can be an alternative to further take advantage of MS principles even when an MS forward simulation is not available.

[Fig. 10 caption: Optimization performance of the data assimilation using the six different scale conciliation strategies presented in this section. Note that FS represents the hypothetical case where both the observed data and the model parameters are at the fine scale (i.e., strategy 6 in our list).]
In this work, we employ a two-stage MS simulation strategy and, consequently, a two-stage MS gradient computation strategy. We make the primal MS coarse grid coincident with the observation grid resolution. However, the MS data assimilation idea can be extended to seamlessly address data available at multiple scales, or even to consider one or multiple MS grid resolution(s) for assimilation purposes only and different one(s) for the forward simulation. To this end, multilevel multiscale strategies [6] could be applied. Following the same multilevel multiscale strategy, data acquired at different scales (e.g., electromagnetic data, with high resolution close to the well, along with seismic data, with low resolution in the vertical direction) could also be seamlessly and simultaneously assimilated.
Following our studies, and as also reported by [42] and [20], MSFV gradients can be less accurate for highly heterogeneous media. In addition, one may want error control on the MSFV gradient quality for practical applications. Furthermore, LBFGS proposes under/overshooting updates, mainly close to the wells [21], which also constitutes a challenging scenario for the MSFV gradient computation. These challenges can be addressed from the optimization point of view or from the gradient computation perspective: the former via data misfit damping or parameter constraints [21]; the latter via improved MS gradient quality, through more accurate MSFV solutions [26,57]. An iterative MSFV gradient computation, following the solution strategy proposed by [26], could allow for additional error control over the gradient computation.

Final remarks
Our numerical experiments indicate that the presented method has the potential to outperform strategies that rely on upscaling/downscaling of model responses/observed data. An important result is the ability of our MS data assimilation strategy to closely reproduce the reference fine-scale uncertainty quantification results. Applications to more complex cases, and to different types of assimilation problems, should give more insight into the computational and methodological advantages of MS data assimilation, as indicated by the results of the simple example addressed in our study. Clearly, larger-scale tests, with more complex heterogeneity, are required to further quantify these potential benefits of MS data assimilation. Our paper demonstrates how to consistently formulate such an MS data assimilation strategy, in particular in combination with adjoint-based techniques to efficiently obtain MS gradient information, in an algebraic framework which allows for implementation in existing computational platforms.

[Fig. 11 caption: RML probability density functions for 100 permeability realizations randomly chosen from the original ensemble of 1000 realizations (see Fig. 3). Red curves: prior permeability distributions; green curves: conditioned permeability distributions; blue curves: reference ("true") permeability distribution. a Constant interpolation downscaling (R^T), b arithmetic average upscaling (M), c MS prolongation operator downscaling (P), d MS prolongation operator upscaling (P†), e the multiscale data assimilation strategy, and f the reference fine-scale case.]