A sparse matrix formulation of model-based ensemble Kalman filter

We introduce a computationally efficient variant of the model-based ensemble Kalman filter (EnKF). We propose two changes to the original formulation. First, we phrase the setup in terms of precision matrices instead of covariance matrices, and introduce a new prior for the precision matrix which ensures that it is sparse. Second, we propose to split the state vector into several blocks and formulate an approximate updating procedure for each of these blocks. In a simulation example we study the computational speedup and the approximation error resulting from the proposed approach. The speedup is substantial for high-dimensional state vectors, allowing the proposed filter to be run on much larger problems than is possible with the original formulation. In the simulation example the approximation error resulting from the introduced block updating is negligible compared to the Monte Carlo variability inherent in both the original and the proposed procedures.


Introduction
State-space models are frequently used in many applications. Examples of application areas are economics (Creal, 2011; Chan and Strachan, 2020), weather forecasting (Houtekamer and Zhang, 2016; Hotta and Ota, 2021), signal processing (Loeliger et al, 2007) and neuroscience (Smith and Emery, 2003). A state-space model consists of an unobserved latent discrete-time Markov process {x_t} and a related observed process {y_t}, where y_t gives information about x_t, and the y_t's are assumed to be conditionally independent given the {x_t} process. Given observed values y_1, ..., y_t the goal is to do inference about one or more of the x_t's. In this article the focus is on the filtering problem, where the goal is to find the conditional distribution of x_t given observations y_{1:t} = (y_1, ..., y_t). The filtering problem is also known as sequential data assimilation and online inference.
The Markov structure in the specification of the state-space model allows the filtering problem to be solved recursively. Having available a solution of the filtering problem at time t−1, i.e. the distribution p(x_{t−1}|y_{1:t−1}), one can first use this filtering distribution to find the one-step forecast distribution p(x_t|y_{1:t−1}), which one in turn can combine with the observed data at the next time step, y_t, to find p(x_t|y_{1:t}). The identification of the forecast distribution is often termed the forecast or prediction step, whereas the process of finding the filtering distribution is called the update or analysis step. If a linear-Gaussian model is assumed, analytical solutions are available for both the forecast and update steps, and the resulting procedure is known as the Kalman filter (Kalman, 1960; Kalman and Bucy, 1961). Another situation where an analytical solution is available for the forecast and update steps is when x_t is a vector of categorical or discrete variables. In essentially all other situations, however, analytical solutions are not available and one has to resort to approximate solutions. Ensemble-based methods are here the most popular choice, where each distribution of interest is represented by a set of particles, or an ensemble of realisations, from the distribution of interest. In ensemble-based methods one also alternates between a forecast and an update step. The forecast step is straightforward to implement without approximations in ensemble methods, whereas the update step is challenging. In the update step an ensemble of realisations, x_t^(1), ..., x_t^(M), from the forecast distribution p(x_t|y_{1:t−1}) is available and the task is to update each realisation x_t^(m) to a corresponding realisation x̃_t^(m) from the filtering distribution p(x_t|y_{1:t}). The forecast distribution p(x_t|y_{1:t−1}) serves as a prior in the update step, and the filtering distribution p(x_t|y_{1:t}) is the resulting posterior. In the following we therefore refer to the ensemble of realisations from the forecast distribution as the prior ensemble and the ensemble of realisations from the filtering distribution as the posterior ensemble.
There are two popular classes of ensemble filtering methods, particle filters (Doucet et al, 2001; Doucet and Johansen, 2011) and ensemble Kalman filters (EnKFs) (Burgers et al, 1998; Evensen, 2009). Particle filters are based on importance sampling and resampling, and have the advantage that they can be shown to converge to the correct filtering solution in the limit when the number of particles goes towards infinity. However, in applications with high-dimensional state vectors, x_t, particle filters typically collapse when run with a finite number of particles. The updating procedure in EnKFs is based on a linear-Gaussian model, but experience in applications is that EnKFs produce good results also in situations where the linear-Gaussian assumptions are not fulfilled. However, except when the linear-Gaussian model is correct, the EnKF is only approximate, even in the limit when the number of ensemble elements goes towards infinity. Most of the EnKF literature is in applied journals with little focus on the underlying statistical theory, but some publications have also appeared in statistical journals (Saetrom and Omre, 2013; Katzfuss et al, 2016, 2020; Loe and Tjelmeland, 2020, 2022).
In the EnKF the update step consists of two parts. First a prior covariance matrix is established based on the prior ensemble, which thereafter is used to modify each prior ensemble element to a corresponding posterior element. Many variants of the original EnKF algorithm of Evensen (1994) have been proposed and studied in the literature, see for example Evensen (2009) and Houtekamer and Zhang (2016) and references therein. In particular, the original algorithm of Evensen (1994) is known to severely underestimate the uncertainty, and much attention has focused on how to correct for this. The most frequent approach to this problem is the rather ad hoc solution of variance inflation, see the discussion in Luo and Hoteit (1997). Houtekamer and Mitchell (1997) identified the source of the problem to be inbreeding: the same prior ensemble elements that are first used to establish the prior covariance matrix are thereafter modified into posterior ensemble elements. To solve this problem, Houtekamer and Mitchell (1997) proposed to split the prior ensemble in two, where the prior ensemble elements in one part are used to establish a prior covariance matrix, which is used when modifying the prior ensemble elements in the other part into corresponding posterior elements. As discussed in Houtekamer and Zhang (2016), this cross-validation setup was later generalised to a situation where the prior ensemble is split into more than two parts. Myrseth et al (2013) propose to use a resampling approach to solve the inbreeding problem. Another issue with the original EnKF setup of Evensen (2009) that has received attention in the literature is that the empirical covariance matrix of the prior ensemble is used as the prior covariance matrix, thereby effectively ignoring the associated estimation uncertainty. To cope with this problem, Omre and Myrseth (2010) set the problem of identifying the prior covariance matrix in a Bayesian setting. To obtain an analytically tractable posterior distribution for the mean vector and the covariance matrix of the prior ensemble, the natural conjugate normal inverse Wishart prior is adopted. For each ensemble element, a sample from the resulting normal inverse Wishart posterior distribution is generated and used in the updating of that ensemble element. Corresponding Bayesian schemes for establishing the prior covariance matrix are later adopted in Bocquet (2011), Bocquet et al (2015) and Tsyrulnikov and Rakitko (2017). Loe and Tjelmeland (2021) propose to base the updating step of the EnKF on an assumed Bayesian model for all variables involved. The updating step is defined by requiring it to be consistent with the assumed model, and at the same time it should be robust against modelling error. Both the need for a prior for the covariance matrix and the cross-validation approach discussed above come as necessary consequences of this setup. In addition, it entails that the new data y_t should be taken into account when generating the prior covariance matrix.
In the present article we adopt the model-based EnKF approach of Loe and Tjelmeland (2021), but our focus is to formulate a computationally efficient variant of this approach. To obtain this we make two important changes. First, in Loe and Tjelmeland (2021) the natural conjugate normal inverse Wishart prior is used for the mean vector and covariance matrix of the prior ensemble elements. When sampling from the corresponding posterior distribution this results in a covariance matrix that is full. To update a prior ensemble element to the corresponding posterior element, the covariance matrices must be used in a series of matrix operations, such as various matrix decompositions and multiplications with vectors. We rephrase the approach of Loe and Tjelmeland (2021) to use precision matrices, or inverse covariance matrices, instead of covariance matrices, and propose to use a Gaussian partially ordered Markov model (Cressie and Davidson, 1998) to construct a prior for the mean vector and the precision matrix of the prior ensemble. This prior is also conjugate for the assumed normal likelihood for the ensemble elements, and the resulting posterior is therefore analytically tractable. Moreover, with this prior distribution we are able to ensure that the precision matrices sampled from the resulting posterior distribution are band matrices. With a band structure in the precision matrix, some of the matrix operations in the updating of a prior ensemble element to the corresponding posterior element can be done more efficiently. However, some of the matrix operations, the singular value decompositions, do not benefit from this band structure. To make also this part of the procedure computationally more efficient, we propose an approximation which allows us to replace the singular value decomposition of one large matrix with singular value decompositions of several smaller matrices.
Compared to the updating procedure in the original EnKF setup of Evensen (1994), the updating procedure resulting from adopting the Bayesian scheme introduced in Omre and Myrseth (2010) is clearly computationally much more expensive. The increased computational demands come for two reasons. First, the Bayesian setup requires one covariance matrix to be generated for each ensemble element, whereas the original EnKF setup uses the same covariance matrix for all ensemble elements. Second, the covariance matrices generated by the original Bayesian setup of Omre and Myrseth (2010) are full and without any special structure, whereas the covariance matrix used in the original EnKF setup has a low rank which can be used to speed up the necessary matrix computations. In the present paper we resolve the second issue by adopting a prior model producing sparse precision matrices. Regarding the first issue, Loe and Tjelmeland (2021) demonstrated that with their model-based EnKF approach the generated ensemble gave a realistic representation of the uncertainty, whereas the original EnKF setup produced ensembles which underestimated the uncertainty. For small ensemble sizes the underestimation of the uncertainty when adopting the original EnKF setup is severe, whereas with large enough ensemble sizes it becomes negligible. To obtain a realistic representation of the uncertainty we therefore have two possible strategies. One can either adopt the model-based EnKF approach used in Loe and Tjelmeland (2021) and in the present paper, or one can use the original EnKF setup with a very large number of ensemble elements. Which of the two approaches is computationally more efficient depends on the computational cost of running the forward function. If the forward function is computationally expensive, so that this part dominates the total computing time of the filter, it is clearly most efficient to adopt the model-based EnKF procedure introduced in the present paper. If the computation time is not dominated by the running of the forward function, the picture is less clear. Depending on how many ensemble elements are necessary to get a sufficiently small underestimation of the uncertainty, the original EnKF with a large number of ensemble elements may be the computationally most efficient alternative.
This article is structured as follows. In Section 2 we first review some numerical algorithms for sparse matrices and some properties of the Gaussian density. In the same section we also present the state-space model. In Section 3 we rephrase the approach introduced in Loe and Tjelmeland (2021) to use precision matrices instead of covariance matrices. The new prior for the precision matrix is proposed in Section 4. In Section 5 we propose a computationally efficient approximation of the updating procedure presented in Section 3. We present results of simulation examples in Section 6, and finally we provide some closing remarks in Section 7.

Preliminaries
This section introduces material necessary to understand the approach proposed in later sections. We start by discussing some numerical properties of sparse matrices, and thereafter review how the multivariate Gaussian distribution can be formulated in terms of precision matrices. Lastly, we introduce the state-space model and provide a brief introduction to simulation-based techniques.

Numerical properties of sparse matrices
Suppose that x, y ∈ R^n, that Q ∈ R^{n×n} is a given symmetric positive definite band matrix with bandwidth p, where n ≫ p, that y is given, and that we want to solve the matrix equation

Qx = y (1)

with respect to x. Since we assume that Q is a symmetric positive definite matrix, we can solve the equation above using the Cholesky decomposition Q = LL^T, where L ∈ R^{n×n} is a lower triangular matrix. Due to the band structure of Q we know that L has lower bandwidth p, which in turn enables us to compute L with complexity O(np^2), see Algorithm 2.9 in Rue and Held (2005). Since we now are able to compute L efficiently, we can make use of Algorithm 2.1 in Rue and Held (2005) to solve (1) efficiently as well. In the following, we describe the algorithm step by step. First, we rewrite (1) as

LL^T x = y. (2)

If we define L^T x = v, we can solve (1) by first solving Lv = y for v and then solving L^T x = v for x. We can exploit the band structure of L to solve these equations efficiently. Using that L is lower triangular, we can solve Lv = y row by row, using "forward substitution" (Rue and Held, 2005, p. 32). We denote the (i, j)th entry of L as L_{i,j} and the ith element of y as y_i. The ith element of v, denoted v_i, is computed as

v_i = (y_i − Σ_{j=max(1, i−p)}^{i−1} L_{i,j} v_j) / L_{i,i}, i = 1, ..., n. (3)

Similarly, we can solve v = L^T x for x using "backward substitution",

x_i = (v_i − Σ_{j=i+1}^{min(n, i+p)} L_{j,i} x_j) / L_{i,i}, i = n, ..., 1. (4)

Notice that we compute the entries of x "backwards"; we first compute x_n and then move our way backwards to x_1. If we again assume n ≫ p, the computational complexity of forward and backward substitution is O(np). This means that the overall complexity of computing x using the presented approach is O(np^2). Note that solving (1) with the "brute force" approach, i.e. computing x = Q^{−1}y, has computational complexity O(n^3).
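As a concrete sketch of this solve, the routine below implements the Cholesky factorisation followed by banded forward and backward substitution with plain NumPy. The function name and test matrix are our own illustrations, not code from the article; with dense storage the asymptotic savings of band storage are of course not realised, but the band-limited loops mirror (3) and (4) exactly.

```python
import numpy as np

def solve_banded_spd(Q, y, p):
    """Solve Q x = y for symmetric positive definite Q with bandwidth p.

    Factorises Q = L L^T, then solves L v = y by forward substitution
    and L^T x = v by backward substitution, restricting each inner sum
    to the p entries allowed by the band structure.
    """
    n = y.shape[0]
    L = np.linalg.cholesky(Q)          # L inherits lower bandwidth p
    v = np.empty(n)
    for i in range(n):                 # forward substitution, row by row
        j0 = max(0, i - p)
        v[i] = (y[i] - L[i, j0:i] @ v[j0:i]) / L[i, i]
    x = np.empty(n)
    for i in range(n - 1, -1, -1):     # backward substitution
        j1 = min(n, i + p + 1)
        x[i] = (v[i] - L[i + 1:j1, i] @ x[i + 1:j1]) / L[i, i]
    return x
```

The result agrees with a dense solve, which can be used as a sanity check when experimenting with band widths.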
Backward substitution can also be used to sample efficiently from a Gaussian distribution when the precision matrix is a band matrix. Assume that z is a vector of standard normally distributed variables and that we want to simulate x from a Gaussian distribution with mean 0 and covariance matrix Q^{−1}. We can simulate x by solving

L^T x = z (5)

using backward substitution, as specified in (4). When Q is a band matrix with bandwidth p, where n ≫ p, the computational complexity is O(np^2). However, when Q is full, the computational complexity of simulating x is O(n^3).
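The sampling identity behind (5) is that the solution of L^T x = z has covariance L^{−T} L^{−1} = (L L^T)^{−1} = Q^{−1}. The sketch below (our own illustration, with a made-up tridiagonal Q) checks this empirically:

```python
import numpy as np

def sample_band_precision(mu, Q, rng):
    """Draw x ~ N(mu, Q^{-1}) via back substitution on L^T x = z.

    With Q = L L^T and z ~ N(0, I), x - mu = L^{-T} z has covariance
    L^{-T} L^{-1} = Q^{-1}, as required.  For a band matrix Q this is
    the O(n p^2) route described in the text.
    """
    L = np.linalg.cholesky(Q)
    z = rng.standard_normal(mu.shape[0])
    return mu + np.linalg.solve(L.T, z)   # back substitution on the upper triangular L^T
```

Averaging over many draws, the empirical covariance approaches Q^{−1}.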

Gaussian distribution phrased with precision matrices
Let x ∈ R^n have a multivariate Gaussian distribution with mean µ ∈ R^n and precision matrix Q ∈ R^{n×n}. We let N(x; µ, Q) denote the density function of this distribution, i.e.

N(x; µ, Q) = (2π)^{−n/2} |Q|^{1/2} exp(−(1/2)(x − µ)^T Q (x − µ)).

Defining A ⊂ {1, ..., n} and B = {1, ..., n} \ A, we can partition x, µ and Q into blocks,

x = (x_A, x_B), µ = (µ_A, µ_B), Q = [Q_AA, Q_AB; Q_BA, Q_BB].

According to Theorem 2.5 in Rue and Held (2005), we then have that

x_A | x_B ∼ N(µ_A − Q_AA^{−1} Q_AB (x_B − µ_B), Q_AA).

Similarly,

x_B ∼ N(µ_B, Q_BB − Q_BA Q_AA^{−1} Q_AB).

The last expression can be derived by picking out the parts of the mean vector µ and covariance matrix Q^{−1} that correspond to x_B. When we have found the covariance matrix of x_B, we find the precision matrix by inversion. From the first expression we see that computing the precision matrix for x_A|x_B is particularly easy when the Gaussian distribution is formulated with precision matrices.
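Both identities are easy to verify numerically. The following self-contained check (our own illustration, with a randomly generated positive definite Q) confirms that the conditional precision of x_A given x_B is the block Q_AA, and that the marginal precision of x_B is the Schur complement Q_BB − Q_BA Q_AA^{−1} Q_AB:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A_idx, B_idx = [0, 1], [2, 3, 4]

G = rng.standard_normal((n, n))
Q = G @ G.T + n * np.eye(n)        # random symmetric positive definite precision
Sigma = np.linalg.inv(Q)           # corresponding covariance matrix

# Conditioning done the covariance way: Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA.
Sab = Sigma[np.ix_(A_idx, B_idx)]
Sbb = Sigma[np.ix_(B_idx, B_idx)]
cond_cov = Sigma[np.ix_(A_idx, A_idx)] - Sab @ np.linalg.inv(Sbb) @ Sab.T

# Precision way: the conditional precision is simply the block Q_AA ...
Q_AA = Q[np.ix_(A_idx, A_idx)]
# ... and the marginal precision of x_B is the Schur complement of Q_AA.
Q_AB = Q[np.ix_(A_idx, B_idx)]
marg_prec = Q[np.ix_(B_idx, B_idx)] - Q_AB.T @ np.linalg.inv(Q_AA) @ Q_AB
```

Here `np.linalg.inv(cond_cov)` matches `Q_AA`, and `marg_prec` matches the inverse of the covariance block `Sbb`, illustrating why the precision formulation makes conditioning cheap.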

State-space model
A state-space model (Shumway and Stoffer, 2016; Brockwell and Davis, 1990) consists of a set of latent variables, denoted {x_t}_{t=1}^T, x_t ∈ R^{n_x}, and a set of observations {y_t}_{t=1}^T, y_t ∈ R^{n_y}. The latent variables follow a first order Markov chain with initial distribution p(x_1) and transition probabilities p(x_t|x_{t−1}), t ≥ 2. That is, the joint distribution for x_{1:T} = (x_1, ..., x_T) can be written as

p(x_{1:T}) = p(x_1) ∏_{t=2}^T p(x_t|x_{t−1}).

In addition, each observation y_t is considered to be conditionally independent of the remaining observations, given x_t. The joint likelihood for the observations y_{1:T} = (y_1, ..., y_T) can then be formulated as

p(y_{1:T}|x_{1:T}) = ∏_{t=1}^T p(y_t|x_t).

A state-space model is illustrated by a directed acyclic graph (DAG) in Figure 1, where each variable is represented with a node and the edges symbolise the dependencies between the variables. The filtering distributions can be computed recursively through the two steps

p(x_t|y_{1:t−1}) = ∫ p(x_t|x_{t−1}) p(x_{t−1}|y_{1:t−1}) dx_{t−1} (12)

and

p(x_t|y_{1:t}) ∝ p(y_t|x_t) p(x_t|y_{1:t−1}). (13)

The first expression is the prediction step and gives the forecast distribution p(x_t|y_{1:t−1}), while the second expression is called the update step and yields the filtering distribution p(x_t|y_{1:t}). Note that the filtering distribution is computed through Bayes' rule, where p(x_t|y_{1:t−1}) is the prior, p(y_t|x_t) is the likelihood and p(x_t|y_{1:t}) is the posterior. In the following we therefore use the terms prior and forecast, and the terms posterior and filtering, interchangeably. In this article, our main focus is on the update step.
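For a scalar linear-Gaussian model both steps have closed form, and one recursion of the resulting filter can be sketched in a few lines. The function and parameter values below are our own illustration, not taken from the article:

```python
def kalman_step(m, P, a, q, y, h, r):
    """One prediction and update step for the scalar linear-Gaussian model
    x_t = a x_{t-1} + N(0, q),  y_t = h x_t + N(0, r),
    where p(x_{t-1} | y_{1:t-1}) = N(m, P)."""
    # Prediction step: p(x_t | y_{1:t-1}) = N(m_pred, P_pred).
    m_pred = a * m
    P_pred = a * P * a + q
    # Update step (Bayes' rule): p(x_t | y_{1:t}) = N(m_post, P_post).
    S = h * P_pred * h + r          # innovation variance
    K = P_pred * h / S              # Kalman gain
    m_post = m_pred + K * (y - h * m_pred)
    P_post = (1.0 - K * h) * P_pred
    return m_pred, P_pred, m_post, P_post
```

For example, starting from N(0, 1) with a = h = q = 1, r = 2 and observation y = 4, the forecast is N(0, 2) and the filtering distribution is N(2, 1).

```python
print(kalman_step(0.0, 1.0, 1.0, 1.0, 4.0, 1.0, 2.0))  # (0.0, 2.0, 2.0, 1.0)
```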
When the state-space model is linear and Gaussian, the expressions can be solved analytically through the Kalman filter (Kalman, 1960). However, the prediction and update steps in (12) and (13) are generally speaking not feasible to solve analytically. Hence approximate methods are used. A common approach is to use simulation-based techniques, where a set of realisations, called an ensemble, is used to explore the state-space model by moving the realisations forward in time according to the state-space model. The filtering and forecast distributions are then represented by an ensemble at each time step, which is initially sampled from p(x_1). The following paragraph provides an overview of how this is done for one iteration.
Assume that a set of M independent realisations {x̃_{t−1}^(1), ..., x̃_{t−1}^(M)} from p(x_{t−1}|y_{1:t−1}) is available at time t. If we are able to simulate from the forward model p(x_t|x_{t−1}), we can obtain M independent realisations from p(x_t|y_{1:t−1}) by simulating from p(x_t|x̃_{t−1}^(m)) independently for m = 1, ..., M. This is the prediction step, and it can usually be performed without any approximations. Next, we use the prediction, or prior, ensemble {x_t^(1), ..., x_t^(M)} to obtain samples from the filtering distribution p(x_t|y_{1:t}), which is often called a posterior ensemble. This can be done by conditioning the samples from the forecast distribution, {x_t^(1), ..., x_t^(M)}, on the new observation y_t. This step is called the update step and is generally not analytically feasible. Hence approximate methods are necessary. In the following section, we present a procedure that enables us to carry out the update step.
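The prediction step is simply M independent draws from the forward model, one per ensemble member. A minimal sketch, where the AR(1)-type forward model is a made-up assumption for illustration and not the model used in the article's simulation example:

```python
import numpy as np

def forecast_step(ensemble, forward_sample, rng):
    """Map a filtering ensemble at time t-1 to a forecast ensemble at time t
    by propagating each member independently through p(x_t | x_{t-1})."""
    return [forward_sample(x, rng) for x in ensemble]

def ar1_forward(x, rng):
    """Illustrative forward model x_t = 0.9 x_{t-1} + N(0, I)."""
    return 0.9 * x + rng.standard_normal(x.shape)

rng = np.random.default_rng(2)
prior_ensemble = [rng.standard_normal(4) for _ in range(20)]          # M = 20, n_x = 4
forecast_ensemble = forecast_step(prior_ensemble, ar1_forward, rng)
```

The ensemble size and state dimension are preserved; only the update step, treated next, requires approximations.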

Model-based EnKF
In this section we start by reviewing the model-based EnKF framework introduced in Loe and Tjelmeland (2021). The focus in our presentation is on the underlying model framework, the criterion used for selecting the particular chosen update, and on the resulting updating procedure. We do not include the mathematical derivations leading to the computational procedure. Moreover, we phrase the framework in terms of precision matrices, whereas Loe and Tjelmeland (2021) use covariance matrices.
The focus in the section is on how to use a prior ensemble {x_t^(1), ..., x_t^(M)} to update one of these prior ensemble elements, number m say, to a corresponding posterior ensemble element x̃_t^(m). All the variables involved in this operation are associated with the same time t. To simplify the notation we therefore omit the time subscript t in the following discussion. So we write {x^(1), ..., x^(M)} instead of {x_t^(1), ..., x_t^(M)} for the available prior ensemble, we write x̃^(m) instead of x̃_t^(m) for the generated posterior ensemble element number m, we write x instead of x_t for the latent variable at time t, and we write y instead of y_t for the new data that becomes available at time t.

Assumed Bayesian model
In model-based EnKF the updating of a prior ensemble element x^(m) to the corresponding posterior ensemble element x̃^(m) is based on an assumed model. The dependence structure of the assumed model is illustrated in the DAG in Figure 2. It should be noted that the assumed model is not supposed to be correct; it is just used as a means to formulate a reasonable and consistent updating procedure. To stress this we follow the notation in Loe and Tjelmeland (2021) and use f as the generic symbol for all densities associated with the assumed model, whereas we continue to use p to denote the corresponding correct (and typically unknown) densities. We use subscripts on f to specify what stochastic variables the density relates to. For example, f_{y|x}(y|x) is the conditional density for the new observations y given the latent variable x.
In the assumed model we let the prior ensemble elements x^(1), ..., x^(M) and the unobserved latent variable x at time t be conditionally independent and identically Gaussian distributed with a mean vector µ and a precision matrix Q, i.e.

f_{x^(i)|θ}(x^(i)|θ) = N(x^(i); µ, Q), i = 1, ..., M, (14)

and

f_{x|θ}(x|θ) = N(x; µ, Q), (15)

where θ = (µ, Q).

Fig. 2 DAG representation of the assumed Bayesian model for updating the mth realisation.

For the parameter θ of this Gaussian density we assume some prior distribution f_θ(θ). Loe and Tjelmeland (2021) assume a normal inverse Wishart prior for (µ, Q^{−1}), which implies that Q is full, whereas we want to adopt a prior which ensures that Q is a band matrix. In Section 4.1 we detail the prior we are using. The last element of the assumed model is to let the new data y be conditionally independent of x^(1), ..., x^(M) and θ given x, and

f_{y|x}(y|x) = N(y; Hx, R), (16)

where H is the observation matrix and R is the precision matrix of the observation noise. Based on the assumed model the goal is to construct a procedure for generating an updated version x̃^(m) of x^(m). One should first generate a θ = (µ, Q) and thereafter x̃^(m). In addition to being a function of x^(m), it is natural to allow x̃^(m) to depend on the generated θ and the new data y, as also indicated in the DAG in Figure 2. The corresponding conditional distribution for x̃^(m) we denote by q(x̃^(m)|x^(m), θ, y). The updated x̃^(m) is to be used as an (approximate) sample from p(x_t|y_{1:t}), which in the ensemble-based setting we are considering here is represented by p(x_t|x_t^(1), ..., x_t^(M), y_t). However, as we want to use x^(m) as a source of randomness when generating x̃^(m), the x^(m) must be removed from the conditioning set, so one should instead consider p(x_t|x_t^(1), ..., x_t^(m−1), x_t^(m+1), ..., x_t^(M), y_t). Under the assumed model this density is equal to f_{x|z^(m),y}(x|z^(m), y) when using the shorthand notation z^(m) = (x^(1), ..., x^(m−1), x^(m+1), ..., x^(M)). Thus, we should construct q(x̃^(m)|x^(m), θ, y) so that

∫∫ q(x̃|x^(m), θ, y) f_{x^(m),θ|z^(m),y}(x^(m), θ|z^(m), y) dx^(m) dθ = f_{x|z^(m),y}(x̃|z^(m), y) (17)

holds for all values of x̃, z^(m) and y. Loe and Tjelmeland (2021) show that (17) is fulfilled if x̃^(m) is generated by first sampling θ = (µ, Q) from f_{θ|z^(m),y}(θ|z^(m), y) and thereafter sampling x̃^(m) according to a q(x̃^(m)|x^(m), θ, y) defined by

x̃^(m) = µ + K(y − Hµ) + B(x^(m) − µ) + ε^(m), (18)

where ε^(m) is independent of everything else and generated from a zero-mean Gaussian distribution with a covariance matrix S, B is a matrix connected to the positive semidefinite S by the relation

B Q^{−1} B^T + S = (I − KH) Q^{−1}, (19)

and K is the Kalman gain matrix

K = Q^{−1} H^T (H Q^{−1} H^T + R^{−1})^{−1}. (20)

We note in passing that by inserting for K in (19) and thereafter applying the Woodbury identity (Woodbury, 1950), one gets

B Q^{−1} B^T + S = (Q + H^T R H)^{−1}. (21)
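To see how the pieces of the update fit together, the sketch below implements one valid (but deliberately simple) choice of B with S = 0: taking B = C Lᵀ, where the posterior covariance factorises as (Q + HᵀRH)⁻¹ = C Cᵀ and Q = L Lᵀ, satisfies the required relation B Q⁻¹ Bᵀ + S = (Q + HᵀRH)⁻¹. All matrices here are made up for illustration, and this B is not the optimal minimum-distance choice derived in Section 3.2.

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, n_y = 4, 2
# Made-up model quantities: prior N(mu, Q^{-1}), observation precision R.
mu = np.zeros(n_x)
G = rng.standard_normal((n_x, n_x))
Q = G @ G.T + n_x * np.eye(n_x)
H = rng.standard_normal((n_y, n_x))
R = np.diag([2.0, 3.0])
y = np.ones(n_y)

Q_post = Q + H.T @ R @ H                       # posterior precision under the assumed model
Sigma_post = np.linalg.inv(Q_post)
K = np.linalg.solve(Q_post, H.T @ R)           # Kalman gain, precision form

# One valid B with S = 0: B = C L^T, since then B Q^{-1} B^T = C C^T = Sigma_post.
C = np.linalg.cholesky(Sigma_post)
L = np.linalg.cholesky(Q)
B = C @ L.T

# Applying the update to draws from the prior reproduces the Gaussian posterior law.
xs = rng.multivariate_normal(mu, np.linalg.inv(Q), size=200000)
innov = K @ (y - H @ mu)                       # same shift for all members
xt = mu + innov + (xs - mu) @ B.T
```

Empirically, the updated draws `xt` have mean µ + K(y − Hµ) and covariance (Q + HᵀRH)⁻¹, confirming that any B satisfying the relation yields the same distribution under the assumed model.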

Optimality criterion
When applying the updating equation (18) we have freedom in the choice of the matrices B and S. The restrictions are that S must be positive semidefinite and that B and S should be related as specified by (21).
Under the assumed model all choices of B and S fulfilling the given restrictions are equally good, as they all generate an x̃^(m) from the same distribution. When recognising that the assumed model is wrong, however, all solutions are no longer equally good. So we should choose a solution which is robust against the assumptions made in the assumed model. Loe and Tjelmeland (2021) formulate a robust solution as one where x^(m) is changed as little as possible when forming x̃^(m), under the condition that B and S satisfy (21). The intuition is that this should allow non-Gaussian properties in x^(m) to be transferred to x̃^(m). Mathematically the criterion is formulated as minimising the expected squared Euclidean distance between x^(m) and x̃^(m). Thus, we should minimise

E[||x̃^(m) − x^(m)||^2] (22)

with respect to B and S under the restriction (21), where the expectation is with respect to the joint distribution of x^(m) and x̃^(m) under the assumed model. Note that Loe and Tjelmeland (2021) consider a slightly more general solution by using a Mahalanobis distance. Loe and Tjelmeland (2021) show that (22) is minimised under the specified restrictions when S = 0 and B is given by (23), where U and V are orthogonal matrices and D and Λ are diagonal matrices given by two singular value decompositions, and P and F are orthogonal matrices given by a further singular value decomposition.

Algorithm 1 Summary of the resulting updating procedure for computing the posterior ensemble x̃^(1), ..., x̃^(M) from the prior ensemble x^(1), ..., x^(M).

Resulting updating procedure
The resulting procedure for updating a prior ensemble x^(1), ..., x^(M) to the corresponding posterior ensemble x̃^(1), ..., x̃^(M) is summarised by the pseudocode in Algorithm 1.
In the following sections our focus is on how to make this updating procedure computationally efficient when the dimensions of the state vector and corresponding observation vector are large. First, in Section 4, we propose a new prior for θ which enables efficient sampling of θ = (µ, Q) from the corresponding posterior f_{θ|z^(m),y}(θ|z^(m), y). The generated Q is then a sparse matrix, which also limits the memory usage necessary to store it, since we of course only need to store the non-zero elements. However, sparsity of Q does not influence the computational efficiency of the singular value decompositions in Steps 2, 3 and 5 of Algorithm 1. One should note that we may rephrase Steps 2 and 3 to use Cholesky decompositions instead, since Q and Q + H^T RH are symmetric and positive definite matrices. Under reasonable assumptions Q + H^T RH is also a sparse matrix, so these two steps can be done efficiently. The Z in Step 5, however, is in general neither a symmetric positive definite matrix nor sparse. To introduce Cholesky decompositions in Steps 2 and 3 will therefore not change the order of the computational complexity of the procedure. Instead of substituting the singular value decompositions in Steps 2 and 3 with Cholesky decompositions, we therefore in Section 5 propose an approximation of Steps 2 to 6 by splitting the state vector into a series of blocks and running Steps 2 to 6 for each of the blocks separately.
Prior model and sampling of θ = (µ, Q)

In this section we first formulate the new prior for θ = (µ, Q), where Q is restricted to be a band matrix. Thereafter we formulate a computationally efficient procedure to generate a sample from the resulting f_{θ|z^(m),y}(θ|z^(m), y). We start by formulating the prior.

Prior for θ = (µ, Q)
To obtain a band structure for the precision matrix Q we restrict the assumed Gaussian distributions in (14) and (15) to be a Gaussian partially ordered Markov model, a Gaussian POMM (Cressie and Davidson, 1998). To be able to give a mathematically precise definition of the Gaussian POMM we first introduce some notation.
We let x_k denote the kth element of x, so x = (x_1, ..., x_{n_x}). To each element x_k of x we associate a sequential neighbourhood Λ_k ⊆ {1, ..., k − 1}, and use the notation introduced in Section 2.2 to denote the elements in x associated with Λ_k by x_{Λ_k}. The number of elements in Λ_k we denote by |Λ_k|. Moreover, we let Λ_k(1) denote the smallest element in the set Λ_k, we let Λ_k(2) be the second smallest element in Λ_k, and so on until Λ_k(|Λ_k|), which is the largest element in Λ_k. We let the distribution of x be specified by the two parameter vectors η = (η_1, ..., η_{n_x}) and φ = (φ_1, ..., φ_{n_x}), where for each k = 1, ..., n_x we have that η_k ∈ R^{|Λ_k|+1} and φ_k > 0 is a scalar. With this notation the Gaussian POMM is specified by letting x_k given x_{Λ_k} be Gaussian with mean η_{k,1} + Σ_{j=1}^{|Λ_k|} η_{k,j+1} x_{Λ_k(j)} and variance φ_k, independently for k = 1, ..., n_x, so that η_k contains an intercept and the |Λ_k| regression coefficients on the sequential neighbours. It should be noted that θ = (µ, Q) in N(x; µ, Q) is uniquely specified by η and φ. If the number of sequential neighbours is small, the resulting precision matrix Q becomes sparse, see Appendices A and B for a detailed discussion. We can therefore specify a prior for θ by specifying a prior for η and φ, which we choose as conjugate to the Gaussian POMM just defined. More specifically, we first assume the elements in φ to be independent, and each element φ_k to be inverse Gamma distributed with parameters α_k and β_k. Next, we assume the elements of η to be conditionally independent and Gaussian distributed given φ, where the parameters of these Gaussian distributions are hyperparameters that have to be set.
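Writing the sequential specification as x = a + Ax + e, with A strictly lower triangular (nonzero only on the sequential neighbourhoods) and independent errors e_k of variance φ_k, gives the joint precision Q = (I − A)ᵀ diag(1/φ) (I − A). The helper below is our own sketch of this construction under that factorisation (intercepts only shift the mean, so they are omitted), not code from the article; it illustrates how small sequential neighbourhoods give a sparse, band-structured Q. The appendices referenced in the text give the exact construction.

```python
import numpy as np

def pomm_precision(n, nbrs, coefs, phi):
    """Joint precision of a Gaussian sequential (POMM-type) model in which
    x_k | x_{Lambda_k} has mean sum_j coefs[k][j] * x[Lambda_k(j)] and
    variance phi[k].

    Writing x = A x + e with Var(e_k) = phi[k] gives
    Q = (I - A)^T diag(1/phi) (I - A), sparse when neighbourhoods are small.
    """
    A = np.zeros((n, n))
    for k, (lam, eta) in enumerate(zip(nbrs, coefs)):
        A[k, lam] = eta                    # only entries j < k are ever set
    ImA = np.eye(n) - A
    return ImA.T @ np.diag(1.0 / np.asarray(phi)) @ ImA

# AR(1)-type example: each element depends on its predecessor only.
Q = pomm_precision(4, [[], [0], [1], [2]],
                   [[], [0.5], [0.5], [0.5]], [1.0, 1.0, 1.0, 1.0])
```

For this example Q is tridiagonal: the entries beyond the first off-diagonal vanish, exactly the sparsity the prior is designed to produce.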
Sampling from f_{θ|z^(m),y}(θ|z^(m), y)

To sample from f_{θ|z^(m),y}(θ|z^(m), y) we adopt the same general strategy as proposed in Loe and Tjelmeland (2021). We include the underlying state vector at time t, x, as an auxiliary variable and simulate (θ, x) from

f_{θ,x|z^(m),y}(θ, x|z^(m), y). (30)

By thereafter ignoring the simulated x we have a sample of θ from the desired distribution. To simulate from the joint distribution f_{θ,x|z^(m),y}(θ, x|z^(m), y) we adopt a two-block Gibbs sampler, alternating between drawing x and θ from the full conditionals f_{x|θ,z^(m),y}(x|θ, z^(m), y) and f_{θ|x,z^(m),y}(θ|x, z^(m), y), respectively. We initialise the Gibbs sampler by setting x equal to the mean of the prior ensemble elements, x = (1/M) Σ_{i=1}^M x^(i). This initial value should be centrally located in f_{x|z^(m),y}(x|z^(m), y), and since the Gibbs sampler we are using only consists of two blocks, we should expect it to converge very fast. So just a few iterations should suffice.
From (30) we get the full conditional for x,

f_{x|θ,z^(m),y}(x|θ, z^(m), y) ∝ f_{x|θ}(x|θ) f_{y|x}(y|x). (31)

It is straightforward to show that this is a Gaussian distribution with precision matrix Q̂ = Q + H^T R H and mean

µ̂ = µ + Q̂^{−1} H^T R (y − Hµ). (32)

As discussed in Appendices A and B, Q becomes sparse whenever the number of sequential neighbours is small. If x represents values in a two-dimensional lattice and the sequential neighbourhoods Λ_k are chosen as translations of each other, as we use in the simulation example in Section 6, Q is a band matrix, again see the discussion in Appendix B. Assuming also R and H to be band matrices, the product H^T R H can be computed efficiently and is also a band matrix. Thereby Q̂ is also a band matrix, so its Cholesky decomposition can be computed efficiently as discussed in Section 2.1. The band structures of H and R can be used to evaluate the right hand side of (32) efficiently, and in addition computational efficiency can be gained by computing the product H^T R(y − Hµ) in the right order. In general, multiplying two matrices C ∈ R^{u×v} and D ∈ R^{v×w} has computational complexity O(uvw). Hence, we should first compute R(y − Hµ) before multiplying the result by H^T.
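As a concrete sketch (with made-up band matrices and scipy.sparse, not code from the article), both the conditional precision Q + HᵀRH and the right-hand side HᵀR(y − Hµ) can be formed without ever creating a dense matrix, and the multiplication order keeps every intermediate a vector:

```python
import numpy as np
import scipy.sparse as sp

n = 200
# Illustrative band-structured prior precision Q (bandwidth 1), a sparse
# observation matrix H observing every other element, and a diagonal
# observation precision R.  All matrices here are assumptions for the sketch.
Q = sp.diags([-0.5 * np.ones(n - 1), 2.0 * np.ones(n), -0.5 * np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csr")
H = sp.eye(n, format="csr")[::2]               # observe elements 0, 2, 4, ...
R = sp.diags(4.0 * np.ones(H.shape[0]), format="csr")

# Conditional precision of x given y: Q + H^T R H stays sparse and banded.
Q_hat = (Q + H.T @ R @ H).tocsr()

# Right-hand side H^T R (y - H mu): multiply matrix-vector from the inside out,
# so every intermediate result is a vector, never a matrix-matrix product.
mu = np.zeros(n)
y = np.ones(H.shape[0])
rhs = H.T @ (R @ (y - H @ mu))
```

A banded Cholesky factorisation of `Q_hat` then yields both the conditional mean and samples, as in Section 2.1.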
Having Q̂ and its Cholesky decomposition we can both compute µ̂ and generate a sample from the Gaussian full conditional efficiently, as discussed in Section 2.1. From (30) we also get the full conditional for θ, f_{θ|x,z^(m),y}(θ|x, z^(m), y). To simulate from this full conditional we exploit that θ = (µ, Q) is uniquely given by the parameters φ and η of the Gaussian POMM prior, which means that we can first simulate values for φ and η from f_{φ,η|x,z^(m)}(φ, η|x, z^(m)) and thereafter use the generated φ and η to compute the corresponding µ and Q.
In Appendix C we study in detail the resulting f_{φ,η|x,z^(m)}(φ, η|x, z^(m)) and show that it has the same form as the corresponding prior f_{φ,η}(φ, η), but with updated parameters. More specifically, the elements of φ are conditionally independent given x and z^(m), with the distribution of each element given by the inverse gamma density in (35), with parameters defined in (36) and (37), and

χ^(m),k = (x^(1),k, ..., x^(m−1),k, x^(m+1),k, ..., x^(M),k, x^k).    (41)

The distribution for η given φ, x and z^(m) becomes the Gaussian given in (44). In particular we see that it is easy to sample from f_{φ,η|x,z^(m)}(φ, η|x, z^(m)) by first sampling the elements of φ independently according to (35) and thereafter generating the elements of η according to (44). Having samples of φ and η we can thereafter compute the corresponding µ and Q as detailed in Appendix A.

Block update
Section 3 presents a set of updating procedures that allow us to update a prior realisation into a posterior realisation, where the posterior realisation takes the observation in the current time step into account. In Section 3.2 we found the optimal filter according to our chosen criterion. However, as also discussed in Section 3.3, parts of this procedure are computationally demanding. In the following we introduce the approximate, but computationally more efficient, procedure for Steps 2 to 7 in Algorithm 1 that we mentioned in Section 3.3. So the situation considered is that we want to update a prior realisation x^(m) and that we have already generated a mean vector µ and a sparse precision matrix Q. The goal is now to define an approximation to the vector x̃^(m) given in Step 7 of Algorithm 1. The strategy is then to use this approximation, which we denote by x^(m),Block, instead of x̃^(m). In the following we assume the matrices H and R to be sparse. The complete block updating procedure we propose is summarised in Algorithm 2. In the following we motivate and explain the various steps, and define the notation used.
Our construction of x^(m),Block is general, but to motivate it we consider a situation where the elements of the state vector are related to the nodes in a two-dimensional (2D) lattice, with r rows and s columns say, and where the correlation between two elements of the state vector decays with the distance between the corresponding nodes. We assume the nodes in the lattice to be numbered in the lexicographical order, so node (k, ℓ) in the lattice is related to the value of element number (k − 1)·s + ℓ of the state vector. We use S = {(k, ℓ) : k = 1, ..., r, ℓ = 1, ..., s} to denote the set of all indices of the elements in the state vector, where we for the 2D lattice example have n_x = rs.
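The lexicographical numbering above can be sketched as a pair of index maps; the function names are our own, and the indices are 1-based as in the paper.

```python
def lex_index(k, l, s):
    """Map lattice node (k, l) (1-based row k and column l, lattice with s
    columns) to the 1-based position of the corresponding element in the
    state vector, following the lexicographical numbering of the paper."""
    return (k - 1) * s + l

def lex_node(i, s):
    """Inverse map: 1-based state-vector position i back to the node (k, l)."""
    q, rem = divmod(i - 1, s)
    return q + 1, rem + 1
```

For example, with s = 100 columns, node (50, 80) corresponds to element (50 − 1)·100 + 80 = 4980 of the state vector.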
The first step in the construction of x^(m),Block is to define a partition C_1, ..., C_B of S. The sets C_b, b = 1, ..., B should be chosen so that elements of the state vector corresponding to elements in the same set C_b are highly correlated. In the 2D lattice example the most natural choice would be to let each C_b be a block of consecutive nodes in the lattice, for example a block of r_C × s_C nodes. Such a choice of the partition is visualised in Figure 3(a). One should note that a similar partition of S is also used in the domain localisation version of EnKF (Janjic et al, 2011). However, the motivation for the partition in domain localisation is mainly to eliminate, or at least reduce, the effect of spurious correlations (Haugen and Evensen, 2002), whereas in our setup the motivation is merely to reduce the computational complexity of the updating procedure. In particular, if the dimension of the state vector, n_x, is sufficiently small we recommend not to use the block updating procedure at all; one should then rather use the procedure summarised in Section 4.

Algorithm 2 Summary of the block updating procedure for computing the posterior ensemble x^(1),Block, ..., x^(M),Block from the prior ensemble x^(1), ..., x^(M).
Given x^(1), ..., x^(M), y, H, R and C_b, D_b, E_b for b = 1, ..., B.
For m = 1, ..., M:
  For b = 1, ..., B:
    Compute x̃^(m),D_b by Steps 2 to 7 in Algorithm 1 when x^(m), x̃^(m), y, µ, Q, H and R are replaced with x^(m),D_b, x̃^(m),D_b, y_{J_b} − a_b, µ_b, Q_b, H_b and R_b, respectively.
    For each ℓ ∈ C_b, set element ℓ of x^(m),Block equal to the corresponding element of x̃^(m),D_b.
When forming x̃^(m) from x^(m) in Step 7 of Algorithm 1, one would intuitively expect that element number k of x^(m) has a negligible effect on element ℓ of x̃^(m) whenever the correlation between elements k and ℓ of the state vector is sufficiently weak. In numerical experiments where the dimension of the state vector is sufficiently small so that we can use the procedure in Section 3.3, this intuition has already been confirmed. This motivates our first approximation, which is to not allow an element of x^(m) to influence an element of x^(m),Block when the two elements of the state vector have a very low correlation. More formally, for each b = 1, ..., B we define a set of nodes D_b so that C_b ⊆ D_b, where D_b should be chosen so that elements j ∈ C_b of the state vector are only weakly correlated with elements i ∈ S \ D_b. In the approximation we only allow an element ℓ ∈ C_b of x^(m),Block to be a function of element k of x^(m) if k ∈ D_b. In the 2D lattice example it is natural to define D_b by expanding the C_b block of nodes with u, say, nodes in each direction, see the illustration in Figure 3(b) where u = 2 is used.
To decide how the elements ℓ ∈ C_b of x^(m),Block should depend on the elements x_k, k ∈ D_b, the most natural procedure would be the following. Start with the assumed joint Gaussian distribution for the state vector x and the observation vector y,

f_{x,y|θ}(x, y|θ) = N(x; µ, Q) · N(y; Hx, R).    (45)

Then we could have marginalised out all elements of the state vector that are not in D_b, where we use the notation introduced in Section 2.2 and let x_{D_b} and x_{−D_b} denote the vectors of the elements of x related to the sets D_b and S \ D_b, respectively. From this, one could find the marginal distribution for the elements of the state vector related to the block D_b, i.e. f_{x_{D_b}|θ}(x_{D_b}|θ), and the conditional distribution for the observation vector y given the elements of the state vector related to D_b, f_{y|x_{D_b},θ}(y|x_{D_b}, θ). As f_{x_{D_b}|θ}(x_{D_b}|θ) becomes a Gaussian distribution, and f_{y|x_{D_b},θ}(y|x_{D_b}, θ) becomes a Gaussian distribution where the mean vector is a linear function of x_{D_b} and the precision matrix is constant, one could use the procedure specified in Section 3.3 to find the optimal update of the sub-vector x^(m),D_b, x̃^(m),D_b say. Finally, we could for each ℓ ∈ C_b have set element ℓ of x^(m),Block equal to the value of the corresponding element in x̃^(m),D_b. However, such a procedure is not computationally feasible when the dimension of the state vector is large, and for two reasons. First, the marginalisation in (45) is computationally expensive when dependence is represented by precision matrices. Second, when the dimension of the state vector is large the dimension of the observation vector y is typically also large, which makes such a marginalisation process even more computationally expensive. We therefore do one more approximation before following the procedure described above. Instead of starting out with (45), we start with a corresponding conditional distribution, where we condition on the values of the elements of the state vector x and observation vector y that are only weakly correlated with the
elements in x_{D_b}. The motivation for such a conditioning is twofold. First, conditioning on elements that are only weakly correlated with the elements in x_{D_b} should not change the distribution of x_{D_b} much. Second, when representing dependence by precision matrices, the computation of conditional distributions is computationally very efficient. In the following we define this last approximation more formally and discuss related computational aspects.
For each b = 1, ..., B we define a set of nodes E_b so that D_b ⊆ E_b. The set E_b should be chosen so that the nodes in D_b are weakly correlated with the nodes that are not in E_b. For the 2D lattice example it is reasonable to define E_b by expanding the D_b block with, say, v nodes in each direction. Figure 3(b) displays an example of E_b where v = 2. Moreover, we let a set J_b contain the indices of the elements in y that by the likelihood are linked to one or more elements in x_{E_b}, i.e.

J_b = {i : H_{i,k} ≠ 0 for some k ∈ E_b}.
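The construction of the nested sets C_b ⊆ D_b ⊆ E_b for the 2D lattice example can be sketched as follows. The function and helper names are our own; we assume rectangular r_C × s_C blocks, expansions by u and v nodes as in the text, and clipping at the lattice border.

```python
def make_blocks(r, s, rC, sC, u, v):
    """Partition an r x s lattice into rC x sC blocks C_b, and build the
    enclosing sets D_b (C_b expanded by u nodes in each direction) and
    E_b (D_b expanded by v further nodes), clipped at the lattice border.
    Nodes are (row, col) pairs with 1-based indices."""
    def rect(k0, k1, l0, l1):
        # All lattice nodes in the rectangle [k0, k1] x [l0, l1], clipped to the lattice.
        return {(k, l) for k in range(max(1, k0), min(r, k1) + 1)
                       for l in range(max(1, l0), min(s, l1) + 1)}
    blocks = []
    for k0 in range(1, r + 1, rC):
        for l0 in range(1, s + 1, sC):
            k1, l1 = k0 + rC - 1, l0 + sC - 1
            C = rect(k0, k1, l0, l1)
            D = rect(k0 - u, k1 + u, l0 - u, l1 + u)
            E = rect(k0 - u - v, k1 + u + v, l0 - u - v, l1 + u + v)
            blocks.append((C, D, E))
    return blocks
```

By construction the sets C_b form a partition of S while C_b ⊆ D_b ⊆ E_b for every b, which is exactly the structure the block update requires.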
Starting from (45) we now find the corresponding conditional distribution when conditioning on x_k being equal to its mean value µ_k for all k ∈ S \ E_b, and conditioning on y_k being equal to its mean value (Hx)_k for all k ∉ J_b. The resulting conditional distribution we denote by g(x_{E_b}, y_{J_b}|θ). This (conditional) distribution is also multivariate Gaussian, and the mean vector and the precision matrix can be found by using the expressions in Section 2.2. One should note that the conditional precision matrix is immediately given, simply by clipping out the parts of the unconditional precision matrix that belong to x_{E_b} and y_{J_b}. The formula for the conditional mean includes a matrix inversion, where the size of the matrix that needs to be inverted equals the sum of the dimensions of the vectors x_{E_b} and y_{J_b}. The dimension of the matrix is thereby not problematically large. Moreover, one should note that in practice the conditional mean is computed using the techniques discussed in Section 2.1, thereby avoiding the matrix inversion. From g(x_{E_b}, y_{J_b}|θ) we marginalise over x_{E_b \ D_b} to get g(x_{D_b}, y_{J_b}|θ), which is also Gaussian and where the mean vector and the precision matrix can be found as discussed in Section 2.2. Finally, for each ℓ ∈ C_b, element ℓ of x^(m),Block is set equal to element ℓ_b of x̃^(m),D_b, where ℓ_b is the index of the element in x̃^(m),D_b that corresponds to element ℓ in x^(m),Block.
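The claim that the conditional precision matrix is obtained simply by clipping out a sub-block, while the conditional mean needs only one linear solve, can be sketched for a generic Gaussian vector. The function name and the small dense test are our own illustration of this standard precision-matrix identity, not code from the paper.

```python
import numpy as np

def condition_on_complement(Q, mu, keep, x_fixed):
    """For x ~ N(mu, Q^{-1}) with precision matrix Q, return the mean and
    the precision matrix of the sub-vector x[keep], conditional on the
    remaining elements being fixed at the values x_fixed (ordered by
    increasing index).  The conditional precision is simply the
    (keep, keep) sub-block of Q; the mean needs one linear solve."""
    keep = np.asarray(keep)
    rest = np.setdiff1d(np.arange(len(mu)), keep)
    Q_aa = Q[np.ix_(keep, keep)]
    Q_ab = Q[np.ix_(keep, rest)]
    mu_cond = mu[keep] - np.linalg.solve(Q_aa, Q_ab @ (x_fixed - mu[rest]))
    return mu_cond, Q_aa
```

The result agrees with the usual covariance-based conditioning formula, but contrary to that formula it never touches the elements outside the kept block except through the sparse off-diagonal sub-block Q_ab.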

Simulation example
In the following we present a numerical experiment with the model-based EnKF using the proposed block update approximations. The purpose of the experiment is both to study the computational speedup we obtain by using the block updating, and to evaluate empirically the quality of the approximation introduced by adopting the block updating procedure. To do this we do a simulation study where we run the model-based EnKF with block updates and the exact model-based EnKF on the same observed values, and compare the run time and filter outputs. As our focus is on the effect of the approximation introduced by using the block update, we use the Gaussian POMM prior distribution introduced in Section 4.1 for both filter variants, and we also use the exact same observed data in both cases. We first define and study the results for a linear example in Section 6.1 and thereafter discuss a non-linear example in Section 6.2.

Linear example
In the simulation experiment we first generate a series of reference states {x_t}_{t=1}^T and corresponding observed values {y_t}_{t=1}^T. The series of reference states we consider as the unknown true values of the latent x_t process, and we use the corresponding generated observed values in the two versions of the EnKF. We then compare the computational demands and the outputs of the two filtering procedures. Lastly, we compare the output of the block update procedure to the Kalman filter solution.

Reference time series and observations
To generate the series of reference states we adopt a two-dimensional variant of the one-dimensional setup used in Omre and Myrseth (2010). For each time step t = 1, ..., T the reference state x_t represents the values in a two-dimensional s × s lattice. As described in Section 5 we number the nodes in the lexicographical order. The reference state at time t = 1 we generate as a moving average of a white noise field. More precisely, we first generate independent standard normal variates z(k, ℓ) for each node (k, ℓ), and to avoid boundary effects we do this for an extended lattice. The reference state at node (k, ℓ) and time t = 1 is then defined by

x_1(k, ℓ) = sqrt(20/|Γ_{3,(k,ℓ)}|) · Σ_{(i,j) ∈ Γ_{3,(k,ℓ)}} z(i, j),

where Γ_{r,(k,ℓ)} is the set of nodes in the extended lattice that are at a distance less than or equal to r from node (k, ℓ), and |Γ_{r,(k,ℓ)}| is the number of elements in that set. One should note that the factor sqrt(20/|Γ_{3,(k,ℓ)}|) gives that the variance of each generated x_1(k, ℓ) is 20. Figure 4(a) shows a generated x_1 with s = 100, which is used in the numerical experiments discussed in Sections 6.1.4 and 6.1.5. Given the reference state at time t = 1, corresponding reference states at later time steps are generated sequentially. For t = 2, ..., T the reference state at time t − 1 is used to generate the reference state at time t by performing a moving average operation on nodes that are inside an annulus defined by circles centred at the middle of the lattice. For t = 2 the radius of the inner circle defining the annulus is zero, and as t increases the annulus is gradually moved further away from the centre. More precisely, to generate x_t from x_{t−1} we define the annulus by two radii that increase with t, where ⌊v⌋ denotes the largest integer less than or equal to v. For all nodes (k, ℓ) inside the annulus we define x_t(k, ℓ) as a moving average of x_{t−1} over a neighbourhood of (k, ℓ). For nodes (k, ℓ) that are not inside the specified annulus the values are unchanged, i.e. for such nodes we have x_t(k, ℓ) = x_{t−1}(k, ℓ).
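The generation of the initial reference state can be sketched as follows. This is our reading of the construction: the scaling factor sqrt(20/|Γ|) makes each value an exact variance-20 weighted sum of independent standard normals, and the noise field is generated on a lattice extended by r nodes in each direction to avoid boundary effects. The function name and seed handling are our own.

```python
import numpy as np

def initial_state(s, r=3, seed=0):
    """Generate an s x s initial reference state as a moving average of
    standard normal white noise over discs of radius r, scaled so that
    each value has variance 20.  The noise is generated on a lattice
    extended by r nodes in each direction to avoid boundary effects."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((s + 2 * r, s + 2 * r))
    # Offsets (i, j) within Euclidean distance r of a node.
    offs = [(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1)
            if i * i + j * j <= r * r]
    x = np.zeros((s, s))
    for k in range(s):
        for l in range(s):
            total = sum(z[k + r + i, l + r + j] for i, j in offs)
            # Var(x[k, l]) = (20 / |offs|) * |offs| = 20 exactly.
            x[k, l] = np.sqrt(20.0 / len(offs)) * total
    return x
```

Since neighbouring nodes share most of their averaging discs, the resulting field is smooth with correlation decaying over roughly 2r nodes, matching the qualitative appearance of Figure 4(a).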
The reference solution at time T = 5 corresponding to the x_1 given in Figure 4(a) is shown in Figure 4(b). By comparing the two we can see the effect of the smoothing operations.
For each time step t = 1, ..., T we generate an observed value associated with each node in the lattice, so n_y = n_x. We generate the observed values at time t, y_t, by blurring x_t and adding independent Gaussian noise in each node. More precisely, the observed value in node (k, ℓ) at time t is generated by adding to the blurred value a noise term w_{(k−1)·s+ℓ}, where w_{(k−1)·s+ℓ} is a zero-mean normal variate with variance 20. Figures 4(c) and 4(d) show the generated observations corresponding to the x_1 and x_5 in Figures 4(a) and (b), respectively. These observations are used in the numerical experiments discussed in Sections 6.1.4 and 6.1.5.

Assumed model and algorithmic parameters
In the numerical experiments with EnKFs we use the assumed model defined in Section 3.1. For nodes sufficiently far away from the lattice borders we let the sequential neighbourhood consist of ten nodes, as shown in Figure 5. This set of sequential neighbours should be sufficient to represent a large variety of spatial correlation structures, at the same time as it induces sparsity in the resulting precision matrix. For nodes close to the lattice borders we reduce the number of sequential neighbours to consist of only the subset of the ten nodes shown in Figure 5 that are within the lattice.
As priors for η and φ we use the parametric forms specified in Section 4.1. We want these priors to be vague so that the resulting posterior distributions are mainly influenced by the prior ensemble. We let the prior for φ be improper.

In preliminary simulation experiments we found that the Gibbs sampler for the sampling of θ converged very rapidly, consistent with our discussion in Section 4.2. In the numerical experiments we therefore only used five iterations of the Gibbs sampler. When using the block update procedure we let each C_b consist of a block of 20 × 20 nodes. To define the corresponding sets D_b and E_b we follow the procedure outlined in Section 5 with u = v = 5.

Comparison of computational demands
The main objective of the block update procedure is to provide a computationally efficient approximation to the optimal model-based EnKF procedure defined in Section 3. In this section we compare the computational demands of the two updating procedures as a function of the number of nodes in the lattice.
For an implementation in Python, we run the two EnKF updating procedures discussed above with M = 25 ensemble elements for T = 5 time steps for different lattice sizes, and monitor the computing time used in each case. The results are shown in Figure 6. The observed computing times are shown as dots in the plot, in blue for the optimal model-based EnKF procedure and in orange for the block updating procedure. The lines between the dots are just included to make it easier to read the plot. As we should expect from the discussion in Section 5, we observe that the speedup resulting from adopting the block update increases quickly with the lattice size.

Approximation error in block update

When adopting the block update in an EnKF procedure we will clearly get an approximation error in each update. When assessing this approximation error it is important to realise that the approximation error in one update may influence the behaviour of the filter also in later iterations. First, the effect of the approximation error that emerged in one update may change when later iterations of the filter are run; the error may increase or decrease when run through later updates. Second, since the EnKF we are using is a stochastic filter, with the stochasticity lying in the sampling of θ, even a small change after one update may have a large impact on the resulting values in later iterations. In particular it is important to note that even if the approximation error in one update only marginally changes the distribution of the values generated in later iterations, it may have a large impact on the actually sampled values. In this section our focus is first to assess the approximation error in one update when using the block update, whereas in the second part of this section we concentrate on the accumulated effect of the approximations after several iterations.
Adopting the reference time series and the data described in Section 6.1.1, and the assumed model and algorithmic parameters defined in Section 6.1.2, we isolate the approximation error in one update by the following procedure. We first run the optimal model-based EnKF procedure for all T = 5 steps. Thereafter, for each t = 1, ..., T, we start with the prior ensemble generated for that time step in the optimal model-based EnKF and perform one step with the block update. One should note that since we are using the same prior ensemble for the block update as for the optimal model-based EnKF, the generated parameters θ are identical for both updates, so the resulting differences in the updated values are only due to the difference in the B matrix used in the two procedures.
Figure 7 shows results for the nodes (50, ℓ), ℓ = 1, ..., 100 at time step t = 4. The ensemble averages and bounds of the empirical 90% credibility intervals for the optimal model-based EnKF procedure are shown in blue, and the corresponding values when using the block update procedure for time t = 4 are shown in dashed orange. For comparison the reference is shown in black. One can observe that for most nodes the mean value and interval bounds when using the block update are visually indistinguishable from the corresponding values when using the optimal model-based EnKF. The largest difference can be observed for the lower bound of the credibility interval for ℓ = 80. Remembering that we are using blocks of 20 × 20 nodes in the block update, it should not come as a surprise that the largest errors are for values of ℓ that are multiples of 20.
To study the approximation errors in more detail, and in particular how the block structure used in the block update influences the approximation, we in Figure 8 show, for each t = 1, ..., 5, the difference between the averages of the ensembles when using the optimal model-based EnKF and when using the block update procedure. As one should expect, we see that the largest differences occur along the borders of the 20 × 20 blocks used in the block update. We can, however, also observe that the magnitudes of the errors are all small compared with the spatial variability of the underlying x_t process, which explains why the approximation errors are almost invisible in Figure 7.
We then turn our focus to the accumulated effect of the approximation error over several iterations. To do this we again adopt the reference time series and the data described in Section 6.1.1, and the assumed model and algorithmic parameters defined in Section 6.1.2. Based on this we run each of the optimal model-based EnKF and the block update procedures several times, each time with a different initial ensemble and using different random numbers in the updates. For two runs with each of the two filters, Figure 9 shows in green the values of all the M = 25 ensemble elements in nodes (50, ℓ), ℓ = 1, ..., 100 at time T = 5. The reference state at T = 5 is shown in black and the observations at time T = 5 are again shown as red dots. In Figure 10 we compare the two filters by histograms of Kolmogorov–Smirnov statistics, in blue for the optimal model-based EnKF and in red for the block update procedure, and observe that the mass in the red histograms is slightly moved to the right relative to the corresponding blue histograms. This is as one would expect, but we also observe that the effect of the approximation error is negligible compared to the Monte Carlo variability.

Comparison of the results with the Kalman filter output
Since the forward model in this numerical example is linear and deterministic, and the observation model is Gauss-linear, we are able to compare the block update to the Kalman filter solution. Figure 11 displays the mean of the Kalman filter solution along with the 90% credibility interval for the nodes (50, ℓ), ℓ = 1, ..., 100, in blue, at t = 4. The red lines visualise the empirical mean and empirical 90% credibility interval as given by one simulation of the block update. When comparing the two sets of curves one should remember that the EnKF output is stochastic, so another EnKF run would produce a different set of curves. In general, however, we notice that the mean given by the block update follows the mean of the Kalman filter quite well.
When comparing the credibility intervals we can observe that the intervals provided by the block update appear to be somewhat wider than the credibility intervals provided by the Kalman filter. This is as one should expect. At each step of the block update filter, as for any other ensemble-based filter, we represent the forecast and filtering distributions by a rather small set of realisations. So compared to the Kalman filter, where the exact forecast and filtering distributions are represented by mean vectors and covariance matrices, we necessarily lose some information about the underlying true state vector at each iteration of the block update filter when representing the distributions by a set of realisations only.
The results for times t = 1, 2, 3 and 5 are similar to the results for time t = 4, except that the widths of the credibility intervals decrease with time. The shrinkage of the widths of the credibility intervals is clearly visible for both the Kalman filter and the block update filter, but is stronger for the Kalman filter than for the block update filter. Such a shrinkage is as one should expect when the forward function is deterministic. That the shrinkage is strongest for the Kalman filter is reasonable, since in each iteration of the block update filter we lose some information about the underlying true state when representing the distributions by ensembles only.

Non-linear example
In the following we consider a non-linear example. We first describe how the reference solution and observations are generated, before we proceed to compare the results provided by the block update filter with the results of the optimal model-based EnKF.

Reference time series and observations
The reference state for the initial time step, x_1, is generated in the exact same manner as for the linear case. The forward model used is inspired by the non-linear forward function in Omre and Myrseth (2010). The forward function acts on each element of the state vector separately, and includes the term 0.5 · arctan(x/2), which is chosen to make the forward function clearly non-linear over the interval in which the elements of the state vector vary. The forward function is displayed in Figure 12. The observations are generated in the exact same manner as in the linear example. The assumed model and all the algorithmic parameters are also chosen identical to what we used in the linear example.

Approximation error in block update
As for the linear example, we first consider the approximation error obtained in one update, and thereafter study how the approximation error accumulates over multiple iterations.
To study the effect of the approximation error in one iteration we follow the same procedure as used in Section 6.1.4. In Figure 13 we compare the credibility intervals for the nodes (50, ℓ), ℓ = 1, ..., 100 provided by the optimal update and the block update at time t = 4, similar to Figure 7 for the linear case.

To study how the approximation errors accumulate over several iterations, we followed the same procedure as in the linear example. When comparing ensembles resulting from the two filters, corresponding to what is done in Figure 9, the accumulated effect of the approximation error again seems to be small compared to the Monte Carlo variability. Figure 14 displays the histograms of the Kolmogorov–Smirnov statistics for t = 1, ..., 5, analogous to Figure 10 for the linear case. We can arguably observe that the tendency of the mass in the red histograms to be moved to the right compared to the corresponding blue histograms is slightly stronger in this non-linear example than in the linear case.

Closing remarks
In this paper we propose two changes to the model-based EnKF procedure introduced in Loe and Tjelmeland (2021). Our motivation is to get a procedure that is computationally faster than the original one, so that it is feasible to use also in situations with high dimensional state vectors. The first change we introduce is to formulate the assumed model in terms of precision matrices instead of covariance matrices, and to adopt a prior for the precision matrix that ensures the sampled precision matrices to be sparse. The second change we propose is to adopt the block update, which allows us to do singular value decompositions of many smaller matrices instead of one large one.
In a simulation example we have studied both the computational speedup and the associated approximation error resulting from adopting the proposed procedure. The computational speedup is substantial for high dimensional state vectors, which allows the proposed filter to be run on much larger problems than can be handled with the original formulation. At the same time, the approximation error resulting from using the introduced block updating is negligible compared to the Monte Carlo variability inherent in both the original and the proposed procedures.
In order to further investigate the strengths and weaknesses of the proposed approach, it should be applied to more examples. It is of interest to gain more experience with the proposed procedure both in other simulation examples and in real data situations. In particular it is of interest to experiment more with different sizes for the C_b, D_b and E_b blocks, to try to find the best sizes for these when taking both the computational time and the approximation quality into account. As for all EnKFs, the proposed procedure is ideal for parallel computation, and with an implementation more tailored for parallelisation it should be possible to get a code that runs much faster than our more basic implementation.

A Computing µ and Q from φ and η

To find how θ = (µ, Q) is given from φ and η we restrict the Gaussian densities f(x|θ) and f(x|φ, η) to be identical. The density f(x|θ) is given by (54). Using the definition of f(x|φ, η) from Section 4.1 we get (55), where Λ_ℓ^{-1}(k) is the inverse function of Λ_ℓ(k), i.e. Λ_ℓ^{-1}(k) = t ⇔ Λ_ℓ(t) = k, and I(ℓ ∈ Λ_k) is the indicator function returning one if ℓ ∈ Λ_k and zero otherwise. By setting equal corresponding coefficients in front of (x_k)², x_k x_ℓ and x_k in (54) and (55) we can identify the elements of µ and Q. Setting equal the coefficients in front of (x_k)² we get the diagonal elements of Q, given in (56). To get the non-diagonal elements we set equal the coefficients in front of x_k x_ℓ.
For ℓ < k we have (57). In particular one should note that for ℓ < k we get Q_{k,ℓ} = 0 if ℓ ∉ Λ_k and there is no node t so that k, ℓ ∈ Λ_t. Thus, if the sequential neighbourhoods are chosen to be sufficiently small, Q will be sparse. For a more detailed discussion see Appendix B. Finally, by equating corresponding coefficients in front of x_k we get (58), which can be used to find the elements of the mean vector µ. Since Q is sparse, the strategy discussed in Section 2.1 can be used to compute µ efficiently.

B The sequential neighbourhoods and the structure of Q

In the following we discuss the relationship between the sequential neighbourhood and the structure of the precision matrix Q. We first identify a general expression for the set of non-zero elements in Q, and thereafter discuss the situation in more detail when the state vector is associated with the nodes in a two-dimensional lattice. From (56) and (57) we see that element (k, ℓ) of the precision matrix, Q_{k,ℓ}, may be non-zero only when ℓ ∈ ∂k, where S is the set of all nodes in the lattice. A perhaps more instructive formulation of the set ∂k is as a union of sequential neighbourhoods,

∂k = ⋃_{t ∈ S : k ∈ Λ_t ∪ {t}} (Λ_t ∪ {t}).    (60)

Now consider a situation where the state vector is associated with the nodes in a two-dimensional lattice, and assume all the sequential neighbourhoods to be of the same form, except for nodes close to the lattice boundaries where the sequential neighbourhood is necessarily smaller. In this situation (60) can be used to identify an upper bound on the number of non-zero elements in Q. The situation for this two-dimensional lattice case is illustrated in Figure 15. In the figure the red nodes are sequential neighbours of the green node. Assuming the green node to represent element k in the state vector, we let Ψ_k denote the set of nodes inside the smallest rectangle that contains node k and all the sequential neighbours of k. By construction we then have Λ_k ∪ {k} ⊆ Ψ_k. In Figure 15 the set Ψ_k consists of all nodes inside the red rectangle. Letting u and v denote the vertical and horizontal
dimensions, respectively, of the rectangle defining Ψ_k, again as illustrated in Figure 15, we get that the number of nodes in Ψ_k is |Ψ_k| = (u + 1)(v + 1). Replacing Λ_t ∪ {t} with Ψ_t in (60) we get a superset of ∂k,

Ω_k = ⋃_{t ∈ S : k ∈ Ψ_t} Ψ_t ⊇ ∂k.    (61)

Thereby an upper bound on the number of non-zero elements in row k of Q is the number of nodes in the set Ω_k. Since Ψ_k is defined by a rectangle, the set Ω_k can also be defined from a rectangle. In Figure 15 the set Ω_k consists of all nodes inside the green rectangle, whereas the set ∂k consists of the green, red and blue nodes in the figure. The vertical and horizontal dimensions of the rectangle defining Ω_k become 2u and 2v, respectively, and the number of elements in Ω_k becomes |Ω_k| = (2u + 1)(2v + 1). So the number of non-zero elements in row k of Q is bounded by (2u + 1)(2v + 1). Moreover, since we have n_x elements in the state vector, we will have at most (2u + 1)(2v + 1)·n_x non-zero elements in the precision matrix Q.
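The counting argument above can be checked with a short enumeration. For simplicity of the sketch we place node k at a fixed corner of the rectangle defining Ψ_k; the count of the union is the same wherever k sits inside the rectangle, since the union is always a (2u+1) × (2v+1) rectangle of offsets.

```python
def omega_size(u, v):
    """For an interior node k, with Psi_k a (u+1) x (v+1) rectangle of
    nodes, enumerate Omega_k, the union of all rectangles Psi_t that
    contain k, as a set of offsets relative to k, and return its size.
    This is the bound on the number of non-zeros per row of Q."""
    # Psi_t for a node t at offset (a, b) from k covers offsets
    # (a .. a+u, b .. b+v); it contains k=(0, 0) iff -u <= a <= 0 and -v <= b <= 0.
    omega = {(a + i, b + j)
             for a in range(-u, 1) for b in range(-v, 1)
             for i in range(u + 1) for j in range(v + 1)}
    return len(omega)
```

Enumerating the union for any u and v reproduces the bound (2u + 1)(2v + 1) derived in the text.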
In the two-dimensional lattice case we can also provide an upper bound for the bandwidth of Q. Still assuming that each node in the lattice is assigned a similar sequential neighbourhood, an upper bound for the bandwidth is the maximum distance between node k and any node ℓ ∈ Ω_k, i.e. su + v, where s is the number of columns in the grid.

C Derivation of posterior for hyperpriors
In the following we derive the posterior distribution for (η, φ). In the derivation we simplify the notation by omitting the subscripts on the f's, as it should be clear from the context what densities we are dealing with. We start by identifying f(φ|x, z^(m)) and thereafter find f(η|φ, x, z^(m)).
Using the definitions of f(x|η, φ), f(φ) and f(η|φ) given in Section 4.1, and that z^(m) and x are conditionally independent given φ and η, we get (62). We note that the exponent in the integrand is a second order function in η_k, so by completing the square it is straightforward to evaluate the integral analytically. Having evaluated the integral and inserted for f(φ_k), this gives, using the notation defined in (38) to (41), a density proportional to an inverse gamma kernel with parameters α_k and β_{m,k} as given in (36) and (37), respectively. We recognise this as the density of an inverse gamma distribution, so

φ_k | z^(m), x ∼ InvGam(α_k, β_{m,k}).    (63)

This concludes the derivation of the posterior distribution for φ, and we proceed to the derivation of the posterior for η|φ. Using the definitions of f(x|η, φ) and f(η|φ) we get

f(η|x, z^(m), φ) ∝ f(η|φ) f(x, z^(m)|η, φ) = f(η|φ) f(x|η, φ) f(z^(m)|η, φ).

Fig. 3 Example of how the sets C b , D b and E b can be chosen when the elements of x are related to the nodes of a 2D lattice. The dots denote nodes. In (a) the nodes are partitioned in B different blocks, while (b) shows an example of how D b and E b might look

2. The g(x D b , y J b |θ) should be thought of as a substitute for the f(x D b , y J b |θ) defined in (46). So, following what we ideally would have liked to do with f(x D b , y J b |θ), we use g(x D b , y J b |θ) to form the marginal distribution for x D b , g(x D b |θ), and the conditional distribution for y J b given x D b , g(y J b |θ, x D b ). These two distributions are both Gaussian and can be found using the expressions in Section 2.2. The g(x D b |θ) should be thought of as a prior for x D b , and g(y J b |θ, x D b ) represents the likelihood for y J b given x D b . In the following we let µ b and Q b denote the resulting mean vector and precision matrix in the prior g(x D b |θ), respectively. From (8) we get that the Gaussian likelihood g(y J b |θ, x D b ) has a conditional mean in the form E[y J b |x D b , θ] = a b + H b x D b , where a b is a column vector of size |J b | and H b is a |J b | × |D b | matrix. The precision matrix of the likelihood g(y J b |θ, x D b ) we in the following denote by R b . Noting that observing y J b is equivalent to observing y J b − a b , we thereby have a Bayesian model corresponding to what we discussed in Section 3.1, except that now x is replaced with x D b , the ensemble element x (m) is similarly replaced with the sub-vector x (m),D b , and y, µ, Q, H and R are replaced with y J b − a b , µ b , Q b , H b and R b , respectively. To update x (m),D b we can thereby use Steps 2 to 7 in Algorithm 1 to find the optimal update of x (m),D b , x̃ (m),D b say. Finally, for each ℓ ∈ C b , we set x ℓ (m),Block = x̃ ℓ (m),D b .
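To make the role of the sets concrete, here is a small sketch of one way to construct C b and D b on an s × s lattice, in the spirit of Figure 3: the C b 's partition the nodes, and each D b enlarges C b with a boundary layer E b of width w. The block size and halo width are illustrative choices of ours, not values from the paper.

```python
def make_blocks(s, block_size, w):
    """Return a list of (C_b, D_b) pairs of node sets for an s x s lattice."""
    blocks = []
    for r0 in range(0, s, block_size):
        for c0 in range(0, s, block_size):
            # C_b: the nodes whose updated values are kept after the block update.
            C = {(r, c)
                 for r in range(r0, min(r0 + block_size, s))
                 for c in range(c0, min(c0 + block_size, s))}
            # D_b = C_b plus a surrounding halo E_b of width w (clipped at the edges).
            D = {(r, c)
                 for r in range(max(r0 - w, 0), min(r0 + block_size + w, s))
                 for c in range(max(c0 - w, 0), min(c0 + block_size + w, s))}
            blocks.append((C, D))
    return blocks
```

After updating each sub-vector x (m),D b , only the entries with indices in C b would be copied into the result, which is what makes the blocks non-overlapping in the final state even though the D b 's overlap.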

Fig. 4 Linear example: Reference states at times (a) t = 1 and (b) t = 5, and corresponding generated observed values at times (c) t = 1 and (d) t = 5

Fig. 5 The sequential neighbourhood used in the numerical examples. The red dots denote the sequential neighbours of the green node

Fig. 6 Linear example: Computing time used for running each of the two model-based EnKF procedures with M = 25 ensemble elements for T = 5 time steps, as a function of the number of nodes n x = s^2 in the lattice. Computing time for the optimal model-based EnKF procedure is shown in blue, and the corresponding computing time for the block updating procedure is shown in orange

Fig. 7 Linear example: Assessment of the approximation error by using the block update at step t = 4. The figure shows values for nodes (50, ℓ), ℓ = 1, . . ., 100. The values of the reference are shown in black and the observations at t = 4 are shown in red. The ensemble averages and bounds for the empirical 90% credibility intervals when using the optimal model-based EnKF are shown in blue. The corresponding ensemble averages and bounds for the empirical 90% credibility intervals when using the block update are shown in dashed orange
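The ensemble averages and empirical 90% credibility bounds shown in figures like this one can be computed directly from the ensemble. A small sketch, assuming the ensemble is stored as an (M, n x) array with one row per ensemble member:

```python
import numpy as np

def ensemble_summary(ensemble):
    """Node-wise ensemble mean and empirical 90% credibility bounds."""
    mean = ensemble.mean(axis=0)                   # average over the M members
    lower = np.percentile(ensemble, 5.0, axis=0)   # empirical 5% quantile
    upper = np.percentile(ensemble, 95.0, axis=0)  # empirical 95% quantile
    return mean, lower, upper
```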

Fig. 9 Fig. 10
Fig. 8 Linear example: Difference between the averages of the ensembles when using the optimal model-based EnKF and when using the block update. Results are shown for each t = 1, . . ., 5

Fig. 11 Linear example: The blue lines display the mean of the Kalman filter solution along with the 90% credibility intervals for one cross section at time t = 4.The red lines visualise the empirical mean and empirical 90% credibility interval for the block update for the same cross section at the same time step.

Fig. 12 Non-linear example: The term 0.5 • arctan(x/2) in the forward function of the non-linear example
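For reference, a one-line sketch of the term shown in Figure 12 (only this term; how it enters the full forward function is defined elsewhere in the paper):

```python
import numpy as np

def nonlinear_term(x):
    # The bounded non-linear term 0.5 * arctan(x / 2); its range is (-pi/4, pi/4).
    return 0.5 * np.arctan(x / 2.0)
```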

Fig. 13 Non-linear example: Assessment of the approximation error by using the block update at step t = 4. The figure shows values for nodes (50, ℓ), ℓ = 1, . . ., 100. The values of the reference are shown in black and the observations at t = 4 are shown in red. The ensemble averages and bounds for the empirical 90% credibility intervals when using the optimal model-based EnKF are shown in blue. The corresponding ensemble averages and bounds for the empirical 90% credibility intervals when using the block update are shown in dashed orange

Fig. 14 Non-linear example: Histograms of the two-sample Kolmogorov-Smirnov statistics. The statistics computed from two ensembles updated with the optimal model-based EnKF are displayed in blue, while the statistics computed from two ensembles obtained with different updating procedures are visualised in red
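The two-sample Kolmogorov-Smirnov statistic behind these histograms can be computed from the two marginal ensembles of a single node as the largest gap between the two empirical CDFs. A self-contained sketch (the inputs here are generic arrays, not the paper's ensembles):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic between samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # evaluating at all sample points suffices
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))
```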

Funding. Geophysics and Applied Mathematics in Exploration and Safe production (GAMES) at NTNU (Research Council of Norway; Grant No. 294404).

Declarations

Conflict of interest. No conflicts of interest.

Appendices

A Derivation of θ = (µ, Q) from φ and η

Fig. 15 The green node represents node k, while the red nodes represent Λ k . The nodes inside the red rectangle visualise Ψ k . The blue nodes represent the remaining nodes in ∂k, while the nodes inside the green rectangle represent Ω k

B Sparseness of Q