Ensemble updating of categorical state vectors

An ensemble updating method for categorical state vectors is proposed. The method is based on a Bayesian view of the ensemble Kalman filter (EnKF). In the EnKF, Gaussian approximations to the forecast and filtering distributions are introduced, and the forecast ensemble is updated with a linear shift. Given that the Gaussian approximation to the forecast distribution is correct, the EnKF linear update corresponds to conditional simulation from a Gaussian distribution with mean and covariance such that the posterior samples marginally are distributed according to the Gaussian approximation to the filtering distribution. In the proposed approach for categorical vectors, the Gaussian approximations are replaced with a (possibly higher order) Markov chain model, and the linear update is replaced with simulation based on a class of decomposable graphical models. To make the update robust against errors in the assumed forecast and filtering distributions, an optimality criterion is formulated, for which the resulting optimal updating procedure can be found by solving a linear program. We explore the properties of the proposed updating procedure in a simulation example where each state variable can take three values.


Introduction
State space models are widely used to analyse dynamic data in a wide range of scientific disciplines, e.g. in finance, reservoir modelling, weather forecasting, and signal processing. A general state space model consists of an unobserved process {x t } T t=1 and a corresponding observed process {y t } T t=1 where y t is a partial observation of x t . The unobserved x t -process, usually called the state process, is assumed to be a first-order Markov process, and the observations y 1 , . . . , y T of the observed process B Margrethe Kvale Loe loe.margrethe@gmail.com 1 Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway are assumed to be conditionally independent given {x t } T t=1 with y t only depending on x t . The main objective of state space modelling is some type of inference about the state process given the observations. There are many inference procedures associated with state space models, among which one of the most common is filtering. Filtering, which is the problem addressed in the present article, refers to the task of computing, for each time step t = 1, . . . , T , the distribution of the state x t given all observations y 1:t = (y 1 , . . . , y t ) available at time t. In some fields, filtering is known as sequential data assimilation. Other common terms are history matching and online inference. However, in the present article, we use the term filtering throughout.
Because of the particular dependency structure of the general state space model, the series of filtering distributions can be computed recursively according to a recursion which alternates between a forecast step and an update step. Generally, however, apart from a few simple special cases, the exact solution to the filtering recursions is intractable due to complex and/or high-dimensional integrals. Approximate strategies are therefore required, and simulation-based methods, or ensemble methods, represent the most popular approach. An ensemble-based solution may, similarly to the original filtering recursions, alternate between a forecast step and an update step. Instead, however, of computing the forecast and filtering distributions explicitly, the distributions are represented empirically with an ensemble of realisations. The main challenge in this context is the update step where, at time step t, an ensemble of (approximate) realisations from the so-called forecast distribution p x t |y 1:t−1 (x t |y 1:t−1 ) needs to be conditioned on the new observation y t so that an updated ensemble of (approximate) realisations from the filtering distribution p x t |y 1:t (x t |y 1:t ) is obtained. Since there is no straightforward way to approach this task, ensemble methods require approximations in the update step. This ensemble updating problem is the core focus of the present paper. In particular, we address the problem of updating an ensemble of categorical state vectors and present in detail an approximate ensemble updating method for this situation.
Among the ensemble filtering methods that have currently been proposed in the literature there are two main categories; particle filters (Gordon et al. 1993;Doucet et al. 2001) and ensemble Kalman filters (EnKFs) (Burgers et al. 1998;Evensen 2003;Tippett et al. 2003). Particle filters are based on importance sampling while EnKFs rely on a linear-Gaussian assumption about the underlying state space model. Particle filters have the advantage of being asymptotically exact in the sense that as the ensemble size goes to infinity, the filters converge to the exact filtering solution. In practical applications, however, computational resources often restrict the ensemble size to be quite small, and particle filters are known to collapse unless the ensemble size is very large compared to the state dimension (Snyder et al. 2008). For the EnKF, the solution is always biased unless the underlying state space model really is linear-Gaussian. Despite this fact, the EnKF often performs remarkably well also in non-linear, non-Gaussian situations and, unlike the particle filter, also scales well to problems with very high-dimensional states. The filter is, however, inappropriate in situations with categorical vectors, as considered in the present paper. Loe and Tjelmeland (2021b), in a follow-up study to Loe and Tjelmeland (2021a), present an alternative solution to the ensemble updating problem based on a generalised view of the EnKF. Specifically, they describe a general updating framework where the idea is first to introduce assumed models for the intractable forecast and filtering distributions and thereafter to update the prior samples by simulating samples from a distribution which, under the assumption that the assumed forecast distribution is correct, preserves the corresponding assumed filtering distribution. To make the update robust against the assumptions of the assumed forecast and filtering models, the distribution from which the posterior samples are simulated is also required to be optimal with respect to a chosen optimality criterion. More specifically, the updating distribution is required to minimise the expected value of a certain distance, or norm, between a prior (forecast) and posterior (filtering) ensemble member. The framework is Bayesian in the sense that the parameters of the forecast distribution are also treated as random variables. Uncertainty about these parameters are thereby incorporated into the updating. Two particular applications of the proposed framework are investigated, one continuous and one categorical. In the continuous case, the assumed forecast and filtering models are chosen as Gaussian distributions and the optimality criterion is to minimise the expected Mahalanobis distance between a prior and posterior ensemble member. The framework then leads to a Bayesian version of square root EnKF (Bishop et al. 2001;Whitaker and Hamill 2002;Tippett et al. 2003). In the categorical case, the assumed forecast and filtering distributions are instead chosen as first-order Markov chains and the optimality criterion is to minimise the expected number of variables of a prior state vector that change their values. An optimal transition matrix for simulating a posterior ensemble member from a corresponding prior ensemble member is constructed using a combination of dynamic and linear programming.
There are three important limitations to the updating procedure for categorical state vectors proposed in Loe and Tjelmeland (2021b). Firstly, the procedure is difficult to implement except in the binary case where there are only two possible values for each variable of the state vector. Consequently, the authors only demonstrate the method in binary numerical experiments. Secondly, the approximation to the forecast distribution is restricted to be a first-order Markov chain. This means that models with higherorder interactions, for example a higher-order Markov chain, cannot be considered. Thirdly, the procedure is not applicable in two-or three-dimensional problems since it requires that the state vector has a one-dimensional spatial arrangement. In the present article, we address the first and second of these three issues. Specifically, we present a modified and improved version of the updating procedure applicable also for K > 2 classes and which allows the use of a higher-order Markov chain as the approximate forecast distribution. In the procedure described in Loe and Tjelmeland (2021b), a directed acyclic graph (DAG) is put forward to update each prior realisation. The chosen structure of the DAG allows the corresponding optimal solution to be computed recursively using a dynamic programming algorithm where a piecewiselinear programming problem is solved in each recursive step. In the present article, the starting point is, instead of a DAG, an undirected graphical model (UGM). This UGM has a more flexible structure than the DAG in the sense that we can more easily consider different types of dependency properties without overcomplicating the computation of the optimal solution. The underlying graph is decomposable and this gives the UGM many convenient computational properties. The optimal updating distribution is computed by solving a linear program derived from a series of local computations on the UGM. The remains of this paper take the following outline. In Sect. 2, we review state space models and the associated filtering problem in more detail, and we also present some basic graph theory required to understand the proposed approach. In Sect. 3, we present a slightly modified version of the general ensemble updating framework in Loe and Tjelmeland (2021b), restricting the focus to categorical state vectors. In Sect. 4, we describe in detail how the general framework can be applied when a Markov chain model is adopted for the assumed forecast distribution. Thereafter, we present in Sect. 5 a simulation example where each element of the state vector can take K = 3 values. Finally, we finish off with a few closing remarks in Sect. 6.

Preliminaries
This section describes the filtering problem in more detail and also reviews some graph-theoretic concepts related to the proposed approach. The section also introduces notations that we use throughout the paper.

The filtering problem
A general state space model consists of an unobserved process {x t } T t=1 , x t ∈ x , called the state process, and a corresponding observed process {y t } T t=1 , y t ∈ y , called the observation process, where y t is a partial observation of x t at time t. The unobserved state process {x t } T t=1 is modelled as a first-order Markov chain with initial distribution p x 1 (x 1 ) and transition probabilities p x t |x t−1 (x t |x t−1 ), t = 2, . . . , T . Throughout this paper, we use the notations x s:t = (x s , . . . , x t ) and y s:t = (y s , . . . , y t ), s ≤ t, to denote the vector of all states and the vector of all observations, respectively, from time s to time t. The joint distribution of x 1:T follows from the Markov chain assumptions as For the observation process, it is assumed that the observations are conditionally independent given x 1:T , with y t only depending on x t . The conditional distribution of y 1:T given x 1:T thereby follows as p y 1:T |x 1:T (y 1:T |x 1: It is possible to adjust the model so that observations are only recorded at a subset of the time steps {1, . . . , T }. However, for the sake of simplicity, we assume in this work that an observation is recorded at every time step t = 1, . . . , T . The objective of the filtering problem is, for each time step t = 1, . . . , T , to compute the so-called filtering distribution, p x t |y 1:t (x t |y 1:t ), i.e. the distribution of the latent state x t given all the observations y 1:t available at time t. Because of the particular structure of the state space model, the series of filtering distributions can be computed recursively according to a recursion which alternates between a forecast step, and an update step, where p y t |y 1:t−1 (y t |y 1: The distribution p x t |y 1:t−1 (x t |y 1:t−1 ) computed in the forecast step is called the forecast distribution of x t . In the update step, this distribution is conditioned on the new observation y t in order to compute the filtering distribution of x t , p x t |y 1:t (x t |y 1:t ). The update step is essentially a standard Bayesian inference problem with the forecast distribution becoming the prior and the filtering distribution the posterior. There are two important special cases where the filtering recursions can be computed exactly. The first is the linear-Gaussian model where the initial distribution p x 1 (x 1 ) is Gaussian and where p x t |x t−1 (x t |x t−1 ) and p y t |x t (y t |x t ) are Gaussian with mean vectors being linear functions of x t−1 and x t , respectively. The forecast and filtering distributions are then also Gaussian and Eqs. (1) and (2) lead to the famous Kalman filter (Kalman 1960). The second situation where the filtering recursions are tractable is the finite state space hidden Markov model for which the state space x consists of a finite number of states. The integrals in Eqs. (1) and (3) then reduce to finite sums. If, however, the number of states in x is large, for example if x t is a high-dimensional vector of categorical variables, the summations become too computer-demanding to cope with, and the filtering recursions are left computationally intractable.
Generally, the integrals in Eqs.
(1) and (3) make the recursive solution to the filtering problem intractable, and approximate solutions therefore become necessary. The most popular approach is the class of ensemble-based methods, where a set of samples, an ensemble, is used to empirically represent the sequence of filtering distributions. A great advantage of the ensemble context is that it simplifies the forecast step. Specifically, if an ensemble { x t,(1) , . . . , x t,(M) } of independent realisations from the filtering distribution p x t |y 1:t (x t |y 1:t ) is available, a forecast ensemble {x t+1,(1) , . . . , x t+1,(M) } with independent realisations from the forecast distribution p x t+1 |y 1:t (x t+1 |y 1:t ) can be obtained by simulating . , x t+1,(M) } of independent realisations from the filtering distribution p x t+1 |y 1:t+1 (x t+1 |y 1:t+1 ) is obtained. In the present article, we propose an approximate way to do this when the elements of the state vector are categorical variables.

Decomposable graphical models (DGMs)
This section introduces decomposable graphical models (DGMs), a certain type of undirected graphical models, or Markov random fields (Kindermann and Snell 1980;Cressie 1993;Cowell et al. 1999). For simplicity, the focus is restricted to discrete DGMs. In the following, we start with a brief review of some basic theory on undirected graphs in Sects. 2.5 consider simulation from discrete DGMs. A more thorough introduction to graph theory and graphical models can be found in, e.g., Cowell et al. (1999).

Undirected graphs
An undirected graph G is an ordered pair G = (V , E) where V is a set of vertices, or nodes, and E ⊂ {V × V } is a set of edges. The elements of the edge set E are pairs of distinct nodes, {i, j}, i, j ∈ V , i = j. If {i, j} ∈ E then node i and node j are said to be neighbours, or adjacent. Figure 1 illustrates a simple undirected graph with four vertices where, as per convention, vertices are represented by labelled circles and edges by lines between the circles. For this graph we have V = {1, 2, 3, 4} and If there is an edge between every pair of nodes in a graph G, the graph is said to be complete.
A clique is called a maximal clique in G if it is not a subset of another clique. Throughout this article, we denote the set of maximal cliques by C. For the graph pictured in Fig. 1 A path of length n from node i to node j is a sequence (α 0 , . . . , α n ) of distinct nodes where α 0 = i and α n = j and {α k−1 , α k } ∈ E, k = 1, . . . , n. Note that if there is a path from node i to node j in an undirected graph, there is also a path from node j to node i. For the graph pictured in Fig. 1, there are two paths from node 1 to node 4: (1, 2, 3, 4) and (1, 3, 4). Two nodes i and j are said to be connected if there is a path from node i to node j, and an undirected graph is said to be connected if every pair of vertices are connected. A tree is a connected undirected graph with the additional property that the path between every pair of vertices is unique. The graph in Fig. 1 is thus not a tree since there are different paths between some of the vertices.

Decomposable graphs and junction trees
An undirected graph G = (V , E) is said to be decomposable if the set of maximal cliques can be ordered in a way so that all elements in maximal clique number i (say) that are also in any later maximal cliques, are all contained in one of the later maximal cliques. If an undirected graph is decomposable, a so called junction tree for the maximal cliques can be defined, which in turn allow efficient computations of many properties.
Mathematically, the requirement for an undirected graph to be decomposable can be formulated as follows. Let C = {c 1 , . . . , c |C| } denote the set of maximal cliques of an undirected graph G = (V , E), where |C| denotes the number of maximal cliques. The graph is then decomposable if the elements of C can be ordered as (c 1 , . . . , c |C| ) so that for each i = 1, . . . , |C| − 1 there is a j > i such that The property in Eq. (4) is called the running intersection property, and the sets s 1 , . . . , s |C|−1 are called the separators of the graph. The set of all separators, S = {s 1 , . . . , s |C|−1 }, and the set of maximal cliques, C, are uniquely determined by the structure of the graph G; however, the ordering (c 1 , . . . , c |C| ) is generally not unique. Figure From a decomposable graph, a corresponding junction tree can be derived. A junction tree J for a decomposable graph G is a tree with C = {c 1 , . . . , c |C| } as its node set and the additional property that for every pair c i , c j ∈ C every node on the unique path between c i and c j in J contains the intersection c i ∩ c j . In a visual representation of a junction tree, it is common to include the separators as squared labels on the edges. This is illustrated in Fig. 2b which shows one of the possible junction tree representations for the decomposable graph in Fig. 2a.
A junction tree is a nice way to organise a decomposable graph, and many computations are easier to perform on the junction tree. Depending on the structure of the graph, however, it can be a complicated task to construct a corresponding junction tree. There exist several algorithms for this purpose, see Cowell et al. (1999). In all the examples we encounter in the present article, the decomposable graph under study has a structure which makes it particularly simple to construct a junction tree, and therefore we do not focus on the problem of constructing junction trees in this paper.

Discrete decomposable graphical models
, and a probability distribution p x (x). Alternatively, a discrete DGM can be defined as a discrete Markov random field whose underlying graph is decomposable. In the following, the notation x A is used to denote the variables of x associated with subset A ⊆ V , and x A ⊆ x is the sample space of x A . Taking 0/0 = 0, the distribution p x (x) of a discrete DGM can always be expressed as where C is the set of maximal cliques in G and S is the set of separators (Cowell et al. 1999). This expression can easily be shown using basic probability rules and the Markov properties assumed for a DGM. Intuitively, the numerator in Eq. (5) includes the distribution of the elements in x that are in one or more separators more than once, and this is corrected for by the product in the denominator. DGMs support several efficient algorithms and are fundamental for the work of this article. In particular, it should be noted that, if (c 1 , . . . , c |C| ) is an ordering of the maximal cliques fulfilling the running intersection property in Eq. (4) and (s 1 , . . . , s |C|−1 ) is the corresponding ordering of the separators, we have

Simulation from discrete DGMs
Consider a discrete DGM p x (x) with respect to a graph G = (V , E). To simulate a realisation from p x (x), a recursive procedure can be adopted, which goes as Continuing in this manner, we ultimately end up with only one variable x l and corresponding marginal distribution p x l (x l ). A realisation x ∼ p x (x) can then be generated by recursively simulating from the series of conditional distributions, in the reverse order as they were computed. Without loss of generality, suppose that the vertex set is V = {1, . . . , n} and that the nodes have been numbered so that nodes are removed in the order from n to 1. This means that we make us of the following factorisation of p x (x): Having computed all the factors in Eq. (7), simulation from p x (x) follows easily by first simulating , and so on. The recursive procedure described above, as well as the factorisation in Eq. (7), is general and holds for any distribution p x (x), not necessarily a discrete DGM. However, for many models, it is inconvenient to factorise p x (x) in this manner, since it can be a complicated task to compute all the factors. If the model is a DGM, however, and a corresponding junction tree J is available, computations can become particularly easy and efficient, as we discuss in the following. First, note that the distribution in Eq. (5) can be expressed as where V c (x c ) in this context is called a potential function for clique c. With the junction tree J given, it is convenient to start the decomposition of p x (x) in a leaf of J . Denote the clique to which the chosen leaf corresponds by c * . Since c * is a leaf of J , there is at least one node i ∈ V which is only present in c * . Suppose, without loss of generality, that the nodes have been numbered so that this is the case for node n, i.e. that node n is only contained in clique c * . We can then easily decompose p x (x) into p x n |x 1:n−1 (x n |x 1:n−1 ) and p x 1:n−1 (x 1:n−1 ) as follows. Since node n is only contained in clique c * , the variable x n only enters the right-hand-side expression in Eq. (8) through the potential function V c * (x c * ). This means that p x n |x 1:n−1 (x n |x 1:n−1 ) can be computed as The other part, p x 1:n−1 (x 1:n−1 ), can be computed, up to a constant of proportionality, by summing out x n from Eq. (8), Using that node n is only contained in clique c * , we can rewrite this expression as That is, we only need to sum over we can rewrite Eq. (11) in the more convenient form It is not necessary to compute the normalising constant in Eq. (12) in order for the remaining computations to proceed. Next, we want to decompose p x 1:n−1 (x 1:n−1 ) into p x n−1 |x 1:n−2 (x n−1 |x 1:n−2 ) and p x 1:n−2 (x 1:n−2 ). For this, consider first the junction tree J V \{n} we obtain after removing node n from c * in J . Removing node n from c * can affect the structure of J V \{n} in two different ways: either J V \{n} has the same number of nodes as J , or it has one node less. To understand why, consider the clique c * \ {n} that we obtain after removing node n from c * . Moreover, letc denote the neighbour of c * in J and let G V \{n} denote the graph obtained by removing node n from G. For the clique c * \ {n}, there are now two possibilities: either it is a subset ofc, i.e. c * \ {n} ⊆c, or it is not a subset of c, i.e. c * \ {n} c. If c * \ {n} c, then c * \ {n} is a maximal clique in the graph G V \{n} , and J V \{n} is essentially the same tree as J except that c * is replaced with c * \ {n}. The clique c * \ {n} then represents a leaf in J V \{n} , and we can decompose p x 1:n−1 (x 1:n−1 ) into p x n−1 |x 1:n−2 (x n−1 |x 1:n−2 ) and p x 1:n−2 (x 1:n−2 ) in the same manner as we decomposed p x (x) above. If, on the other hand, c * \ {n} ⊆c, we must merge c * \ {n} andc before we can proceed. Specifically, this entails that we need to add the potential functions of the two cliques together, We can then rewrite Eq. (12) as After merging the cliques, we can decompose p x 1:n−1 (x 1:n−1 ) in Eq. (13) into p x n−1 |x 1:n−2 (x n−1 |x 1:n−2 ) and p x 1:n−2 (x 1:n−2 ) in the same manner as we decomposed p x (x) into p x n |x 1:n−1 (x n |x 1:n−1 ) and p x 1:n−1 (x 1:n−1 ). Notice, however, that it is possible thatc is not a leaf in J V \{n} . If so, we must move to a clique which does represent a leaf, and decompose p x 1:n−1 (x 1:n−1 ) by removing a node and corresponding variable from this clique. Ultimately, we end up computing p x 1 (x 1 ). A realisation from p x (x) can then be obtained by first simulating , and so on.

Conditional simulation from discrete DGMs
Suppose again that p x (x) is a discrete DGM with respect to a graph G = (V , E), and let J be a junction tree for G. In the previous section, we described how to simulate from p x (x). Now, we address the closely related problem of how to simulate from the conditional distribution p By inserting values for x V \A in Eq. (14), we obtain an expression for p is also a discrete DGM, we can simulate from p x A |x V \A (x A |x V \A ) using the recursive procedure described in Sect. 2.2.4, as this procedure only requires that p is known up to a constant of proportionality. Before starting the computations, however, a new graph G A and corresponding junction tree J A must be constructed for , and the clique potentials for the maximal cliques of G A must be computed. The graph G A is simply obtained by removing the nodes V \ A from V and all edges {i, j} from E where i ∈ V \ A and/or j ∈ V \ A.
As an illustrative example, consider a DGM with respect to the graph in Fig. 2a. Suppose values for x 3 and x 4 are given and that we want to simulate from the condi- For this toy example, we have A = {1, 2, 5, 6} and V \ A = {3, 4}. The graph G A is shown in Fig. 3a and the junction tree J A is shown in Fig. 3b. The graph G A only has two maximal cliques, {1, 2} and {5, 6}, and the separator is simply the empty set ∅. The where now x 3 and x 4 are constant values. With G A , J A and these potential functions given, we can simulate from using the procedure described in Sect. 2.2.4.

Updating framework for categorical vectors
In this section we present our framework for the ensemble updating of categorical state vectors. The framework is a modified version of what is presented in Loe and Tjelmeland (2021b). Our focus is all the time on the updating of one specific ensemble member, number i say. We start in Sect. 3.1 by formulating an assumed Bayesian model for all quantities involved in the updating of ensemble number i. Next, in Sect. 3.2.1, we characterise a class of updating procedures that is correct under this assumed Bayesian model. This class of updating procedures is computationally hard to work with, so our next step, in Sect. 3.2.2, is to formulate a class of updating procedures that is only approximately correct under the assumed Bayesian model, but which is computationally simpler to work with. The last step in our framework, which we discuss in Sect. 3.3, is to define a criterion for identifying, within the class of updating procedures that are (approximately) correct under the assumed model, an updating procedure that is robust against the assumptions made in the assumed Bayesian model. In Sect. 4 we develop the computational details of the resulting updating procedure when the assumed Bayesian model is based on a ν'th order Markov chain model. Fig. 4 Graphical illustration of the assumed Bayesian model for the updating of x t,(i) to x t,(i)

Assumed Bayesian model
A graphical illustration of our assumed Bayesian model for the variables involved in the updating of the forecast sample x t,(i) to a filtering sample x t,(i) is shown in Fig. 4.
The model includes an unknown parameter vector θ t ∈ θ for which a prior model f θ t (θ t ) is adopted. Moreover, the latent state vector x t and the prior samples x t,(1) , . . . , x t, (M) are all assumed to be conditionally independent and identically distributed given θ t , i.e.
where f x t |θ t (x t |θ t ) is an assumed prior model for x t |θ t . The observation y t is assumed to be conditionally independent of θ t and x t,(1) , . . . , x t,(M) given x t , and distributed according to an assumed likelihood model f y t |x t (y t |x t ). Given x t,(i) , θ t and y t , the posterior realisation x t,(i) is conditionally independent of x t,(1) , . . . , x t,(i−1) , x t,(i+1) , . . . , x t,(M) and x t . For simplicity, we denote in the following the set of prior samples except the sample x t,(i) by x t,−(i) , i.e.
Conceptually, the assumed models f x t |θ t (x t |θ t ) and f y t |x t (y t |x t ) can be any parametric distributions. In order for the framework to be useful in practice, however, they must be chosen so that the corresponding posterior model is tractable. Moreover, f θ t (θ t ) should be chosen as conjugate for f x t |θ t (x t |θ t ).

Class of updating distributions
Based on the assumed Bayesian model defined above we characterise in this section a class of updating procedures for generating the filtering ensemble element x t,(i) from forecast ensemble member x t, (i)

Derivation of a class of updating distributions
A natural minimal restriction for the updating of x t,(i) to x t,(i) is to require that the procedure is consistent with the assumed model. One can then say that the updating is correct under the assumed model. In addition one would naturally also like the updating to be robust against the assumptions made in the assumed Bayesian model, but this is not the main focus in this section. A naïve updating procedure that is consistent with the assumed model is simply to set x t,(i) equal to a sample from f x t |x t,(1) ,...,x t,(M) ,y t (·|x t,(1) , . . . , x t,(M) , y t ). This procedure may, however, be very sensitive to the assumptions of the assumed model. To get a more robust updating procedure, a better alternative is to generate x t,(i) as a modified version of x t,(i) , as indicated by the graph in Fig. 4. In such a setup, the role of x t,(i) is as a source of randomness in the generation of x t,(i) . One should therefore remove x t,(i) from the conditioning set in the naïve updating procedure and instead require that x t,(i) is a sample from f x t |x t,−(i) ,y t (·|x t,−(i) , y t ) under the assumed model. Thus, the updating of x t,(i) to x t,(i) should be such that for all x t , x t,−(i) and y t . In the following we study what implications the restriction in Eq. (16) have on how to generate x t,(i) from x t,(i) . Introducing the parameter vector θ t , the distribution on the left hand side of Eq. (16) is obtained by marginalising out θ t from the joint distribution Rewriting the distribution on the right hand side of Eq. (16) in a similar way it follows that Eq. (16) can be rewritten as Writing each of the joint distributions in these two integrands as a product of the marginal distribution for θ t and the conditional distribution for the other variable given θ t , the restriction reads A sufficient condition for this relation to hold is that the two integrands are equal for each θ t . From this it follows that a sufficient condition for Eq. (18) to hold is that for all x t , θ t and y t . Thereby, we understand that x t,(i) can be updated by first simulating and thereafter simulate where

θ t,(i) , y t ) is a distribution which fulfils Eq. (19). Generally, a class of updating distributions f x t,(i) |x t,(i) ,θ t,(i) ,y t ( x t,(i) |x t,(i) , θ t,(i) , y t )
consistent with the requirement in Eq. (19) exists. The simplest option is to use the assumed posterior model f x t |θ t,(i) ,y t (x t |θ t,(i) , y t ) and simulate x t,(i) independently of x t,(i) . However, this means that we possibly lose valuable information from x t,(i) about the true forecast and filtering distributions that we may not have been able to capture with the assumed model. To preserve more of this information from x t,(i) , it is important to simulate x t,(i) conditionally on x t,(i) . Conceptually, an updating distribution f x t,(i) |x t,(i) ,θ t,(i) ,y t ( x t,(i) |x t,(i) , θ t,(i) , y t ) can be constructed by first constructing a joint distribution f x t,(i) , x t,(i) |θ t,(i) ,y t (x t,(i) , x t,(i) | θ t,(i) , y t ), and thereafter condition this distribution on x t, (i) . The joint distribution f x t,(i) , x t,(i) |θ t,(i) ,y t (x t,(i) , x t,(i) |θ t,(i) , y t ) can be factorised as To be consistent with the requirement in Eq. (19), that is, when marginalising out x t,(i) we end up with the assumed posterior model. To be consistent with the model assumptions, and so that the factorised form in Eq. (20) holds, the distribution must also fulfil that is, when marginalising out x t,(i) we end up with the assumed prior model. In principle, infinitely many distributions f x t,(i) , x t,(i) |θ t,(i) ,y t (x t,(i) , x t,(i) |θ t,(i) , y t ) consistent with the requirements in Eqs. (21) and (22) exist. In practice, however, it is generally difficult to assess one of these distributions, except the naïve solution where f x t,(i) |x t,(i) ,θ t,(i) ,y t (·|x t,(i) , θ t,(i) , y t ) is set equal to the assumed posterior model f x t |θ t,(i) ,y t (·|θ t,(i) , y t ). Therefore, we must resort to approximations, which we consider in more detail below.

A class of approximate updating distributions
In this section we propose an approximation to

t ), also the approximation q(x t,(i) , x t,(i) |θ t,(i) , y t ) defines a joint distribution for x t,(i) and x t,(i) for given values of θ t,(i) and y t . However, whereas f x t,(i) , x t,(i) |θ t,(i) ,y t (x t,(i) , x t,(i) |θ t,(i) , y t ) defines a conditional distribution for the two variables consistent with the assumed Bayesian model, we do not require this for q(x t,(i) , x t(i) |θ t,(i) , t t ).
We define a class of allowed distributions q(x t,(i) , x t(i) |θ t,(i) , t t ) by replacing the restrictions in Eqs. (21) and (22) with two weaker restrictions, which we detail in the following. Let q(x t,(i) , x t,(i) |θ t,(i) , y t ) be a DGM with respect to a graph G with vertex set V = {1, . . . , 2n} and maximal clique set C = {c 1 , . . . , c |C| } where |C| is the number of maximal cliques. Associate the n variables of x t,(i) with the nodes 1, . . . , n and the n variables of x t,(i) with the nodes n + 1, . . . , 2n, so that, for j = 1, . . . , n, the variable x t,(i) j is associated with node j and the variable x t,(i) j is associated with node j + n. Next, let A 1 , . . . , A |C| , B 1 , . . . , B |C| denote a sequence of subsets of V 1:n = {1, . . . , n} such that the nodes of V that are associated with (x t,(i) and Thereby, q(x t,(i)

we know that since q(x t,(i) , x t,(i) |θ t,(i) , y t ) is a DGM it is fully specified by its clique probabilities and can be expressed as
Hence, in order to specify q(x t,(i) , x t,(i) |θ t,(i) , y t ), all we need to do is to appropriately specify each of the clique probabilities q(x t,(i) Recall that the goal is to specify q(x t,(i) , x t,(i) |θ t,(i) , y t ) such that it approximately represents the joint distribution in Eq. (20) subject to the constraints in Eqs. (21) and (22). To construct such a q(x t,(i) , x t,(i) |θ t,(i) , y t ), we replace the requirements in Eqs. (21) and (22) by (26) respectively. That is, instead of requiring that q(x t,(i) , x t,(i) |θ t,(i) , y t ) fully preserves f x t |θ t (x t |θ t ) and f x t |θ t (x t |θ t , y t ), as required in Eqs. (21) and (22), we only require that the marginal distributions |θ t ,y t (x t B j |θ t , y t ) are preserved. Another constraint we need to take into account when specifying q(x t,(i) , x t,(i) |θ t,(i) , y t ) is that the clique probabilities must be consistent in the sense that, if we let (c 1 , . . . , c |C| ) denote an ordering of the cliques in C which fulfils the running intersection property in Eq. (4), the probabilities for two consecutive cliques c j and c j+1 must return the same marginal distribution for the separator s j = c j ∩ c j+1 . Mathematically, this can be written as Assuming we are able to construct a DGM q(x t,(i) , x t,(i) |θ t,(i) , y t ) consistent with the requirements discussed above, we can condition this DGM on x t,(i) and simulate x t,(i) |x t,(i) as described in Sect. 2.2.5.

and maximises
Eq. (28) c) Simulate as described in Section 2.2.5 end

Defining an optimal solution
There may be infinitely many distributions q(x t,(i) , x t,(i) |θ t,(i) , y t ) which fulfil the requirements in Eqs. (25)-(27). In this section we formulate a criterion that can be used to identify an update distribution within the class defined above, that is robust with respect to the assumed Bayesian model. To preserve as much information from x t,(i) as possible in x t,(i) , we define the optimal q(x t,(i) , x t,(i) |θ t,(i) , y t ) as the one which maximises the expected number of elements in x t,(i) that remain unchanged, i.e. the q(x t,(i) , x t,(i) |θ t,(i) , y t ) that maximises where the expectation is taken over q(x t,(i) , x t,(i) |θ t,(i) , y t ). Intuitively, this is a reasonable optimality criterion for categorical variables which should make the update robust. By making minimal changes to x t,(i) , the updated sample x t,(i) may be able to capture properties of the true filtering distribution p(x t |y 1:t ) that we may not have captured with the posterior distribution f (x t |θ t , y t ) resulting from the assumed Bayesian model. The key steps of the resulting updating procedure are summarised in Algorithm 1.

Updating procedure using a Markov chain assumed prior
Using the updating framework introduced above, we in this section develop the resulting updating procedure when the assumed Bayesian model is based on a ν'th order Markov chain model. In particular we propose to choose the maximal cliques of the DGM q(x t,(i) , x t,(i) |θ t , y t ) in such a way that the optimal solution can be computed by solving a linear optimisation problem.

Model specifications
Assuming the vector x t to have a one-dimensional spatial arrangement, we choose the assumed prior distribution f x t |θ t (x t |θ t ) to be a Markov chain of order ν ≥ 1, For the likelihood model f y t |x t (y t |x t ), we assume that y t = (y t 1 , . . . , y t n ) contains n conditionally independent observations, with y t j depending only on x t j , This choice of assumed prior and likelihood yields a posterior model f x t |θ t ,y t (x t |θ t , y t ) which is also a Markov chain of order ν. The initial and transition probabilities of this posterior Markov chain can be computed efficiently with the forward filteringbackward smoothing algorithm for hidden Markov models (e.g., Künsch, 2000).
The parameter vector θ t should define the initial and transition probabilities of the assumed prior Markov chain. A Markov chain of order ν is specified by n − ν + 1 transition matrices, each matrix consisting of K ν rows and K columns. Denote in the following these transition matrices by θ t 1 , . . . , θ t n−ν+1 . Furthermore, let θ 0 be a vector representing the initial probabilities of the Markov chain, and consider θ t = (θ t 0 , θ t 1 , . . . , θ t n−ν+1 ).
Following the recommendations of Section 3.1, we choose f θ t (θ t ) as conjugate for f x t |θ t (x t |θ t ). Here, this entails adopting a Dirichlet distribution for θ t 0 and a Dirichlet distribution for each of the K ν row vectors in each transition matrix θ t j , j = 1, . . . , n − ν + 1, and to let all these Dirichlet distributed parameters be a priori independent. For simplicity, the remaining technical details of the specification of f θ t (θ t ) are presented in Appendix A, whereas how to simulate from

Class of updating distributions
Having specified the distributions f θ t (θ t ), f x t |θ t (x t |θ t ) and f y t |x t (y t |x t ) of the assumed Bayesian model, the next task is to characterise the class of DGMs q(x t,(i) , x t,(i) |θ t , y t ) introduced in Sect. 3.2. For this, we need to specify the cliques of the underlying decomposable graph of q(x t,(i) , x t,(i) |θ t , y t ) or, equivalently, the A j and B j -sets in Eqs. (23) and (24). For some integer d ≥ 1, the A j and B j -sets are specified as for j = 1, . . . , n − d + 1. Visually, the decomposable graph G can then be represented as a two-dimensional grid with two rows and n columns, or as a 2 × n matrix. The first row is associated with the nodes 1, . . . , n and the second row is associated with the nodes n + 1, . . . , 2n. Each maximal clique is formed by d consecutive columns, hence we call it a 2 × d clique. The variables associated with each 2 × d clique are  Figures 5a and 6a illustrate G when d = 2 and d = 3, respectively, when the state vector x t contains n = 5 variables. Figures 5b and 6b show corresponding junction tree representations. The chosen graphical structure makes it fairly trivial to construct corresponding junction trees. j = 1, . . . , n − d + 1. The objective function is to be maximised subject to the constraints in Eqs. (31) to (33), which are also linear functions of q(x t,(i) Because q(x t,(i) , x t,(i) |θ t,(i) , y t ) is a probability distribution, we must also include the constraint that q(x t,(i) A j , x t,(i) A j |θ t,(i) , y t ) sums to one, and that it can only take values between zero and one, These constraints are also linear functions of q(x t,(i) Thus, we have a linear optimisation problem, or a linear program, which can be efficiently solved with standard linear programming techniques.

Simulation example
In this section, the updating procedure described in Sect. 4 is demonstrated in a simulation example. The example involves a filtering problem where the unobserved Markov process {x t } T t=1 consists of T = 100 time steps, the dimension n of x t is n = 200, and there are three classes for each element x t j of x t : 0, 1, and 2.

Experimental setup
Our simulation experiment is a modified version of the binary simulation scheme in Loe and Tjelmeland (2021b). As for the example in that article, we let us inspire from the process when water comes through to an oil producing well in a petroleum reservoir. It should be stressed, however, that we do not claim our setup to be a realistic description for this process. The t in x t j represents time and j the location in the well, with j = 1 being at the top of the well and j = n at the bottom. We let the events x t j = 0 and x t j = 1 represent the presence of porous sand stone filled with oil and water, respectively, in location j of the well at time t, while the event x t j = 2 represents non-porous shale in the same location. One should note that the spatial distribution of sand stone and shale does not change with time, whereas the fluid in a sand stone may change.
We start by simulating a reference, or true, sequence of states, x 1 , . . . , x T . We simulate these as a Markov chain with an initial distribution for x 1 and a forward model for generating x 2 . . . , x T . The precise initial and forward models we are using are defined in Appendix C. The simulated reference states are shown in Fig. 7a.
The horizontal and vertical axes in the figure is time and depth, respectively. The yellow is shale and the green and blue are sand stone filled with water and oil, respec- tively. The initial and forward models are constructed so that at time t = 0 all the sand stone is filled with oil, whereas at time t = T most of the sand stone is filled with water.
Given the reference states x 1 , . . . , x T shown in Fig. 7a we simulate corresponding observations y 1 , . . . , y T . At each time t we assume the elements of y t to be conditionally independent given x t . To avoid that the likelihood effectively induces an ordering of the three possible values for each x t j , we let y j t be a vector of two conditionally independent normally distributed components, and choose the mean values and variances of these normal distributions to obtain a likelihood that has full symmetry between the three possible values of x t j . A detailed description of the likelihood function used is given Appendix C. Images of the two components of the simulated y 1 , . . . , y T are shown in Fig. 7b and c. As in Fig. 7a, the horizontal and vertical axes represent in these figures time and depth, respectively, whereas the colours represents real values as shown by the associated colour bars.
Having generated values for y 1 , . . . , y T we consider these as observed values and the underlying reference process as unknown. We then run the proposed filtering procedure proposed above. To generate the initial ensemble {x 1,(1) , . . . , x 1,(M) } we generate independent samples from the same distribution we used to generate the reference state at time t = 0. Moreover, in the filtering procedure we use the same forward model as we used when generating the reference state at times t > 0, and same likelihood function as we used to generate y 1 , . . . , y T from the reference states.
When running the proposed updating procedure, we need to set a value for ν, i.e. the order of the assumed Markov chain model f x t |θ t (x t |θ t ), and a value for the integer d in Eq. (30) which determines the structure of q(x t,(i) , x t,(i) |θ t , y t ). High values for ν and d, and high values for d especially, make the construction of q(x t,(i) , x t,(i) |θ t , y t ) Table 1 Results from simulation experiment: proportion of correctly classified variables x t j obtained with the MAP estimates in Eq. (37) computed in five independent runs computer-demanding. Below, we investigate the two values ν = 1 and ν = 2, and for each of these we consider the three values d = 1, d = 2 and d = 3. Thereby, we have six combinations, or cases, for (ν, d). For each of these six cases, we perform five independent runs, using ensemble size M = 20. The hyper-parameters for θ t (cf. Appendix A) at each time step t are all set equal to one, and 500 iterations are used in the MCMC simulation of θ t,(i) |x t,−(i) , y t (cf. Appendix B).

Results
To evaluate the performance of the proposed approach, we first compute, for each of the five runs of each of the six combinations of (ν, d), the maximum a posteriori probability (MAP) estimatex j t of x t j , t = 1, . . . , T , j = 1, . . . , n, is an estimate of the marginal filtering probability p x t j |y 1:t (k|y 1:t ). Figure 8 shows images of the computed MAP estimates {x t j , j = 1, . . . , n} T t=1 from one of the five runs performed for each of the six cases. From a visual inspection, it seems that we in all cases manage to capture the main characteristics of the true x t -process in Fig. 7a, but the MAPs shown in Fig. 8a and d, which are obtained using d = 1, are possibly a bit noisier than the others. Table 1 lists the ratio of correctly classified variables x t j based on the MAPs obtained from the five independent runs of each case.
According to Table 1 we classify around 85-90% of the variables correctly, and we obtain the best results when using the combinations ν = 1, d = 2 and ν = 1, d = 3. This may suggest that adopting a first-order Markov chain (i.e., ν = 1) for f x t |θ t (x t |θ t ) and using 2×2-or 2×3-cliques (i.e., d = 2 or d = 3) in the construction of q(x t,(i) , x t,(i) |θ t,(i) , y t ) is a robust strategy.
To further investigate the performance of the proposed approach, we estimate for each j and t the probability that x t,(i) j is equal to the true value x t j , and we do this for while if x t j = 1, we compute and if x t j = 2, we compute There are, in the latent x t -process shown in Fig. 7a, 11929 variables x t j taking the value 0, 7271 variables taking the value 1 and 800 variables taking the value 2. Thereby, since we run each of the six (ν, d) combinations five times, we obtain for each (ν, d) combination 5 · 11929 samples of π 0|0 , 5 · 7271 samples of π 1|1 and 5 · 800 samples of π 2|2 . We denote the corresponding sample means byπ 0|0 ,π 1|1 andπ 2|2 , and we let π = 1 3 π 0|0 +π 1|1 +π 2|2 . Figure 9 presents histograms constructed from the samples of π 0|0 , π 1|1 and π 2|2 for each case, and Table 2 summarises the corresponding computed values forπ 0|0 ,π 1|1 ,π 2|2 andπ . The values forπ indicate that, again, we obtain the best results using ν = 1, d = 2 and ν = 1, d = 3. Computationally, using d = 3 is more demanding, and since the improvement it offers over d = 2 is only minor, the best approach may be to use ν = 1, d = 2.

Closing remarks
An ensemble updating method for categorical state vectors is proposed. The proposed procedure is an improved version of the updating procedure for categorical vectors described in Loe and Tjelmeland (2021b). What is new is mainly in how the optimal solution of q(x t,(i) , x t,(i) |θ t , y t ) is computed. Loe and Tjelmeland (2021b) construct the conditional distribution q( x t,(i) |x t,(i) , θ t , y t ) directly based on a directed acyclic graph (DAG) for q(x t,(i) , x t,(i) |θ t , y t ). The chosen structure of the DAG allows the optimal solution of q( x t,(i) |x t,(i) , θ t , y t ) to be computed recursively using a combination of dynamic and linear programming. This strategy works well when the elements of x t are binary, but the algorithm is difficult to generalise to situations with more than two classes. Moreover, only one particular dependency structure for q( x t,(i) |x t,(i) , θ t , y t ) is considered, and it is difficult, or essentially impossible, to generalise the algorithm to cope with more complicated structures allowing for higherorder interactions. In the present article, we take a different approach which in the end results in a more flexible and general procedure. Instead of a DAG, the starting point is an undirected graphical model for q(x t,(i) , x t,(i) |θ t , y t ), specifically a DGM. When specifying the maximal cliques of this DGM one has a certain level of flexibility, which makes it possible to study different dependency structures between x t,(i) and x t, (i) . The optimal solution is computed by solving a single linear program which is both easier to implement and computationally more efficient than the dynamic programming algorithm proposed in Loe and Tjelmeland (2021b). Moreover, we can easily consider more than two classes. The proposed procedure is demonstrated in a simulation example with three classes, and the results look promising. Fig. 9 Results from the simulation experiment: Histograms of π 0|0 (left), π 1|1 (middle) and π 2|2 (right) In Sect. 3.2, we introduced an exact and an approximate class of distributions for the updating of the prior sample x t,(i) . Although it may seem disadvantageous to pursue an approximate approach over an exact one, we believe that in this case the approximate approach actually provides better results. The constraints of the approximate approach are less restrictive and allows the optimality criterion to affect the solution to a larger extent, which may result in an optimal updating distribution which is more robust against the assumptions of the assumed Bayesian model. That is, even if the assumed Markov chain model is far from the truth, the optimal model q(x t,(i) , x t,(i) |θ t,(i) , y t ) still provides reasonably good results.
Future work naturally includes to extend the proposed procedure to two dimensions. Assuming x t is defined on a two-dimensional grid, a possible choice of model for f x t |θ t (x t |θ t ) is then a Markov mesh model (Abend et al. 1965). However, when it comes to the construction of the DGM q(x t,(i) , x t,(i) |θ t , y t ), the two-dimensional situation causes trouble, and we probably need to introduce some sort of approximations to overcome this.

B Parameter simulation
In this appendix we explain how to simulate from θ t, is as specified in Appendix A and f x t |θ t (x t |θ t ) and f y t |x t (y t |x t ) constitute a finite state-space hidden Markov model as specified in Sect. 4.1. Generally, one can simulate θ t,(i) |x t,−(i) , y t by introducing x t as an auxiliary variable and construct a Gibbs sampler which simulates (x t , θ t ) from the joint distribution f x t ,θ t |x t,−(i) ,y t (x t , θ t |x t,−(i) , y t ) ∝ f θ t (θ t ) f x t |θ t (x t |θ t ) f y t |x t (y t |x t ) j =i f x t |θ t (x t,( j) |θ t ) by alternating between drawing x t from the full conditional distribution f x t |θ t ,x t,−(i) ,y t (x t |θ t , x t,−(i) , y t ) and θ t from the full conditional distribution f θ t |x t ,x t,−(i) ,y t (θ t |x t , x t,−(i) , y t ). From the dependency structure of the assumed Bayesian model, see Fig. 4, it follows that and f θ t |x t ,x t,−(i) ,y t (θ t |x t , x t,−(i) , y t ) = f θ t |x t ,x t,−(i) (θ t |x t , x t,−(i) ).
Thereby, we see that both of the full conditional distributions of the Gibbs sampler are tractable, so the Gibbs sampler can be implemented without difficulties. When f θ t (θ t ) is as specified in Appendix A and f x t |θ t (x t |θ t ) and f y t |x t (y t |x t ) constitute a HMM as specified in Sect. 4.1, the distribution in Eq. (39) becomes a first-order Markov chain whose initial and transition probabilities can be computed with the smoothing recursions for HMMs (Künsch 2000). For the distribution in Eq.
(40), it can easily be shown that θ t and p x t n |x t n−1 ,x t−1 (x t n |x t n−1 , x t−1 ) = p x t n |x t n−1 ,x t−1 n−1 ,x t−1 n (x t n |x t n−1 , x t−1 n−1 , x t−1 n ). (44) In the following, we first discuss the specification of Eq. (42). To obtain a model where the spatial distribution of sand stone and shale does not change in time we set for all x t j−1 , x t−1 j−1 , x t−1 j+1 ∈ {0, 1, 2}, and For the remaining probabilities in Eq. (42), we adopt the same values as used in Loe and Tjelmeland (2021b), see Table 3. The reasoning behind these probabilities is that if x t−1 j = 1 the probability for having x t j = 1 should be high, and in particular this probability should be high if also x t j−1 = 1. If x t−1 j = 0 the probability for having also x t j = 0 should be high unless x t j−1 = x t−1 j−1 = x t−1 j+1 = 1. The probabilities in Eqs. (43) and (44) we simply define from the values set for the probabilities in Eq. (42) by defining the values lying outside the simulated lattice to be zero. For x 1 we define that all the elements should be equal to 0 or 2, and assume the elements to be independent with p x 1 j (x 1 j = 2) = 1/40 and p x 1 j (x 1 j = 0) = 1 − 1/40. This results in a vector x 1 with a few (typically one node thick) layers of shales, with the remaining elements being oil filled sand stone. One realisation from the specified Markov process for {x t } T t=1 is shown in Fig. 7a. This realisation is also used as the reference process, and thereby to simulate the observations used in the simulation example.
For the likelihood f y t |x t (y t |x t ), we know from Sect. 4 that it is sufficient to specify f y t j |x t j (y t j |x t j ) since the elements of y t are assumed to be conditionally independent given x t , with y t j only depending on x t j . To avoid that the likelihood induces an ordering of the three possible values of x t j , we let y t j be a vector with two components, y t j = (y t j,1 , y t j,2 ), and choose f y t j |x t j (y t j |x t j ) as a bivariate Gaussian distribution N (y t j ; μ(x t j ), ) with a mean vector and covariance matrix = σ 2 I . As illustrated in Figure 10, the mean vectors μ(0), μ(1) and μ(2) are chosen to lie at the vertices of an equilateral triangle with unit sides. This is to avoid an ordering of the three classes. We assume in this simulation experiment that the true likelihood model p y t |x t (y t |x t ) and the assumed likelihood model f y t |x t (y t |x t ) are equal. As such, the assumed likelihood model f y t |x t (y t |x t ) is used to generate the observation process {y t } T t=1 . Specifically, using the simulated Markov process shown in Fig. 7a and setting σ = 1.0, we generate {y t } T t=1 by simulating, independently for each j = 1, . . . , 200 and t = 1, . . . , 100, y t j ∼ f y t j |x t j (·|x t j ). An image of {(y t j,1 , j = 1, . . . , n)} T t=1 is shown in Fig. 7b and an image of {(y t j,2 , j = 1, . . . , n)} T t=1 is shown in Fig. 7c.