Data assimilation for a quasi-geostrophic model with circulation-preserving stochastic transport noise

This paper contains the latest installment of the authors' project on developing ensemble based data assimilation methodology for high dimensional fluid dynamics models. The algorithm presented here is a particle filter that combines model reduction, tempering, jittering, and nudging. The methodology is tested on a two-layer quasi-geostrophic model for a $\beta$-plane channel flow with $O(10^6)$ degrees of freedom out of which only a minute fraction are noisily observed. The model is reduced by following the stochastic variational approach for geophysical fluid dynamics introduced in Holm (Proc Roy Soc A, 2015) as a framework for deriving stochastic parametrisations for unresolved scales. The reduction is substantial: the computations are done only for $O(10^4)$ degrees of freedom. We introduce a stochastic time-stepping scheme for the two-layer model and prove its consistency in time. Then, we analyze the effect of the different procedures (tempering combined with jittering and nudging) on the performance of the data assimilation procedure using the reduced model, as well as how the dimension of the observational data (the number of"weather stations") and the data assimilation step affect the accuracy and uncertainty of the results.


Introduction
In recent years, there has been an increased scientific effort in developing ensemble based data assimilation as an alternative to variational data assimilation which is currently used in operation centres for numerical weather prediction 1 . Such methods can be more suited for fully nonlinear systems and complex observation operators. The work presented in this paper is part of this wider effort (see the survey paper [22] and the references therein for recent developments in this direction).
The cornerstone of the current work is the introduction of stochastic parametrization to model uncertainity via the so-called Stochastic Advection by Lie Transport (SALT) approach [11]. The stochasticity is introduced into the advection part of the dynamics via a constrained variational principle. This is a general approach for deriving stochastic partial differential equation (SPDE) models for geophysical fluid dynamics (GFD). In this work we apply it to the N-layer quasi-geostrophic model (see Section 2 for details). By adding stochasticity into the advection operator, one can model uncertain transport behaviour. The uncertainty in our case occurs as we assimilate the data coming from observing a high resolution model, but use low resolution realisations of the model. This model reduction is crucial as it enables us to complete the task by using fairly modest computational resources 2 . The stochastic term used to model the missing uncertainties is calibrated by using a data driven approach as described in [5].
This paper complements the work done in [6] where the "true state" is chosen to be the solution of the Euler equation with forcing and damping. We choose the quasi-geostrophic (QG) equations for this work as it has a qualitatively different behaviour from Euler: As one can see in Fig. 1, the solution of the QG equations is far less homogeneous than the that of the Euler equation, exhibiting multiple large-scale zonally elongated jets as well as small-scale vortices. Indeed, one of the findings of our work is that the formation of jets is heavily influenced by the size of the grid: the coarser the grid the less jets are formed. Nevertheless, once data assimmilation is applied to the coarser model (stochastically parametrized and properly calibrated), the number of jets can be preserved. The occurrence of the jets makes the data assimilation problem harder. Whilst in [6] it sufficed to use only tempering and jittering to assimilate the data, in the current work we obtained far better results only after we added the nudging procedure to the two already used in [6]. Another difference from the work done in [6] was the choice of the initial ensemble, which here chosen as a set of independent realizations from the solution of the stochastically perturbed QG equation. We found this to be a more natural alternative to the one used in [6].
The use of the combination of tempering and jittering is theoretically justified. Indeed in [4] it is shown that the use of the two procedures can produce particle filters suitable for solving high dimensional problems. More precisely, it is proved that the effective sample size of the ensemble of particles remains under control as the dimension d of the underlying system increases with a computational cost that is at most quadratic in d. By contrast, a generic (bootstrap) particle filter would require a computational cost that is exponential in d.
As is usually the case in data assimilation, the particle filter proceeds by alternating between forecast and analysis cycles. In each analysis step, observations of the current (and possibly past) state of a system are combined with the results from a prediction model (the forecast) to produce an analysis. The tempering and jittering are used to complete the analysis step, whilst the nudging procedure is used in the forecast step. In the absence of nudging, the ensemble particles have trajectories that are independent solutions of the stochastic QG equations. Nudging consists in adding a drift to the trajectories of the particles with the aim of maximising the likelihood of their positions given the observation data. This introduces a bias in the system that is corrected at the analysis step. It follows that also the nudging procedure is theoretically justified through a standard convergence argument, see for example [8]. It follows that the data assimilation algorithm presented in this paper will give an asymptotically (as the number of particles increases) consistent approximation of the posterior distribution of the state given the data. That does not mean that the empirical distribution of the ensemble is a good approximation of the posterior. The size of the ensemble is 100 and this is certainly not enough to approximate a posterior distribution in a state space of dimension O(10 4 ). However, it offers a sound theoretical basis for the algorithms presented here. We give further details of this issue in Section 5.
The paper is structured as follows. In Section 2 we describe the deterministic N-layer QG equations and Hamiltonian formulation for the stochastic multi-layer QG model. Section 3 presents the deterministic and stochastic QG equations, and numerical methods for the QG model. In Section 4 we prove that the stochastic CABARET scheme is consistent with the stochastic QG equation in the mean square sense in time. In Section 5 we discuss different procedures used: Bootstrap Particle Filter, jittering, tempering and nudging procedures. In Section 6 we present and discuss the numerical experiments and results, and study how the data assimilation methods influence the quality of the forecast given by the stochastic QG model. The following is a summary of the main numerical experiments contained in this paper: -Dependence of the relative bias and the ensemble mean l 2 -norm relative error between the true deterministic solution and its stochastic parameterisation on the data assimilation step (Figs. 4 and 6), grid resolution ( Fig. 8), data assimilation methods (Figs. 10 and 12), and the number of weather stations (Figs. 14 and 15); -Analysis of how uncertainty of the stochastic spread is influenced by the data assimilation step (Figs. 5 and 7), grid resolution ( Fig. 9), data assimilation methods (Figs. 11 and 13), and the number of weather stations (Figs. 16 and 17); -Forecast reliability rank histrograms for the stochastic QG model with and without the data assimilation procedure (Fig. 18).
Finally, Section 7 concludes the present work and discusses the outlook for future research. Consider a stratified fluid of N superimposed layers of constant densities ρ 1 < · · · < ρ N ; the layers being stacked according to increasing density, such that the density of the upper layer is ρ 1 . The quasi-geostrophic (QG) approximation assumes that the velocity field is constant in the vertical direction and that in the horizontal direction the motion obeys a system of coupled incompressible shallow water equations. We shall denote by u i = (− ∂ y ψ i , ∂ x ψ i ) =ẑ × ∇ψ i the velocity field of the i th layer, where ψ i is its stream function, and the layers are numbered from the top to the bottom. We define the potential vorticity of the i th layer as where the potential vorticity is defined as ω i = q i + f i , the elliptic operator E ij defines the layer vorticity, where g is the gravitational acceleration, ρ 0 = (1/N )(ρ 1 +· · ·+ρ N ) is the mean density, D i is the mean thickness of the i th layer, R is the Earth's radius, Ω is the Earth's angular velocity, φ 0 is the reference latitude, and d(y) is the shape of the bottom. The N × N symmetric tri-diagonal matrix T ij represents the second-order difference operator, so that With these standard notations, the motion of the NLQG fluid is given by whereẑ is the vertical unit vector, u i =ẑ × ∇ψ i is the horizontal flow velocity in the i th layer, and the brackets in denote the usual xy canonical Poisson bracket in R 2 . The boundary conditions in a compact domain D ⊂ R 2 with smooth boundary ∪ j ∂D j are ψ j | (∂Dj ) = constant, whereas in the entire R 2 they are lim (x,y)→∞ ∇ψ j = 0. The space of variables with canonical Poisson bracket in (6) consists of N -tuples (q 1 , . . . , q N ) of real-valued functions on D (the "generalized vorticities") with the above boundary conditions and certain smoothness properties that guarantee that solutions are at least of class C 1 . The Hamiltonian for the N -layer vorticity dynamics in (5) is the total energy with stream function ψ i determined from vorticity ω i by solving the elliptic equation (1) for for the boundary conditions discussed above. Hence, we find that where E −1 ij * q j = ψ i denotes convolution with the Greens function E −1 ij for the symmetric elliptic operator E ij . The relation (9) means that δH/δq i = ψ i for the variational derivative of the Hamiltonian functional H with respect to the function q j .
Remark 1 (Lie-Poisson bracket) Equations (5) are Hamiltonian with respect to the Lie-Poisson bracket on the for arbitrary functions F and H, provided the domain of flow D is simply connected. 3 The motion equations (5) for q i now follow from the Lie-Poisson bracket (10) after an integration by parts to write it equivalently as and recalling that δH/δq i = −E −1 ij * q j = − ψ i , i = 1, 2, . . . , N , so that equations (5) follow.
Remark 2 (Constants of motion) According to equations (5), the material time derivative of ω i (t, x, y) vanishes along the flow lines of the divergence-free horizontal velocity u i =ẑ×∇ψ i . Consequently, for every differentiable is a conserved quantity for the system (5) for i = 1, . . . , N , provided the integrals exist. By Kelvin's circulation theorem, the following integrals over an advected domain S(t) in the plane are also conserved, wheren is the horizontal outward unit normal and ds is the arclength parameter of the closed curve ∂S(t) bounding the domain S(t) moving with the flow.

Hamiltonian formulation for the stochastic NLQG fluid
Having understood the geometric structure (Lie-Poisson bracket, constants of motion and Kelvin circulation theorem) for the deterministic case, we can introduce the stochastic versions of equations (5) by simply making the Hamiltonian stochastic while preserving the previous geometric structure, as done in the previous section. Namely, we choose so that where the ζ k i (x, y), k = 1, . . . , K represent the correlations of the Stratonovich noise we have introduced in (15). For this stochastic Hamiltonian, the Lie-Poisson bracket (10) leads to the following stochastic process for the transport of the N -layer generalised vortices, where we have defined the stochastic transport velocity in the i th layer in terms of its stochastic stream function determined from the variational derivative of the stochastic Hamiltonian in (15) with respect to the generalised vorticity q i in the i th layer.

Remark 3 (Constants of motion)
The constants of motion C Φi in (12) and the Kelvin circulation theorem for the integrals I i in (13) persist for the stochastic generalised vorticity equations in (16). This is because both of these properties follow from the Lie-Poisson bracket in (10). However, the stochastic Hamiltonian in (15) is not conserved, since it depends explicitly on time, t, through its Stratonovich noise term.
3 The two-dimensional multilayer quasi-geostrophic model

Deterministic case
The two-layer deterministic QG equations for the potential vorticity (PV) anomaly q in a domain Ω are given by the PV material conservation law augmented with forcing and dissipation [16,21]: where ψ is the stream function, β is the planetary vorticity gradient, µ is the bottom friction parameter, ν is the lateral eddy viscosity, and u = (u, v) is the velocity vector.  [10]. Forcing in (19) is introduced via a vertically sheared, baroclinically unstable background flow (e.g., [3]) where the parameters U i are background-flow zonal velocities. The PV anomaly and stream function are related through two elliptic equations: with stratification parameters s 1 , s 2 . System (19)-(21) is augmented by the integral mass conservation constraint [15] ∂ ∂t by the periodic horizontal boundary conditions, and no-slip boundary conditions set at northern and southern boundaries of the domain.

Numerical method
The QG model (19)-(24) is solved with the CABARET method, which is based on a second-order, non-dissipative and low-dispersive, conservative advection scheme [13]. The CABARET scheme can simulate large-Reynoldsnumber flow regimes at lower computational costs compared to conventional methods (see, e.g., [2,23,20,12]), since the scheme is low dispersive and non-oscillatory.
The CABARET method is a predictor-corrector scheme in which the components of the conservative variables are updated at half time steps. Algorithm 1 illustrates the principal steps of the CABARET method adopted from [13]. To make the notation more concise, we introduce the forward difference operators in space and omit spatial and layer indices wherever possible, unless stated otherwise. The time step of the CABARET scheme is denoted by τ .

Algorithm 1 CABARET scheme for the deterministic QG system (19)-(24)
Predictor q is added in the prediction step after the elliptic problem is solved.
Solve the elliptic system of equations with respect to (ψ 1 ) , (v l )

Correction of the computed cell-face PV anomaly values
If q n+1 Corrector q n+1 , v(q n+1 ) are computed in the extrapolation step.

Stochastic case
The stochastic version of the QG equations (19) is given by [11]: The stochastic terms marked in red color is the only difference from the deterministic QG model (19), all other equations remain the same as in the deterministic case. However, the CABARET scheme in the stochastic case differs from the deterministic version and therefore its use can only be justified if it is consistent with the stochastic QG model. In other words, the CABARET scheme should be in the Stratonovich form.

Numerical method
The CABARET scheme for the stochastic QG system (25) is given by Algorithm 2 (with the stochastic terms highlighted in red). To the best of our knowledge, the CABARET scheme has not been applied to the stochastic QG equations, and is used in this work for the first time.
In order to show that the CABARET scheme is consistent with the stochastic QG model, we rewrite the scheme as the improved Euler method (also known as Heun's method) [14], which solves stochastic differential equations (SDEs) in the form of Stratonovich.
In doing so, we omit the space indices for the potential vorticity anomaly q to emphasize the functional dependence on q, and introduce an extra variable which allows to recast (26) and (27) (see Algorithm 2) in the form Substitution of (28a) into (28b) and (26) into the forcing term F visc ψ q n+ 1 2 leads to where Algorithm 2 The CABARET scheme for the stochastic QG system Predictor q is added in the prediction step after the elliptic problem is solved.
Solve the elliptic system of equations with respect to (ψ 1 ) Update velocity components at the cell faces

Correction of the computed cell-face PV anomaly values
If q n+1 Corrector q n+1 where q n+1 , u(q n+1 ), v(q n+1 ) are computed in the extrapolation step.
Retaining the terms up to order ∆t in (29) we get where G β does not depend on q n , and H.O.T. denotes higher order terms. Thus we have shown that the CABARET scheme is in Stratonovich form up to order (∆t) 3/2 .

Consistency in time of the stochastic CABARET scheme
In this section we prove that the stochastic CABARET scheme (30) is consistent with the stochastic QG equation (25) in the mean square sense in time, since its consistency in space is guaranteed by its second order approximation [13]. We consider a Stratonovich process q = q(t, x), x = (x, y) satisfying the SPDE and rewrite it in the Itô form or alternatively with the stochastic and deterministic parts defined as q d : We define consistency for SPDE (31) as follows Definition 1 We say that a discrete time-space approximation q n = q n d + q n s of q = q d + q s with the time step ∆t and space steps ∆x = (∆x 1 , ∆x 2 , . . . , ∆x d ) is consistent in mean square of order α > 1 and β > 1 in time and space with respect to (31) if there exists a nonnegative function for all fixed values q n , time n = 0, 1, 2, . . . and space indices.
Since our focus in this section is on consistency in time, we have to prove that the following estimation holds: Theorem 1 Assuming that there exists a constant C > 0 such that the following assumptions hold with |r − s| ≤ ∆t, the stochastic CABARET scheme (30) is consistent in mean square with c(∆t) = (∆t) 2 .
Proof Integration of (31) with respect to time over the interval [s, t] gives Substitution of (30) and (33) into (32) leads to By combining the terms in (34), we get where Applying the triangle and Young's inequalities to (35) we arrive at Using A2, the Cauchy-Schwarz inequality and A1, we estimate the first term as Estimation of the second term is given by To estimate the term C in (35), we use the triangle inequality to get and then separately estimate each term on the right hand side. Applying the Cauchy-Schwarz inequality and A4 to C 1 , we get the following estimation The term C 2 is estimated as Using A3 for C 3 leads to Finally, we arrive at the following estimation which proves the theorem.

Data assimilation methods
We find it useful to describe the framework and the data assimilation methodology through the language of nonlinear filtering. For this purpose, let us consider a probability space (Ω, F, P) on which we define a pair of processes Z and Y . The process Z is normally called the signal process, and the process Y models the observational data and is called the observation process. In the context of this work the signal process, also called the true state, is given by the solution of the deterministic QG equation (19) computed on a fine grid G f = 2049 × 1025 and projected onto a coarse grid, denoted by G s (details below), by spatially averaging the high-resolution stream function ψ f over the corresponding coarse grid cells.
The filtering problem consists in computing/approximating the posterior distribution of the signal Z t , denoted by π t given the observations Y s , s ∈ [0, t]. In our context, the observations consist of noisy measurements of the true state recorded at discrete times (every 2 or 4 hours) and are taken at locations (called weather stations) on a data grid G d defined below. The data assimilation is performed at these times, which we call the assimilation times.
The most basic particle filter, called the bootstrap particle filter (see section 5.1 for details) uses ensembles of, say, N particles that evolve according to the law of the signal between assimilation times. At the data assimilation times, each particle is weighed according to the likelihood of its position given the new data. A new set of particles is then obtained by sampling N times (with replacement) from the set of weighted particles. The end result is that particles with high likelihoods (close to the true trajectory) are kept and possibly multiplied, and particles with small likelihoods (that are away from the true trajectory) are eliminated. As a result, the ensemble of particles should stay closer to the true trajectory when compared to the particles that evolve according to the signal distribution. The bootstrap particle filter described below uses multiple copies of the signal which, in our case, would require the resolution of the deterministic QG equation (19) on a fine grid G f = 2049 × 1025. Each run of the particle filter at such a high resolution is very expensive computationally (one data assimilation step takes approximately 15 minutes). Taking into account that we assimilate data over thousands steps, the computational resources needed are too large. For this reason we replace the true state with a proxy. We use a process X defined on the same probability space whose sample paths are a lot cheaper to simulate. In our case, the process X will be the solution of the stochastic QG equation (25) computed on the (signal) grid G s (each run of the process X requires around 20 seconds). This replacement is a form of model reduction: we reduce the dimension of the underlying state from G f = 2049 × 1025 to G s = 129 × 65. This model reduction is critical to successful implementation of the data assimilation procedure. It is also rigorously justified as we explain now.
The posterior distribution π t depends continuously on two constituents: the (prior) distribution of the signal and the observation data. That means that if we replace the original signal distribution with a proxy distribution, we will still obtain a good approximation of π t provided the original and the proxy distributions are close to each other in some suitably chosen topology on the space of distribution.
For the current work, the way in which we ensure that two distributions (original and proxy) remains close to each other is by adding the right type of stochasticity to the model and "in the right directions". This is done through the Stochastic Advection by Lie Transport (SALT) approach [11]. The stochasticity is calibrated to match the fluctuations of X as explained in see [7,5]. We emphasize that one does not seek a pathwise approximation of the true state, but only an approximation of its distribution.
In our case, the true state is deterministic, and the process X is random. As we saw in [7,5], one can visualise the distribution of X through ensembles of particles with trajectories that are solutions of the stochastic QG equation (25) computed on the grid G s and driven by independent families of Brownian motions. In the language of uncertainty quantification, the difference between the two distributions is interpreted as the "uncertainty of the model". Typically, the ensemble of particles is a spread "around" the true trajectory, the size of the spread measuring the (model) uncertainty. To visualise this one can look at projections of the true trajectory and the ensembles of particles at various grid points. Of course, the more refined the grid G s is, the closer the two distributions are, and the smaller the spread. However, refining the grid G s leads to an increase in the computational effort of generating the particle trajectories. One of the roles of data assimilation is to reduce the spread (the uncertainty) without refining the grid.
The average position of the ensemble of particles obtained through the data assimilation, denoted by Z t , is a pointwise estimate of the true state Z t , whilst the spread of the ensemble is a measure of the approximation error Z t − Z t . As explained in the introduction, the data assimilation methodology presented here is asymptotically consistent: the empirical distribution of the particles converges, as N → ∞ to the posterior distribution π t [8].
As a consequence, Z t converges to the conditional expectation of the true state Z t given the observations, and the empirical covariance matrix converges to the conditional expectation of (Z t − Z t )(Z t − Z t ) T given the observations. 4 . This is true if the particles evolve according to the original distribution of the signal. In our case we use a proxy distribution, so the limit will be an approximation of π t , the difference between this approximation and π t being controlled by the choice of the signal grid G s . The data assimilation methodology described below consists in combination between the bootstrap particle filter and three additional procedures: nudging, tempering and jittering. The bootstrap particle filter cannot be used on its own to solve the data assimilation problem. The reason is that the particle likelihoods vary wildly from each other. That is because the particle themselves stray away from the true state rapidly and in different directions. This is reflected through observation data. One particle likelihood or a small number of such likelihoods will become much higher than the rest, and only the corresponding particle(s) will be selected and multiplied. This will not offer a good representation of the posterior distribution. As we will explain below the additional procedures will eliminate this effect ensuring a reasonably spread set of particles.
In the next subsections we study how each of these individual procedures influences the accuracy of the estimation Z t and the quality of the forecast given by the stochastic QG model. In order to study how the dimension of the observation process Y (the number of weather stations) affects data assimilation, we consider two different data grids G d = {4 × 4, 8 × 4}. We also study stochastic solutions on two different signal grids G s = {129 × 65, 257 × 129} in order to highlight the effect of more accurate proxy distributions on the results.
As stated above, we will use ensembles S of solutions of the stochastic QG equation (25) driven by independent realizations of the Brownian noise W . For the purpose of this paper, the size of the ensemble is taken to be N = 100 and the number of Brownian motion (independent sources of stochasticity) is taken to be K = 32; as already stated, the elements of the ensembles will be called particles. It was numerically shown in [5] that N = 100 and K = 32 is enough to reasonably approximate the fluctuations of the original distribution. Through numerical experiments, we showed that the spread of the ensemble will not increase substantially by taking more particles and/or sources of noise (Brownian motions).
The observations data Y t is, in our case, an M -dimensional process that consists of noisy measurements of the velocity field u taken at a point belonging to the data grid G d : is a projection operator from the signal grid G s to the data grid G d , η = N (0, I σ ) is a normally distributed random vector, with mean vector 0 = (0, . . . , 0) and diagonal covariance matrix I σ = diag(σ 2 1 , . . . , σ 2 M ). Rather than choosing an arbitrary σ = (σ 1 , . . . , σ M ) for the standard deviation of the noise, we use the standard deviation of the velocity field computed over the coarse grid cell of the signal grid.
We introduce the likelihood-weight function with M being the number of grid points (weather stations). In order to measure the variability of the weights (36) of particles we use the effective sample size: which is close to the ensemble size N if the particles have weights that are close to each other, and decays to one, as the ensemble degenerates (i.e. there are fewer and fewer particles with large weights and the rest have small weights). One should resample for the weighted ensemble if the ESS drops below a given threshold, N * , We chose N * = 80 to be our threshhold.

Bootstrap particle filter
In this section we consider the most basic particle filter, called the bootstrap particle filter or Sampling Importance Resampling filter [9]. This method works as follows.
Given an initial distribution of particles, each particle is propagated forward according to the stochastic QG equation. Then, based on partial observations, Y tj+1 of the true state, the weights of new particles are computed, and if the effective sample size drops below the critical value N * , the particles are resampled to remove particles with small weights.

Algorithm 3 Bootstrap particle filter
Obtain observation Yt j+1 from weather stations Compute w := W(qt j+1 , Yt j+1 ) if ESS(w) < N * then qt j+1 := Resample(w) end if end for For a high dimensional problem such as the one studied in this paper, the effective sample size drops very quickly to 1 as the sample degenerates rapidly. The reason for this is that particles travel very quickly away from the true state, and this is picked up by the observation data (unless the measurement noise is large which is not in our case -the observations are accurate). To counteract this, the resampling procedure would need to be performed unreasonably frequently or a large number of particles would need to be used.
To maintain the diversity of the ensemble we use instead three additional procedures: the tempering technique and jittering based on the Metropolis-Hastings Markov chain Monte Carlo (MCMC) method. We explain each of these procedures in the following sub-sections.

Tempering and Jittering
We will explain briefly the usage of these two procedures, see [6] for further details. The idea behind tempering is to artificially flatten the weights thorough rescaling the log likelihoods by a factor φ ∈ (0, 1], which is called temperature. Once this is done resampling can be applied. This gives a much more diverse ensemble as the ESS will have more reasonable values (the temperature is specifically chosen to ensure this). However, some of resulting particles will still have duplicates. To eliminate these, one uses jittering.
Jittering is another technique which improves the diversity of the ensemble by computing new particles which have been duplicated during resampling. There are different ways how to diversify the ensemble. For example, one can jitter the particles by simply adding some random perturbations to them. However, in this case, the perturbed particles are not the solutions of the stochastic QG equation that, in turn, can lead to nonphysical behaviour of the model. Instead, we compute new particles by solving the stochastic QG equation driven by the Brownian motion ρW + 1 − ρ 2 d W , where W is the original Brownian motion W and W is a new Brownian motion independent of W . The perturbation parameter ρ is chosen so that particles are not placed too far from the original position, yet far enough to ensure the diversity of the sample. In our experiments, we use ρ = 0.9999. Each new proposal for the position of the particle is then accepted/rejected thorough a standard Metropolis-Hastings method, in which M 1 stands for the number of iterations; we choose M 1 = 20. This ensures that the perturbations do not change the sought distribution.
Of course, after the first tempering-jittering cycle has finished, the particles in the resulting ensemble are samples of the altered distribution which is not what we desire, therefore the procedure is repeated by finding the next temperature value in the range (ϕ, 1] that offers a reasonable ESS. This is repeated until the temperature scaling is 1 so that the original distribution is recovered. The tempering-jittering methodology is given by Algorithm 4 below.

Nudging
Tempering combined with jittering is a powerful technique which can correctly narrow the stochastic spread in the presence of informative data, while also maintaining the diversity of the ensemble over a long time period. Their combined success depends crucially on the quality of the original sample proposals. This quality is produced by evolving the particles using the SPDE (the proxy distribution) and not the true distribution. To reduce the discrepancy introduced in this way, one can use nudging. The idea of nudging is to correct the solution of SPDE (25) so as to keep the particles closer to the true state. To do so, we add a 'nudging term' (marked in blue) to SPDE (25), Note that q depends on the parameter λ. The trajectories of the particles will be solutions of this perturbed SPDE (38). To account for the perturbation, the particles will have new weights according to Girsanov's theorem, Obtain observation Yt j+1 from weather stations Compute w := W(qt j+1 , Yt j+1 ) if ESS(w) < N * then Find p such that ESS(w) ≥ N * , where w := W 1/p (qt j+1 , Yt j+1 ) for k = 1, 2, . . . , p do end if end for end for end for end if end for given by As explained above, these weights measure the likelihood of the position of the particles given the observation, and the last term accounts for the change of probability distribution from q to q(λ). It therefore makes sense to choose weights that maximize these likelihoods. In other words, we could look to solve the equivalent minimization problem together with (38). In general this is a challenging nonlinear optimisation problem, especially if one allows the λ k 's to vary in time.
To simplify the problem, we perturb only the corrector stage of the final timestep before t j+1 . Then the (discrete version of the) minimization problem (40) becomes where δt is the time step. Let us re-write , where q t j+1/2 andq tj+1 are computed in the prediction and the extrapolation steps, respectively (see Algorithm 2 for detail). We can then re-write the minimisation problem (41) as where This is a quadratic minimization problem with the optimal value λ depending (linearly) on the increments ∆W 1 , ..., ∆W K . This optimal choice is not allowed as the parameter λ can only be a function of all the approximationq tj+1 , q t j+1/2 and Y tj+1 (since it needs to be adapted to the forward filtration of the set of Brownian motions {W k }). To ensure that this constraint is satisfied, we minimise the conditional expectation of V(q(λ), Y, λ) given theq tj+1 , q t j+1/2 and Y tj+1 , that is min We note that Q is independent of λ and does not play any role in the minimization operation.
Obtain observation Yt j+1 from weather stations Compute w := W(qt j+1 , Yt j+1 ), with W(qt j+1 , Yt j+1 ) = e −Λ , Λ : .N ] for n=1,2,. . . ,N do We would like to draw the reader's attention to the fact that the solution significantly depends on the resolution (see Fig. 1). Namely, the number of jets for the highest resolution 2049 × 1025 (Fig. 1a) is four (four red striations in the interior of the computational domain; the boundary layers on the top and bottom boundaries are not counted as such). However, there are only two jets for the lower-resolution flows: G = 1025×513 (Fig. 1b), G = 513 × 257 (Fig. 1c), G = 257 × 129 (Fig. 1d). Moreover, the lowest resolution flow (computed on the grid G = 129 × 65, Fig. 1e) shows no jets at all, and this is the flow that we paramaterise and then apply the data assimilation methods described above. We also use a finer grid G = 257 × 129 to study the effect of the resolutions on the data assimilation methods. It is important to note that there is no smooth transition between solutions at different resolutions like, for instance, in the double-gyre problem (e.g. [18]), and this makes lowerresolution solutions much harder to parameterise, since the parameterisation should compensate not only for the information lost because of coarse-graining, but also for the missing physical effects. For example, in the channel flow, the backscatter mechanism (e.g, [19]) at low resolutions is very weak, and thus it is not capable of cascading energy up to larger scales to maintain the jets.
In the following, we compare the dependence of the performance of the various procedures discussed above on the following parameters: the resolution of the signal grid, the number of observations (also referred to as weather stations), and the size of the data assimilation step. The methods will be applied to the parameterised QG model (25) which has been studied at length in [5].
Before going into further details, we remind the reader how we compute the true solution, which is denoted as q a , and also referred to as the truth or the the true state. For the purpose of this paper, we have computed two versions of the true solution q a . The first one is computed on a signal grid G s = 257 × 129 (dx ≈ dy ≈ 15 km), and the other one is computed on a signal grid G s = 129 × 65 (dx ≈ dy ≈ 299 km) (Fig. 2). Each true solution is computed as the solution of the elliptic equation (21) with the stream function ψ a , where ψ a is computed by spatially averaging the high-resolution stream function ψ f (computed on the fine grid G f = 2049 × 1025, dx ≈ dy ≈ 1.9 km) over the coarse grid cell G s . From now on, we focus only on the first layer, since the flow in the first layer is more energetic and exhibits an interesting dynamics including small-scale vortices and large-scale zonally elongated jets. In order to study how the number of weather stations influences the accuracy of data assimilation methods we will consider two different setups including M = 16 and M = 32 weather stations. Clearly, the location of weather stations can be optimized in such a way so as to give the most accurate data assimilation results. For the purpose of this work, we locate the weather stations at the nodes of equidistant Eulerian grids G d = {4×4, 8×4} shown in Fig. 3. In all numerical experiments we use N = 100 particles (ensemble members) and K = 32 (the number of ξ's); the choice of these parameters has been justified in [5]. It is worth noting that the initial conditions for the stochastic model have been computed over the spin up period T spin = [−8, 0] hours. The method of computing physically consistent initial conditions for the stochastic model is given in [5].

Tempering and jittering
We start with the study to the data assimilation algorithm that uses tempering and jittering, but not nudging (Algorithm 4). In the first experiment we run the stochastic QG model at the coarsest resolution (G s = 129×65) and use M = 16 weather stations. We compare the results of the data assimilation methodology with the forward run of the stochastic model. For this experiment we take the data assimilation step to be ∆t = 4 hours (the data assimilation step is time between to consecutive analysis cycles. The results are presented in Fig. 4 in terms of the relative bias (RB) and the ensemble mean l 2 -norm relative error (EME) given by with u p n being the n-th member of the stochastic ensemble, u a is the true solution, andū p := 1 N N n=1 u p n .
As Fig. 4 shows, the data assimilation method presented by Algorithm 4 (blue line) offers little improvement over the SPDE run without data assimilation methodology (red line) both in terms of the relative bias and in terms of the EME at the weather stations (Fig. 4a) and in the whole domain (Fig. 4b) throughout the time period of 20 days. As we will see later, the situation will improve as we decrease the data assimilation step and/or increase the resolution of the signal grid. But before, let us first look at the uncertainty quantification results for this particular setting. As expected, the spread for the stochastic QG model (25) decreases (to a certain extend) after the application of the data assimilation methodology. We illustrate the shrinkage of the stochastic spreads in Fig. 5 for the velocity computed at weather stations located in the slow flow region (blue stripes) and fast flow region (red jets) in Fig. 3.   5 shows that the truth (green line) is contained within the stochastic spread computed with and without the data assimilation method. Moreover, the spread computed with the data assimilation method (blue spread) is narrower than that computed only with the SPDE (red spread). To reduce it further, one can vary the data assimilation step ∆t. In particular, halving ∆t brings further reduction in both RB and EME (Fig. 6) and also reduces the uncertainty of the stochastic solution ( Fig. 7) (the spread is narrower).  The same as in Fig. 4, but for the data assimilation step ∆t = 2h. Fig. 7 The same as in Fig. 5, but for the data assimilation step ∆t = 2h.
Further substantial improvements in the performance of the data assimilation methodology are obtained when the resolution of the signal grid gets higher (G s = 257 × 129). In particular, the results are much more accurate both at the observation points (weather stations) (Fig. 8a) and over the whole domain (Fig. 8b). Moreover, the spread of the sample reduces dramatically as shown in figure Fig. 9.   Based on the above results, we conclude that both the reduction of the data assimilation step and the increase of the resolution of the signal grid can enhance the accuracy of the stochastic solution and reduce the spread of the stochastic ensemble. The effect of the increase of the resolution on both the accuracy and uncertainty appears to be much more pronounced compared to that of the reduction of the data assimilation step.

Nudging
We will now look at the performance of the data assimilation methodology that includes nudging (Algorithm 5) compared with the one that does not (Algorithm 4). As above we start with the stochastic QG model at the coarsest resolution (G s = 129 × 65) and use M = 16 weather stations. We present the results in Figs. 10 and 11. The improvements are obvious straightaway. The average of the stochastic spread becomes closer to the true solution compared with the same case but without using the nudging procedure (Fig. 10). However, in some cases, the true solution leaves the spread of the ensemble (Fig. 11). We do not have a clear explanation for this behaviour.   Again, at the higher resolution (G s = 257 × 129), the nudged solution is even more accurate than its low-resolution version (Fig. 12), and the uncertainty is further reduced (Fig. 13). The same as in Figure 11, but for the signal grid Gs = 257 × 129.
It is important to note that the number of observation locations (weather stations) used in the simulations above is only 0.19% and 0.05% of all degrees of freedom for the grids G s = 129 × 65 and G s = 257 × 129, respectively. Obviously, this number of weather stations is not enough to significantly reduce the uncertainty and decrease the error between the true solution and its parameterisation. Therefore, in the next simulation we double the number of observation locations (M = 32) and compare how Algorithm 5 (data assimilation with the nudging method) performs when more observational data is available.
As can be seen in Fig. 14, adding more weather stations does not significantly influence the results for the low-resolution (Fig. 14) and higher-resolution (Fig. 14) simulation.  Fig. 14 Evolution of the ensemble mean relative l 2 -norm error (EME) and relative bias (RB) for (a) all weather stations and (b) the whole domain; u a is the true solution, u p is the stochastic solution. In order to assimilate data we use tempering, jittering, and nudging (Algorithm 5); the data is assimilated from M = 32 weather stations every 4 hours; the grid size is Gs = 129 × 65.
When more observational data is available (32 weather stations), the uncertainty of the stochastic solution remains virtually the same compared with the case of using 16 weather stations for both G s = 129 × 65 (Fig. 16) and G s = 257 × 129 (Fig. 17).   Fig. 15 The same as in Figure 14, but for the signal grid Gs = 257 × 129.
Fig. 16 Shown are typical stochastic spreads of velocity u p 1 = (u p 1 , v p 1 ) at the weather stations located in the slow flow region (upper row) and fast flow region (lower row) for the SPDE without (red) and with (blue) using tempering, jittering, and nudging (Algorithm 5). The green line is the true solution; the grid size is Gs = 129 × 65.
Fig. 17 The same as in Fig. 16, but for the signal grid Gs = 257 × 129.
Again, we conclude that the Algorithm 4 improves the accuracy and reduces the uncertainty of the stochastic spread. The three parameters that we studied, the data assimilation step, and the size of the grid and the number of weather station, influence the performance of the data assimilation methodology. In particular, the smaller the data assimilation step is, the higher the accuracy becomes. We have found that the number of weather stations has a minor effect on the accuracy. However, if the resolution of the signal grid increases so does the accuracy of the solution computed with using the data assimilation methodology. Moreover, increasing the resolution of the signal grid G s dramatically reduces the uncertainty. The same conclusions are true for the Algorithm 5 that incorporates the nudging method. Moreover, the nudging procedure gives even more accurate solutions and further reduces the spread when compared with Algorithm 4.
We have also assessed the ensemble reliability on the grid G s = 257 × 129 by analysing the rank histograms (see, e.g. [1]) computed by simulating the stochastic model (without using the data assimilation methodology) (Fig. 18a) and compared them with the rank histograms for the ensemble produced by Algorithm 5 (Fig. 18b). As the plots show, the stochastic ensemble, computed without the regular corrections made through the data assimilation method, has a bias (rank histograms are not flat) at many observation locations. As a result it will makes the ensemble prediction unreliable.  Rank histograms for velocity u p 1 = (u p 1 , v p 1 ) at six different locations (not shown). Three observation points are located in the fast flow within red jets (first three upper rows in the plot), and the other three observation points are located in the slow flow between the jets (next three rows in the plot). Each histogram is based on 500 forecast-observation pairs generated by solving the stochastic QG model without (a) and with (b) using data assimilation. For simulating the stochastic QG model we use N = 100 (ensemble size), K = 32 (number of ξ's); for the data assimilation method we use Algorithm 5 (tempering, jittering, and nudging), the number of observation locations M = 32, the data assimilation step ∆t = 4h. Each ensemble member for the rank histogram is selected randomly from the ensemble of 100 members every 4 hours.
The situations changes when one uses the data assimilation method based one tempering, jittering, and nudging (Algorithm 5); see Fig. 18b. All the rank histograms now show no sign of a pronounced bias and demonstrate a well-calibrated ensemble. In other words, the application of the data assimilation methodology proposed in the paper corrects the bias introduced by the paramaterisation and produces a reliable forecast.
Finally, we have compared true PV anomaly q fields and the ones computed with using the data assimilation methodology based on tempering, jittering, and nudging (Algorithm 5), see Fig. 19.  As Fig. 19 shows, the paramaterised solutionq p 1 using the data assimilation methodology based on tempering, jittering, and nudging (Algorithm 5) gives an accurate forecast of the true state q a 1 even with a very small number of weather stations.

Conclusion and future work
In this paper, we have reported on our recent progress in developing an ensemble based data assimilation methodology for high dimensional fluid dynamics models. Our methodology involves a particle filter which combines model reduction, tempering, jittering, and nudging. The methodology has been tested on the twolayer quasi-geostrophic model with O(10 6 ) degrees of freedom. Only a minute fraction of these are noisily observed (16 and 32 weather stations). The model is reduced by following the stochastic variational approach for geophysical fluid dynamics introduced in [11]. We have also introduced a stochastic time-stepping scheme for the quasi-geostrophic model and have proved its consistency in time. In addition, we have analyzed the effect of different procedures (tempering, jittering, and nudging) on the accuracy and uncertainty of the stochastic spread. Our main findings are as follows: -The tempering and jittering procedure (Algorithm 4) improves the accuracy by reducing the uncertainty of the stochastic spread; -The nudging procedure (Algorithm 5) brings major improvements to the combinations of the tempering and jittering (Algorithm 4), both in terms of the relative bias (RB) and ensemble mean relative l 2 -norm error (EME) -The number of weather stations has a minor effect on the RB and EME; -The size of the data assimilation step has a substantial effect; namely, the smaller the data assimilation step, the higher the accuracy; -The resolution of the signal grid significantly improves the accuracy and reduces the uncertainty of the stochastic spread; -The proposed data assimilation methodology corrects the bias introduced by the paramaterisation and produces a reliable forecast.
We regard the data assimilation method based on tempering and jittering combined with the nudging method proposed here as a potentially valuable addition to data assimilation methodologies. We also expect it to be useful in developing data assimilation methodologies for larger, more comprehensive ocean models. The combination of the four components presented here (model reduction, tempering, jittering and nudging) can be enhanced by further improvements including localization, space-time data assimilation, etc. Applications of these combined components will form the subject of subsequent work.