1 Introduction

Present-day computational resources make it possible to build sophisticated ocean models. There are always unavoidable inaccuracies in the observations and in the models; therefore, a realistic oceanographic system must be capable of modeling these uncertainties. Data assimilation methods provide an effective way of reducing uncertainties in ocean dynamic systems.

Data assimilation (DA) is usually used to enhance model predictions by constraining their outputs with available observations. DA methods are usually divided into two main categories: variational methods based on least-squares fitting (Bennet 1992) and sequential methods based on the so-called Kalman filter (Evensen 2003). In recent years, sequential ensemble Kalman filter methods have become popular in geophysical applications because they are easy to implement, robust, and computationally affordable. Different ensemble Kalman filter (EnKF) variants have been developed, depending on whether or not the observations are perturbed before assimilation (Tippett et al. 2003; Burgers et al. 1998; Verlaan and Heemink 1997; Houtekamer and Mitchell 1998; Anderson 2001; Bishop et al. 2001; Whitaker and Hamill 2002; Hoteit et al. 2002).

EnKFs are often used with ensembles that are small compared to the numerical dimension of the system state (e.g., ensembles are often of the order of \(\mathcal {O}(10-100)\) members, while the dimension of the state vector of an ocean model can easily be in the millions). This means that the estimation error covariance matrix is always singular and can only represent the model errors in a low-dimensional subspace. A small ensemble size can lead to systematically underestimated variances and spuriously large cross covariances in the sample error covariance matrix (Hamill et al. 2001). To mitigate these effects, two auxiliary techniques are used: covariance inflation (Anderson and Anderson 1999) and localization (Hamill et al. 2001). Covariance inflation tackles the underestimation of the variances, while covariance localization addresses the singularity and the overestimation of the cross covariances.

Implementing a correct and efficient DA method for a model can be an elaborate task. Various software packages provide standard implementations that remove the need to develop custom implementations for each simulation model. All these packages have their strong and weak points, and the package that suits a particular application best depends on many factors. Among these packages are PDAF (Nerger and Hiller 2013), a toolbox written in FORTRAN 90 that contains highly optimized implementations of the LETKF and LSEIK filters for use in strongly parallel environments; Sequoia (Le Hénaff and Marsaleix 2009), a collection of FORTRAN 95 routines that allows users to build their own data assimilation system; DART (Raeder et al. 2012), a framework containing ensemble-based data assimilation algorithms and tools; the NERSC repository, which contains implementations of various EnKFs; OAK (Barth et al. 2008), a toolbox containing various filters and tools that interacts with the model through NetCDF files; and SESAM (Brasseur 2006), a set of data assimilation building blocks that are combined using NetCDF files. OpenDA is a data assimilation framework written in Java, with some parts in C. Models can be coupled to OpenDA in two ways: an in-memory coupling through a software library, or a black box coupling in which all interaction between the data assimilation method and the model goes through input and output files. OpenDA is described in more detail in Section 4. A thorough description of all the frameworks mentioned above can be found in (Heemink et al. 2012).

In this work, we generalize the OpenDA framework to the specific needs of ocean applications. We have coupled the NEMO (Madec 2014) ocean model to the OpenDA toolbox using the black box approach. Several synthetic experiments, assimilating simulated altimetry into a double-gyre NEMO ocean model, are performed with the objective of investigating the impact of different EnKF setups on the quality of the analysis. These setups mainly concern the observation distribution, the ensemble size, and the localization length scale. Furthermore, attention is paid to the efficiency and usefulness of localization. Presently, OpenDA only supports distance-based localization for in-memory coupling. The automatic localization technique that is part of the OpenDA toolbox, however, is not limited to in-memory coupled models and can easily be applied to the NEMO model using the black box coupling. The development of automatic localization techniques is still new and limited to a few studies, e.g., (Anderson and Anderson 1999; Zhang and Oliver 2011). Of these, only the latter is suitable for operational use because of its relatively limited additional computational cost. It is interesting to investigate whether automatic localization is able to improve the performance of the EnKF when it is applied to a medium-sized model like the double-gyre NEMO configuration.

The double-gyre NEMO ocean model corresponds to an idealized configuration of the NEMO model: a square, 5000-m-deep, flat-bottom ocean at mid latitudes (the so-called square-box or SQB configuration). The domain extends from 24° N to 44° N and over 30° in longitude (60°–30° W), with 11 vertical levels between 152 and 4613 m in depth. The minimum horizontal resolution of the model is 1/4°. This setup is known as the SEABASS configuration (Bouttier et al. 2012).

The paper is organized as follows. Section 2 presents an overview of the EnKF implemented in this study. Section 3 describes the automatic localization technique. A brief overview of the OpenDA data assimilation toolbox with its new features, parallelization of ensembles and automatic localization, is given in Section 4. The NEMO ocean model configuration and the assimilation setup are explained in Section 5. In Section 6, assimilation results of the EnKF are presented and discussed. Concluding remarks follow in Section 7.

2 Ensemble Kalman filter

Consider the following discrete dynamical system

$$ \mathbf{x}^{t}_{k} = \mathbf{M}_{k,k-1}\mathbf{x}^{t}_{k-1} + \eta_{k}, $$
(1)

where \(\mathbf {x}^{t}_{k}\) denotes the vector representing the true state of the system at time k, \(\mathbf{M}_{k,k-1}\) is the state transition operator that takes as input the state at time k−1 and outputs the state at time k, and \(\eta_{k}\) is the system noise with covariance matrix \(\mathbf{Q}_{k}\). At time k, the observed data vector is given by

$$ \mathbf{y}_{k} = \mathbf{H}_{k} \mathbf{x}^{t}_{k} + \epsilon_{k}. $$
(2)

Here, \(\mathbf{y}_{k}\) is the vector of observed values and \(\epsilon_{k}\) is the observational noise with covariance matrix \(\mathbf{R}_{k}\). If both the dynamical and observation systems are linear, the state estimation problem is solved by the Kalman filter (Kalman 1960). The main procedure of the Kalman filter recursively computes the mean and covariance matrix of the system state \(\mathbf{x}_{k}\) (conditioned on the available observations), as outlined below.

  • Forecast step: At time instant (k−1), propagate the state mean (also called the analysis) \(\mathbf {x}_{k-1}^{a}\) and the associated error covariance matrix \(\mathbf {P}_{k-1}^{a}\) to obtain the forecast state (also referred to as the background) \(\mathbf {x}_{k}^{f}\) and the associated covariance \(\mathbf {P}_{k}^{f}\) at the next time k:

    $$ \mathbf{x}_{k}^{f} = \mathbf{M}_{k,k-1} \mathbf{x}_{k-1}^{a} \, , $$
    (3)
    $$ \mathbf{P}_{k}^{f} = \mathbf{M}_{k,k-1} \, \mathbf{P}_{k-1}^{a} \, \mathbf{M}_{k,k-1}^{T} + \mathbf{Q}_{k}\, . $$
    (4)
  • Analysis step: When a new observation \(\mathbf{y}_{k}\) becomes available at time k, update \(\mathbf {x}_{k}^{f}\) and \(\mathbf {P}_{k}^{f}\) to their analysis counterparts, \(\mathbf {x}_{k}^{a}\) and \(\mathbf {P}_{k}^{a}\), with the Kalman update formula:

    $$ \mathbf{x}_{k}^{a} = \mathbf{x}_{k}^{f} + \mathbf{K}_{k} \left( \mathbf{y}_{k} - \mathbf{H}_{k} \mathbf{x}_{k}^{f} \right) \, , $$
    (5)
    $$ \mathbf{P}_{k}^{a} = \mathbf{P}_{k}^{f} - \mathbf{K}_{k} \mathbf{H}_{k} \mathbf{P}_{k}^{f} \, , $$
    (6)
    $$ \mathbf{K}_{k} = \mathbf{P}_{k}^{f} \mathbf{H}_{k}^{T} (\mathbf{H}_{k} \mathbf{P}_{k}^{f} \mathbf{H}_{k}^{T} + \mathbf{R}_{k} )^{-1} \, , $$
    (7)

    with \(\mathbf{K}_{k}\) being the Kalman gain; a minimal code sketch of this forecast-analysis cycle is given below.
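
For concreteness, the following is a minimal Python/NumPy sketch of one forecast-analysis cycle of the linear Kalman filter (Eqs. 3–7). The matrices and vectors are generic arrays supplied by the caller, and the explicit matrix inverse is used only for readability in a small illustrative problem.

```python
import numpy as np

def kalman_step(x_a, P_a, M, Q, H, R, y):
    """One forecast-analysis cycle of the linear Kalman filter (Eqs. 3-7)."""
    # Forecast step: propagate mean and covariance
    x_f = M @ x_a                          # Eq. (3)
    P_f = M @ P_a @ M.T + Q                # Eq. (4)
    # Analysis step: Kalman gain and update
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)   # Eq. (7)
    x_a_new = x_f + K @ (y - H @ x_f)      # Eq. (5)
    P_a_new = P_f - K @ H @ P_f            # Eq. (6)
    return x_a_new, P_a_new
```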

Geophysical fluid dynamics models (and sometimes the observation operators) are nonlinear, and thus the conventional Kalman filter cannot be directly applied for data assimilation into these systems. To account for nonlinearities, modifications are introduced into the original Kalman filter using a Monte Carlo approach, usually referred to as ensemble Kalman filtering (EnKF) (Evensen 1994). The EnKF algorithm uses an ensemble of system states to represent the model error. The ensemble size n is typically much smaller than the dimension \(m_{x}\) of the state vector. Suppose that an n-member analysis ensemble \(\mathbf {X}^{a}_{k-1} = \{\mathbf{x}^{a}_{k-1,i}: i= 1,2,...,n\}\) is available at the (k−1)th filtering step. Propagating each member forward with the model yields the forecast ensemble \(\mathbf {X}^{f}_{k}\) at instant k, whose sample mean \(\hat {\mathbf {x}}_{k}^{f}\) and covariance \(\hat {\mathbf {P}}_{k}^{f}\) are given by

$$ \hat{\mathbf{x}}_{k}^{f} = \frac{1}{n} \sum\limits_{i=1}^{n} \mathbf{x}^{f}_{k,i}, \qquad \hat{\mathbf{P}}_{k}^{f} = \frac{1}{n-1} \sum\limits_{i=1}^{n} \left( \mathbf{x}^{f}_{k,i} - \hat{\mathbf{x}}_{k}^{f} \right) \left( \mathbf{x}^{f}_{k,i} - \hat{\mathbf{x}}_{k}^{f} \right)^{T} \, . $$
(8)

In practice, \(\hat {\mathbf {P}}_{k}^{f}\) need not be calculated explicitly. The Kalman gain \(\mathbf{K}_{k}\) is then approximated by

$$ \mathbf{K}_{k} = \hat{\mathbf{P}}_{k}^{f} \mathbf{H}_{k}^{T} (\mathbf{H}_{k} \hat{\mathbf{P}}_{k}^{f} \mathbf{H}_{k}^{T} + \mathbf{R}_{k} )^{-1} $$
(9)

When the observation \(\mathbf{y}_{k}\) is made available, one updates each member of the forecast ensemble by

$$ \mathbf{x}_{k,i}^{a} = \mathbf{x}_{k,i}^{f} + \mathbf{K}_{k} \left( \mathbf{y}_{k,i}^{s} - \mathbf{H}_{k} \mathbf{x}_{k,i}^{f} \right), i = 1, 2, ... , n, $$
(10)

where \( \mathbf {y}_{k,i}^{s}\) are the “surrogate” (i.e., perturbed) observations generated by drawing n samples from the normal distribution with mean \(\mathbf{y}_{k}\) and covariance \(\mathbf{R}_{k}\) (Burgers et al. 1998). Propagating \(\mathbf {X}^{a}_{k}\) forward to the next time instant, one starts a new assimilation cycle, and so on.
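
The analysis step of this stochastic (perturbed-observation) EnKF, Eqs. 8–10, can be sketched in NumPy as follows. The explicit formation of \(\hat{\mathbf{P}}_{k}^{f}\) and of the gain is only feasible for small illustrative problems, and a linear observation operator given as a matrix is assumed.

```python
import numpy as np

def enkf_analysis(Xf, H, R, y, rng=None):
    """Stochastic EnKF analysis with perturbed observations (Eqs. 8-10).

    Xf : (m_x, n) forecast ensemble, one member per column
    H  : (m_y, m_x) linear observation operator
    R  : (m_y, m_y) observation error covariance
    y  : (m_y,)     observation vector
    """
    rng = rng or np.random.default_rng(0)
    n = Xf.shape[1]
    A = Xf - Xf.mean(axis=1, keepdims=True)            # ensemble anomalies
    Pf = A @ A.T / (n - 1)                             # Eq. (8); explicit only for small m_x
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)     # Eq. (9)
    # Perturbed ("surrogate") observations, Burgers et al. (1998)
    Ys = y[:, None] + rng.multivariate_normal(np.zeros(y.size), R, size=n).T
    return Xf + K @ (Ys - H @ Xf)                      # Eq. (10), updated ensemble
```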

The computational time of an ensemble-based data assimilation method is dominated by the forward model simulations. The ensemble members can be propagated forward in time in parallel, but limited computational resources still often force the use of relatively small ensemble sizes. The use of a small ensemble may lead to filter divergence due to spurious correlations and unrealistic updates of the ensemble members (Hamill et al. 2001). Distance-dependent localization techniques are a popular way to compensate for a small ensemble size. These techniques determine weighting coefficients β between model variables and measurement locations. The weights β∈[0,1] are determined from the physical distance between the variables: they are close to zero for elements with no expected correlation and near one for elements between which we expect strong correlations. Localization can be applied directly to the covariance matrices or to the computed gain matrix. In this paper, we only consider the latter and update the model using a localized gain matrix according to

$$ \mathbf{K}^{Loc}= \mathbf{K} \circ {\beta}_{xy}, $$
(11)

where \(\mathbf{K}\) denotes the gain matrix, \(\mathbf{K}^{Loc}\) the localized gain matrix, \(\circ\) the element-wise (Schur) product, and \(\beta_{xy}\) the weighting coefficients between the observation locations and the model state elements.

The localization weights are often determined using a correlation function with local (compact) support that takes the distance as input. Such a function is non-zero only for distances up to a given threshold, like the fifth-order piecewise rational function defined in Gaspari and Cohn (1999). Non-smooth functions assigning only weights of 0 and 1 can also be used (Anderson and Anderson 1999), but these have been shown to be inferior (Anderson 2004). Although the distance is assumed to be a good proxy for correlation, distances cannot always be properly defined for some sources of observations (Anderson 2004), and the correlation scales can be time dependent. Hence, this method often needs considerable tuning to find the optimal localization distances.
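
For reference, a sketch of the Gaspari and Cohn (1999) fifth-order piecewise rational function in its standard form, followed by the Schur-product application of Eq. 11. The half-width c and the distance matrix are illustrative inputs chosen by the user, not values prescribed in this paper.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Fifth-order piecewise rational correlation function of Gaspari and Cohn (1999).

    dist : array of physical distances
    c    : localization half-width; the weights vanish beyond 2*c
    """
    r = np.abs(dist) / c
    w = np.zeros_like(r)
    m1 = r <= 1.0
    m2 = (r > 1.0) & (r <= 2.0)
    w[m1] = (-0.25 * r[m1]**5 + 0.5 * r[m1]**4 + 0.625 * r[m1]**3
             - 5.0 / 3.0 * r[m1]**2 + 1.0)
    w[m2] = (r[m2]**5 / 12.0 - 0.5 * r[m2]**4 + 0.625 * r[m2]**3
             + 5.0 / 3.0 * r[m2]**2 - 5.0 * r[m2] + 4.0 - 2.0 / (3.0 * r[m2]))
    return w

# Localized gain by the Schur (element-wise) product, Eq. (11):
#   dist_xy[i, j] = distance between state element i and observation j (assumed given)
#   K_loc = K * gaspari_cohn(dist_xy, c)
```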

3 Automatic localization

A step forward towards fully generic data assimilation methods is to handle the localization in an automatic way, avoiding manual configuration, modeling, and tuning of the localization distances. Anderson (2004) suggests a hierarchical ensemble filter. This method tries to tackle these heuristic issues and determines the localization weights on a theoretical basis. The idea is to split the ensemble into \(N_{g}\) groups of \(N_{e}\) members each (an ensemble of ensembles). The gain matrices \(\mathbf{K}_{j}\), \(j = 1,..,N_{g}\), are computed for all groups. The elements \(k_{i,j}\) of the gain matrices are assumed to be random draws from the same distribution that contains the true value of \(k_{i}\). The localization weight \(\alpha_{i}\) is selected such that it minimizes

$$ \sqrt{{\sum}^{N_{g}}_{j=1} {\sum}^{N_{g}}_{k=1, k \neq j} \left( \alpha_{i} k_{i,k} - k_{i,j} \right)^{2}} \, . $$
(12)

This method requires a large total ensemble size and is therefore computationally inefficient compared to a traditional implementation using distance-based localization. However, it can be used to derive localization statistics that yield a high-quality localization strategy for production runs.
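
Assuming the squared objective in Eq. 12, setting its derivative with respect to \(\alpha_{i}\) to zero gives a closed-form weight per gain element. The sketch below follows that least-squares reading; it is an illustration of the idea rather than Anderson's (2004) exact formulation.

```python
import numpy as np

def group_filter_weight(k_groups):
    """Least-squares localization weight alpha_i for one gain element (cf. Eq. 12).

    k_groups : 1-D array with the N_g group estimates of the same gain element.
    Minimizes sum over group pairs (j != k) of (alpha * k_{i,k} - k_{i,j})^2.
    """
    k = np.asarray(k_groups, dtype=float)
    n_g = k.size
    s, q = k.sum(), np.sum(k**2)
    if q == 0.0:
        return 0.0
    alpha = (s**2 - q) / ((n_g - 1) * q)
    # Negative values mean the groups disagree even on the sign; they can be
    # clamped to zero so that the corresponding gain element is suppressed.
    return max(alpha, 0.0)
```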

Another approach, suggested by Zhang and Oliver (2011), uses a bootstrap technique to identify spurious correlations in the gain matrix and then generates weights accordingly. From the original ensemble \(\mathbf{x}_{i}\), \(i = 1,..,N_{e}\), \(N_{B}\) bootstrap ensembles of size \(N_{e}\) are generated by randomly drawing members from the original ensemble with replacement. A bootstrapped ensemble will therefore contain some members of the original ensemble several times. From each bootstrapped ensemble, we compute a gain matrix \({\hat {\mathbf {K}}_{k}}\), \(k\in [1,..,N_{B}]\). For each element \(\hat {k}_{i,j}\) of the gain matrix, we compute the estimator of the variance as

$$ \hat{\sigma}^{2}_{{k}_{i,j}}=\frac{{\sum}_{m=1}^{N_{B}}\left( \hat{{k}}^{*}_{i,j,m}-\bar{{k}}_{i,j} \right)^{2}}{N_{B}}, $$
(13)

where \(\bar{k}_{i,j}\) denotes the corresponding element of the mean of the bootstrapped gain matrices \(\hat{\mathbf{K}}_{k}\). The squared coefficient of variation, i.e., the ratio of the variance to the squared mean, is defined as

$$ \hat{C}^{2}_{v_{i},j}=\frac{\hat{\sigma}^{2}_{{k}_{i,j}}}{\bar{{k}}^{2}_{i,j}}. $$
(14)

The localization weights are then defined by

$$ \beta_{i,j}=\frac{1}{1+\left( 1+ 1/\sigma^{2}_{\alpha}\right) \hat{C}^{2}_{v_{i},j}}. $$
(15)

The parameter \(\sigma ^{2}_{\alpha }\) can be used to balance the deletion of spurious correlations against the risk of deleting real correlations. Zhang and Oliver (2010) investigated suitable values for this parameter and arrived at an optimal value of \(\sigma ^{2}_{\alpha }=0.36\).
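
A compact NumPy sketch of the bootstrap screening of Eqs. 13–15 follows. The gain routine, the small floor guarding against division by zero, and the random seed are illustrative choices rather than details prescribed by Zhang and Oliver (2011), and Eq. 15 is used in the form given above.

```python
import numpy as np

def ensemble_gain(Xf, H, R):
    """Ensemble estimate of the Kalman gain (Eq. 9), explicit for small problems."""
    n = Xf.shape[1]
    A = Xf - Xf.mean(axis=1, keepdims=True)
    Pf = A @ A.T / (n - 1)
    return Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)

def bootstrap_localization_weights(Xf, H, R, n_boot=50, sigma2_alpha=0.36, rng=None):
    """Bootstrap screening of the gain matrix (Zhang and Oliver 2011), Eqs. 13-15."""
    rng = rng or np.random.default_rng(0)
    n = Xf.shape[1]
    # Resample ensemble members with replacement and recompute the gain each time
    K_hat = np.stack([ensemble_gain(Xf[:, rng.integers(0, n, size=n)], H, R)
                      for _ in range(n_boot)])
    K_bar = K_hat.mean(axis=0)
    sigma2 = ((K_hat - K_bar) ** 2).sum(axis=0) / n_boot            # Eq. (13)
    C2 = sigma2 / np.maximum(K_bar ** 2, 1e-30)                     # Eq. (14)
    return 1.0 / (1.0 + (1.0 + 1.0 / sigma2_alpha) * C2)            # Eq. (15)
```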

This method can easily be incorporated into an existing implementation of an EnKF. The analysis step becomes roughly \(N_{B}\) times more expensive computationally, which may be acceptable when assimilating a moderate number of observations but is not practical for large numbers of observations. A possible solution for assimilating large numbers of observations is to define a large region around each observation, outside of which it is safe to assume that there is no correlation between the observation and the model state. The automatic localization then only needs to be applied to the state elements inside this region.

4 OpenDA data assimilation toolbox

OpenDA is an open-source toolbox for developing and applying data assimilation and calibration algorithms to dynamic models (OpenDA 2013). The design philosophy of OpenDA is to break data assimilation down into a set of building blocks using the principles of object-oriented programming. The OpenDA toolbox consists of two parts: a description of the building blocks with their interfaces, and a library of default implementations of those building blocks. It stems from the merger of two data assimilation initiatives: COSTA (van Velzen and Segers 2010; Altaf et al. 2009; Heemink et al. 2010; Altaf et al. 2011) and DATools (Weerts et al. 2010).

OpenDA is written in Java. The object-oriented nature of Java fits well with the modular design of OpenDA and allows easy extension of its functionality. The system is set up in such a way that end users can set up experiments without any coding or compiling. Although Java was historically considered slow, its performance has improved significantly through many improvements in the Java virtual machine, such as just-in-time compilation and adaptive optimization (Sestoft 2010). In order to reach a performance similar to that of data assimilation algorithms written purely in languages like C and Fortran, some computationally demanding parts are written in C.

Once a model is prepared for OpenDA, it is possible to try out various off-the-shelf data assimilation algorithms without any further programming. It has been shown that the performance is close to that of dedicated implementations (van Velzen and Verlaan 2007), and OpenDA can furthermore propagate the ensemble of models automatically, either serially or concurrently.

4.1 OpenDA components

As illustrated in Fig. 1, OpenDA defines three major building blocks: the method (the data assimilation or calibration algorithm), the observations (the stochastic observer for handling the observations), and the model. The end user can make arbitrary combinations of these building blocks to set up an experiment without requiring any programming. The system is configured using an XML schema in which a user defines options regarding the algorithm type (EnSR, EnKF, etc.) and ensemble size; the variables to assimilate and their grid locations within the model; and the associated uncertainties of the model, forcing data, parameters, and observations. Using the default implementations, observations can be stored in various formats such as NetCDF, CSV, NOOS, or in an SQL database. OpenDA can produce output in various formats such as ASCII, Matlab, Python, and NetCDF, so that the filtering performance can easily be analyzed and useful graphs plotted.

Fig. 1

OpenDA defines three building blocks: method (the data assimilation or calibration algorithms), observations (the stochastic observer for handling the observations), and the model

4.2 Coupling a model to OpenDA

The OpenDA model interface has already been implemented for over 20 dynamic model codes (van Velzen et al. 2012). There are two approaches to link a model to OpenDA. The simplest, and often the best, choice is a so-called black box coupling. In a black box coupling, OpenDA interacts with the model only by writing model input files, running the model executable, and interpreting the model output files. The coupling can be realized without access or changes to the model code. There are situations where the black box coupling is not the best choice, especially when the overhead of restarting the model is high or when the model lacks proper restart functionality. In these situations, it is possible to set up an “in-memory coupling.” The model must then be made available as a software library that implements a specified list of routines, called the OpenDA model interface. OpenDA supports in-memory coupling for models written in various programming languages, e.g., Fortran, C, C#, and Python. The in-memory coupling is the most efficient computationally and potentially unlocks all model data. This allows easy creation of complex interpolation operators and noise processes. The amount of work needed to create an in-memory coupling is very model dependent. It can be small when the model itself is available as a library with functionality to access the model state, e.g., for models that implement the OpenMI interface (Gregersen et al. 2007; Gijsbers et al. 2010). In such an ideal situation, there is no need to have access to the model source code; writing some bridging code is sufficient (Ridler et al. 2014). On the other hand, realizing an in-memory coupling can be a serious amount of work, especially for large legacy simulation codes.

OpenDA contains a black box coupling tool that minimizes the effort needed to create a highly configurable black box coupling, as illustrated in Fig. 2. The user must write wrapper code for reading and writing the input and output files of the model. Depending on the type of file (input, output, or both), the wrapper code must be able to read and/or write the relevant data from/to the model files; an illustrative sketch of such a wrapper is given below. This is the only programming effort required to realize the black box coupling. All other configuration and specification is done using black box configuration files.
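
To give an impression of the kind of wrapper code involved, the sketch below reads and writes a single variable of a NetCDF model restart file using the Python netCDF4 package. It is purely illustrative: OpenDA wrapper code is normally written in Java, and the variable name used here is only a placeholder that would have to be checked against the actual NEMO restart file.

```python
import numpy as np
from netCDF4 import Dataset

def read_field(restart_file, var_name="sshn"):
    """Return one field of a NetCDF restart file as a flat array ("sshn" is a placeholder)."""
    with Dataset(restart_file, "r") as nc:
        return np.array(nc.variables[var_name][:]).ravel()

def write_field(restart_file, values, var_name="sshn"):
    """Write an updated field back into the restart file in place."""
    with Dataset(restart_file, "r+") as nc:
        shape = nc.variables[var_name][:].shape
        nc.variables[var_name][:] = np.asarray(values).reshape(shape)
```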

Fig. 2

The OpenDA black box coupling tool simplifies coupling a model, and arbitrary models can be run in parallel. The user only needs to provide a small amount of model-specific code to read and write the model input and output files in order to couple the model. The parallelization is handled by an interface model that handles all parallel interaction between the data assimilation method and the models

The black box configuration consists of three layers:

  • Wrapper layer: contains the specification of the location of the (template) input files, the names of the executables needed to perform a model run, and the input and output files that have to be processed using the wrapper code.

  • Model layer: contains a list of the variables from the required input and output files. In this layer, it is possible to rename the IDs of the variables in order to ensure uniqueness.

  • StochModel layer: here, the model as presented to the data assimilation method is constructed. The state vector is composed by listing the associated variables and arrays, noise models on the model forcings can be specified, and information is provided on matching model predictions to observations.

The advantage of this approach is that it offers a lot of flexibility without the need for any additional programming. The end user can easily add or remove variables in the part of the state that is updated by the assimilation, and can experiment with various sources of observations or noise models.

4.3 Parallel computing

To reduce the computational time, the model runs in ensemble-based data assimilation methods can be executed in parallel. The setup of OpenDA allows automatic execution of the models in a natural way. Parallelization is handled using a so-called interface model. This is a model that extends the functionality of an arbitrary OpenDA model; in this case, it handles and hides the parallelization of the ensemble members. Other examples of interface models in OpenDA are models for automatic bias correction and Kalman smoothing. These interface models create an interface between the actual model and the data assimilation method. The interface model supports both parallelization of models using threads and running the models on remote machines. It works for arbitrary models since it only uses the same methods of the model interface as the algorithm does, as illustrated by the shape of the puzzle pieces in Fig. 2.
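
Outside OpenDA, the same idea of propagating ensemble members concurrently can be mimicked with standard tooling, as in the sketch below. The model launch script and the per-member directory layout are assumptions for illustration, not part of OpenDA.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_member(member_dir):
    """Run the (black box) model executable in its own working directory."""
    # "./run_member.sh" is a hypothetical launch script prepared per ensemble member.
    return subprocess.run(["./run_member.sh"], cwd=member_dir, check=True)

member_dirs = [f"work/member_{i:03d}" for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:   # e.g., 8 model runs at a time
    list(pool.map(run_member, member_dirs))
```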

4.4 Localization in OpenDA

The current release of OpenDA (2.1) supports localization using the Schur product on the Kalman gain matrix. The user provides a function that takes the meta-information of the observations as input and produces the weights \(\beta_{xy}\). For numerical models with a fixed, regular computational grid, such an implementation is not difficult. However, OpenDA has been coupled to many simulation suites, e.g., for modeling water quality, rivers, and groundwater. The computational grids are often irregular, and the state vector contains various quantities depending on the user settings. Writing a generic function for determining the localization weights, based on some parameters, is an elaborate task for these models. Moreover, some of the end users are novice data assimilation users and require a more black box approach.

The method suggested by Zhang and Oliver (2011) is part of OpenDA and is conceptually very suitable: it does not need any configuration except for the bootstrap size \(N_{B}\), and it adds at least some necessary form of localization to all models in OpenDA. OpenDA nevertheless still provides the option of a custom localization function when needed.

5 NEMO ocean model

The model setup used in this study is based on an idealized configuration of the NEMO primitive equation ocean model (Cosme et al. 2010). The domain is a square box (between 25° and 45° N) of 5000-m-deep, flat-bottom ocean at mid latitudes, referred to as the SQB configuration. The model domain is closed with lateral boundaries. A double-gyre circulation is created by a constant zonal wind forcing blowing westward in the northern and southern parts of the basin and eastward in the middle part of the basin. Figure 3 shows a snapshot of the model sea surface height (SSH). The flow is dominated by chaotic mesoscale dynamics, with the largest eddies being about 100 km wide (Beckers et al. 2012), corresponding to velocities of about 1 m/s and dynamic height differences of about 1 m. These properties are quite similar in shape and magnitude to those of the Gulf Stream (North Atlantic).

Fig. 3

A snapshot of SSH (m) for the SQB model configuration (April 23, 2009)

The primitive equations of the NEMO ocean model are discretized on an Arakawa C grid, with a horizontal resolution of \(1/4^{\circ} \times 1/4^{\circ} \cos (\phi )\) and 11 z-coordinate levels along the vertical. The model uses a “free surface” formulation, with an additional term in the momentum equation to damp the fastest external gravity waves, and a bi-harmonic diffusion operator. The model time integration is based on a leap-frog scheme and an Asselin filter (Cosme et al. 2010).

5.1 Assimilation setup

For the present SQB configuration, the time step is chosen as 15 min. The model is spun up for a period of 40 years. The model calendar is idealized to 30 days per month. The model state vector is multivariate, containing two time instances of 17 variables. The two time instances are due to the leap-frog scheme. The user can select which variables of the state are updated by the data assimilation algorithm by selecting them in the OpenDA configuration file. In order to keep the leap-frog scheme intact, the user should always select both time instances of a variable to be updated.

The ensemble Kalman filter (EnKF) is used for all the experiments. The assimilation is performed over 1 year (a 360-day calendar) with an analysis frequency of 2 days. The model output of year 75 is taken as the truth. Observations of sea surface height (SSH) are generated from the truth by adding spatially uncorrelated Gaussian noise with a standard deviation of 6 cm. The observation grid is generated from the ENVISAT and Jason-1 flight parameters: 35 days and 1002 passes per cycle for ENVISAT, and 10 days and 254 passes per cycle for Jason-1. Figure 4 shows the SSH observations for one assimilation period. In our experiments, we update the sea surface height variables of the state vector.

Fig. 4

SSH Observations for assimilation period September 18, 2009

The pass numbers at the beginning of the year 2009 are taken for both satellites. Since the observation grid is different from the model grid, bilinear interpolation is used to map SSH values from the model grid to the observation grid when generating the SSH observations (Yan et al. 2014); a sketch of this procedure is given below. Figure 5 presents the root mean square error (RMSE) in SSH (m) between the truth and the free model run at intervals of 2 days (the analysis steps). The errors are in the range between 0.15 and 0.30 m.
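
The generation of the synthetic track observations can be sketched as follows; `RegularGridInterpolator` from SciPy performs the bilinear interpolation on a regular grid, and the array names and the fixed random seed are illustrative.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def synthetic_ssh_obs(lat, lon, ssh_true, obs_lat, obs_lon, sigma=0.06, rng=None):
    """Interpolate the true SSH field (shape len(lat) x len(lon)) to the track
    points and add uncorrelated N(0, sigma^2) noise (sigma = 6 cm here)."""
    rng = rng or np.random.default_rng(0)
    interp = RegularGridInterpolator((lat, lon), ssh_true)   # bilinear on a 2-D grid
    points = np.column_stack([obs_lat, obs_lon])
    return interp(points) + rng.normal(0.0, sigma, size=len(obs_lat))
```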

Fig. 5

Root mean square errors (RMSE) in SSH (m) between the free model run and truth taken at every 2 days period for the year 2009

The initial ensemble of 100 members is generated from a 10-year interval (years 41–50) of a free model simulation with outputs every 30 days; the last 100 outputs are used to initialize the ensemble. The 10-year interval is selected far from the year of the true state for a better evaluation of the assimilation performance. From these 100 ensemble members, 4 sets of different sizes are picked (15, 30, 50, 100). This is helpful in evaluating the results with respect to ensemble size and localization.
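
The smaller ensembles can, for example, be drawn from the 100 stored states as sketched below; the paper does not state how the subsets of 15, 30, and 50 members were selected, so the random subsampling here is only an assumption.

```python
import numpy as np

def pick_subsets(X100, sizes=(15, 30, 50, 100), rng=None):
    """X100 : (m_x, 100) matrix whose columns are the 100 stored monthly states.
    Returns a dictionary of randomly subsampled initial ensembles per size."""
    rng = rng or np.random.default_rng(0)
    return {n: X100[:, rng.choice(X100.shape[1], size=n, replace=False)]
            for n in sizes}
```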

The computational time is dominated by the NEMO model runs. Depending on the number of ensemble members, only 1–3 % of the total time is consumed by the classical EnKF implementation in OpenDA when 8 NEMO model runs are executed in parallel.

6 Results and discussion

As a first step, a set of four assimilation experiments is performed with different ensemble sizes (15, 30, 50, 100). Figure 6 presents snapshots of SSH for an assimilation experiment with an ensemble of 100 members. The comparison of the forecast and analysis with the truth and the free run clearly shows that the assimilation has improved the model results significantly and that the analysis at each step brings the model closer to the truth. Keeping in mind that we are only assimilating and updating SSH, the forecast results look reasonable as well.

Fig. 6

Snapshots of sea surface height (m) at analysis step 1, 50, and 90 for top: forecast using EnKF, second row: analysis using EnKF, third row: truth, and bottom: free run without data assimilation

To examine the filter performance with fewer ensemble members, the RMSE in SSH over the whole model domain (at each analysis step) is compared for the different ensemble sizes. These results are plotted in Fig. 7 for 15 and 100 ensemble members. Decreasing the number of ensemble members has a negative effect on the performance of the EnKF: the RMSE is lower for the 100-member case, especially in the later part of the assimilation. The results also show that a small ensemble size still gives reasonable assimilation results. Note that the data assimilation has improved the model results significantly: the range of errors without data assimilation is between 0.15 and 0.30 m, which is reduced to between 0.05 and 0.15 m with data assimilation (Fig. 7).

Fig. 7

RMSE in SSH (m) between EnKF and truth at analysis steps using ensemble members of sizes 15 and 100

So far, we have examined the EnKF performance based on the RMSE, which is the most commonly used measure to evaluate filter performance. A rank histogram is another common diagnostic to measure ensemble reliability, obtained by repeatedly tallying the rank of the verifying observations within the ensemble at each assimilation step (Hamill 2000); a reliable ensemble produces a flat rank histogram, as sketched below. Figure 8 presents the rank histogram of the ensemble forecasts at assimilation steps 20, 70, and 140 for ensemble sizes of 15 and 100. U-shaped histograms are observed for both ensemble sizes, which can be an indication that the ensembles suffer from conditional biases or a lack of variability (Hamill 2000).
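
A rank histogram can be computed as in the sketch below: for each verifying observation, the number of ensemble-predicted values falling below it gives its rank, and the ranks are tallied over all observations (and, if desired, over assimilation steps). Variable names are illustrative.

```python
import numpy as np

def rank_histogram(obs, ens_pred):
    """Talagrand/rank histogram (Hamill 2000).

    obs      : (n_obs,) verifying observations
    ens_pred : (n_obs, n) ensemble-predicted values at the observation locations
    Returns counts over the n+1 possible ranks; a flat histogram indicates reliability.
    """
    n = ens_pred.shape[1]
    ranks = (ens_pred < obs[:, None]).sum(axis=1)   # rank of each observation in its ensemble
    return np.bincount(ranks, minlength=n + 1)
```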

Fig. 8

Rank histograms of the ensemble forecasts at steps 20, 70, and 140 (from left to right) using the EnKF. The top row corresponds to an ensemble size of 15 and the bottom row to an ensemble size of 100. The U shape can be an indication that the ensembles suffer from conditional biases or a lack of variability

The EnKF showed reasonable improvements in the assimilation results and demonstrated that the black box coupling between the NEMO model and OpenDA works quite well. The next step is to apply and test the automatic localization strategy explained in Section 3 within the NEMO-OpenDA framework using the same experimental setup. We have selected a bootstrap size \(N_{B} = 50\), which is approximately the average ensemble size in our experiments.

Figure 9 shows the errors in the SSH snapshots taken at the 20th, 70th, and 140th analysis steps with an ensemble of 100 members. These error plots are shown for the EnKF, the EnKF with localization, and the free run without data assimilation. It is evident from these plots that the localization has improved the EnKF results. Figure 10 presents the differences between the truth and the localized EnKF for different ensemble sizes (15, 30, and 100), again at the 20th, 70th, and 140th analysis steps. Increasing the ensemble size clearly improves the results, as is evident from Fig. 10. As an example, consider the first column of Fig. 10 (analysis step 20): the scale of the differences is notably reduced and the large differences in the upper part of the domain have almost vanished as the ensemble size increases from 15 to 100. This holds for all three analysis steps shown here.

Fig. 9

Snapshots of sea surface height errors from truth at analysis step 20, 70, and 140 for top: using EnKF, middle: EnKF with localization, and bottom: free run without data assimilation

Fig. 10

Snapshots of errors in SSH from truth at analysis step 20, 70, and 140 using Localized EnKF with top: ensemble size: 15, middle: ensemble size: 30, and bottom: ensemble size: 100

To see the overall improvement at the different analysis steps, Fig. 11 presents the RMSE of SSH over the whole model domain with and without localization for the 100-member case. The RMSE clearly demonstrates that the filter performance has improved with the introduction of localization. From the experimental results, we see that the automatic localization has a positive impact on the performance of the filter. To get some insight into what the automatic localization does, we have plotted the localized and non-localized gain matrices and the correction factor ρ for the SSH correction. Figure 12 shows these plots for an arbitrary observation at analysis step 60, created with ensemble sizes of 15 and 100. Note that we cannot directly compare the gains of the two ensemble sizes since they are created with different ensembles.

Fig. 11

RMSE in SSH (m) between EnKF and truth at analysis steps with and without auto localization using ensemble of 100 members

Fig. 12

Illustration of the impact of automatic localization on a single row of the K matrix for SSH for various ensemble sizes. The top row is created with an ensemble size of 15 and the lower row with an ensemble size of 100. The K matrices are generated for the same observation at analysis step 60. The first column shows the localized K matrix, the second column the K matrix without localization, and the last column the automatically computed correction factors ρ. The correction factors are not distance based; they are a measure of the uncertainty in the computed elements of K. This is different from the factors determined using classical localization techniques. Nevertheless, the method is well capable of removing structures in the K matrix that are expected to be spurious, sharpening the important structures in the K matrix

We can clearly see that the localization enhances the structure of the gain matrix by removing some of the small-scale structures that are likely created by spurious correlations. At the same time, we see that the non-localized gain matrix created with 15 members shows more spurious correlations than the gain created with 100 members. The correction factors are very different from what one would expect from classical localization techniques: the corrections are not location based but are based on the uncertainty of the gain elements. As a result, large correction factors also appear at locations where the gain has a small value with high probability.

Finally, the RMSE is computed for both the forecast (before assimilation) and the analysis (after assimilation). Both forecast and analysis RMSE are plotted to produce sawtooth diagrams. Figure 13 presents the RMSE in SSH (m) between the truth and the free model run, together with the RMSE sawtooth pattern of the forecast-analysis sequence of the localized EnKF as we progress in time, using 100 ensemble members. It is evident that not only the analysis but also the forecast results are improved by a significant margin. An important point to note here is that we are only assimilating SSH observations, while the model state consists of 17 variables including temperature, salinity, etc. For a realistic case, we expect different sources of observations to be part of the assimilation process, which will further improve the analysis and forecast results.

Fig. 13

RMSE in SSH (m) between the truth and the free model run (red line), RMSE sawtooth pattern of the analysis with the localized EnKF using 100 ensemble members (blue line), and of the forecast with the localized EnKF using 100 ensemble members (red line)

7 Conclusions and future work

We have created an OpenDA-NEMO ocean data assimilation framework using the OpenDA black box coupling. With limited programming effort, we have managed to read and write the NEMO NetCDF restart files and configuration file. With the present framework, a set of twin experiments was performed for the NEMO model, in which we assimilate satellite SSH values using the EnKF. We have used the automatic parallelization functionality of OpenDA to run the ensemble members in parallel.

From the results of these experiments, we see that a run without data assimilation has an RMSE in SSH that varies around 23 cm, which is quite large since the SSH varies between −1 and 1 m. The EnKF, assimilating SSH, improves the model predictions over the whole domain. Increasing the ensemble size does improve the performance of the EnKF, but only to some extent. The experiments were repeated using the automatic localization technique that is now part of OpenDA. We used the automatic localization technique because distance-based localization was not yet available for the black box coupling; it will, however, be present in a future version of OpenDA. The automatic localization method does not determine weights based on distances but on the uncertainty of the elements in the gain matrix. The introduction of automatic localization significantly improved the performance of the EnKF for all ensemble sizes. A visual inspection shows that this method yields very different localization weights from those we are used to seeing in distance-based localization. The method is, however, capable of keeping the important structures in the gain matrix intact while reducing artifacts that are likely caused by spurious correlations. The performance of the method justifies further research. In future work, we should compare the automatic localization technique with distance-based localization and find out which method performs best for various types of observations and situations.

The automatic localization technique is very easy to use but increases the computational time of the analysis significantly. Using a bootstrap size of 50, the computational time of the analysis increased by a factor of 10 in our experiments. This makes the method feasible only for a moderate number of observations. We have to look for ways to reduce the computational time of the bootstrapping technique. Possible solutions are to perform the bootstrapping in parallel and to combine the automatic localization method with a local analysis strategy, such that the automatic localization only needs to be computed for a limited computational domain.