1 Introduction and Motivations

Assimilation of observations into models is a well-established practice in the meteorological community. In view of the number of model variables (on the order of \(10^7\) or \(10^8\)) and of observations (on the order of \(10^6\)) in use for operational models, a variety of approaches have been proposed for reducing the complexity of an assimilation method to a computationally affordable version while retaining its advantages. Ensemble approaches and reduced order models are the most significant approximations. Other approaches take full advantage of existing Partial Differential Equation (PDE)-based solvers built on spatial Domain Decomposition (DD) methods, where the DD solver is suitably modified to also handle the adjoint system. A different approach is the combination of DD methods in space and DA, where a spatially domain-decomposed uncertainty quantification approach performs DA at the local level by using Monte Carlo sampling [1, 2, 27, 54]. The Parallel Data Assimilation Framework [41] implements parallel ensemble-based Kalman filter algorithms coupled with the PDE-model solver.

The above mentioned methods reduce the spatial dimensionality of the predictive model, and the resulting reduced order model is then solved in time via numerical integration, typically with the same time integrator and time step employed for the high-fidelity model, leading to a strong synchronization in time. In recent decades, parallel-in-time methods have been investigated for reducing the temporal dimensionality of evolutionary problems. Since Nievergelt, who in 1964 proposed the first time decomposition algorithm for the parallel solution of evolutionary ordinary differential equations, and Hackbusch, who in 1984 noted that relaxation operators in multigrid can be employed on multiple time steps simultaneously, the methods of time-parallel time integration have been extensively expanded and several relevant works can be found in the literature. An extensive and updated literature list is collected at the website [47], gathering information about people, methods and software in the field of parallel-in-time integration methods. Among them, we mention the Parallel Full Approximation Scheme in Space and Time (PFASST), introduced in [14]. PFASST is based on a simultaneous approach reducing the optimization overhead by integrating the PDE-based model directly into the optimization process, thus solving the PDE, the adjoint equations and the optimization problem simultaneously. Recently, a non-intrusive framework for integrating existing unsteady partial differential equation (PDE) solvers into a parallel-in-time simultaneous optimization algorithm, using PFASST, was provided in [22]. Finally, we cite the parallel PDE solvers based on Schwarz preconditioners in space-time proposed in [19, 29, 53].

We propose the design of an innovative mathematical model, and the development and analysis of the related numerical algorithms, based on the simultaneous introduction of an overlapping space-time decomposition of both the PDEs governing the physical model and the DA model. The core of our approach is that the DA model acts as a coarse/predictor operator for the local PDE models, providing the background values as initial conditions of the local PDE models. Moreover, in contrast to other decomposition-in-time approaches, in our approach the local solvers (i.e. both the coarse and the fine solvers) run concurrently from the beginning. As a consequence, the resulting algorithm only requires the exchange of boundary conditions between adjacent sub-domains. It is worth mentioning that the proposed method belongs to the so-called reduced-space optimization techniques, in contrast to full-space approaches such as the PFASST method, reducing the runtime of the forward and backward integration time loops. As a consequence, we could combine the proposed approach with the PFASST algorithm. Indeed, PFASST could be concurrently employed as the local solver of each reduced-space PDE-constrained optimization subproblem, exposing even more temporal parallelism.

The article first describes the general DA problem setup used to introduce the domain decomposition approach; it then focuses on the parallel algorithm solving the reduced order model, analysing the impact of the space and time decomposition on its performance; finally, it provides the analysis of the algorithm's scalability. The results presented here should be intended as the starting point of the software development, supporting decisions about computer architecture, future estimates of the problem size (e.g., the resolution of the model and the number of observations to be assimilated), and the performance and parallel scalability of the algorithms.

In conclusion, specific contributions of this work include:

  • a novel decomposition approach in space-time leading to a reduced order model of the coupled PDE-based 4DVar DA problem.

  • strategies for computing the ‘kernels’ of the resulting Regularized Nonlinear Least Square computational problem.

  • a priori performance analysis that enables a suitable implementation of the algorithm in advanced computing environments.

The article is organized as follows. Sect. 2 gives a brief introduction to the Data Assimilation framework, where we follow the discretize-then-optimize approach. The main result is the 4DVar functional decomposition, which is given in Sect. 3. In Sect. 4 we review the whole parallel algorithm; its convergence analysis is given in Sect. 5, while its performance analysis is discussed in Sect. 6 on the Shallow Water Equations on the sphere. The number of state variables in the model and the number of observations in an assimilation cycle, as well as numerical parameters such as the discretization steps in time and in space, are defined on the basis of the discretization grid of the data available at the Ocean Synthesis/Reanalysis Directory repository of Hamburg University (see [15]). Scalability prediction of the case study based on the Shallow Water Equations is performed in Sect. 7. Finally, conclusions are provided in the last section.

2 The Data Assimilation framework

We start with the general DA problem setup; then, for simplicity, in order to describe the domain decomposition approach, we will consider a more convenient setup.

Let \(\mathcal {M}^{\varDelta \times \varOmega }\) denote a forecast model described by nonlinear Navier-Stokes equations, where \(\varDelta \subset \Re \) is the time interval and \(\varOmega \subset \Re ^N\) is the spatial domain. If \(t \in \varDelta \) denotes the time variable and \(x \in \varOmega \) the spatial variable, let

$$\begin{aligned} u^b(t,x) : \, \varDelta \times \varOmega \mapsto \Re \end{aligned}$$

be the function, which we assume belongs to the Hilbert space \(\mathcal {K}(\varDelta \times \varOmega )\) equipped with the standard Euclidean norm, representing the solution of \(\mathcal {M}^{\varDelta \times \varOmega }\). Following [8], we assume that \(\mathcal {M}^{\varDelta \times \varOmega }\) is symbolically described as the following initial value problem:

$$\begin{aligned} \left\{ \begin{array}{ll} u^b(t,x)= \mathcal {M}^{\varDelta \times \varOmega }[u^b(t_0,x)], &{} \forall \, (t,x) \in \varDelta \times \varOmega , \\ u^b(t_0, x)=u_0^b(x), &{} t_0\in \varDelta \, , \,x \in \varOmega \,. \\ \end{array} \right. \end{aligned}$$
(1)

The function \(u^b(t,x)\) is called the background state in \(\varDelta \times \varOmega \). The function \(u^b_0(x)\) is the initial condition of \(\mathcal {M}^{\varDelta \times \varOmega }\), i.e. the value of the background state on \(t_0\times \varOmega \). Let:

$$\begin{aligned} v(\tau ,y)=\mathcal {H}(u(t,x)), \quad (t,x) \in \varDelta \times \varOmega , \quad (\tau ,y) \in \varDelta ' \times \varOmega ' \end{aligned}$$
(2)

where \(\varDelta '\subset \varDelta \) is the observation time interval and \(\varOmega ' \subset \Re ^{nobs}\), with \(\varOmega '\subset \varOmega \), is the observation spatial domain. Finally,

$$\begin{aligned} \mathcal {H}: \mathcal {K}(\varDelta \times \varOmega ) \mapsto \mathcal {K}(\varDelta '\times \varOmega ') \end{aligned}$$

denotes the observation mapping, where \( \mathcal {H}\) is a nonlinear operator which includes transformations and grid interpolations. According to the practical applications of model-based assimilation of observations, we will use the following definition of a Data Assimilation (DA) problem associated with \(\mathcal {M}^{\varDelta \times \varOmega }\).

Definition 1

(The DA problem set-up) Let

  • \(\{t_k\}_{k=0,M-1}\), where \(t_k=t_0+k \varDelta t\), be a discretization of \(\varDelta \), such that \(\varDelta _M:=[t_0,t_{M-1}] \subseteq \varDelta .\)

  • \(D_{K}(\varOmega ):=\{x_j\}_{j=1,K}\in \Re ^{K}\), be a discretization of \(\varOmega \), such that \(D_{K}(\varOmega ) \subseteq \varOmega \quad .\)

  • \(\varDelta _M\times \varOmega _{K}=\{\mathbf {z}_{ji}:=(t_j,x_i)\}_{i=1,K;j=1,M}\);

  • \(\mathbf {u}_0^{b} :=\{u_0^{b}(x_j)\}_{j=1,K} \equiv \{ u^{b}(t_0,x_j) \}_{j=1,K}\in \Re ^{K}\) be the discretization of the initial value in (1);

  • \(\mathbf {u}_k^b:= \{ u^b(t_k,x_j) \}_{j=1,K}\in \Re ^K\) be the numerical solution of (1) at \(t_k\);

  • \(\mathbf {u}^b= \{\mathbf {u}_k^b\}_{k=0,M-1}\);

  • \(nobs<<K\);

  • \(\varDelta '_M=[\tau _0, \tau _{M-1}]\subseteq \varDelta _M\);

  • \(D'_{nobs}(\varOmega '):=\{x_j\}_{j=1,nobs}\in \Re ^{nobs}\), be a discretization of \(\varOmega '\), such that

    $$\begin{aligned} D_{nobs}(\varOmega ') \subseteq \varOmega '\quad . \end{aligned}$$
  • \(\mathbf {v}_k:=\{v(\tau _k,x_j)\}_{j=1,nobs}\in \Re ^{nobs}\) be the values of the observations on \(x_j\) at \(\tau _k\);

  • \(\mathbf {v}= \{\mathbf {v}_k\}_{k=0,M-1}\in \Re ^{M\cdot nobs}\);

  • \(\{\mathbf {H}^{(k)}\}_{k=0,M-1}\), the Tangent Linear Model (TLM) of \(\mathcal {H}(u(t_k,x))\) at time \(t_k\);

  • \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}}\) be a discretization of \(\mathcal {M}^{\varDelta \times \varOmega }\).

  • \(\mathbf {M}^{0,M-1}\) is the TLM of \(\mathcal {M}^{\varDelta \times \varOmega }\), i.e. it is the first order linearization of \(\mathcal {M}^{\varDelta \times \varOmega }\) in \(\varDelta _M \times \varOmega _{K}\) [24];

  • \(\mathbf {M}^T\) is the Adjoint Model (ADM) of \(\mathbf {M}^{0,M-1}\) [20].

The aim of DA is to produce the optimal combination of the background and the observations throughout the assimilation window \(\varDelta '_M\), i.e. to find an optimal trade-off between the estimate of the system state \(\mathbf {u}^b\) and the observations \(\mathbf {v}\). The best estimate that optimally fuses all this information is called the analysis, and it is denoted as \(\mathbf {u}^{DA}\). It is then used as the initial condition for the next forecast.

Definition 2

(The 4DVar DA problem: a regularized nonlinear least square problem (RNL-LS)) Given the DA problem set-up, the 4DVar DA problem consists in computing the vector \(\mathbf {u}^{DA}\in \Re ^{K}\) such that

$$\begin{aligned} \mathbf {u}^{DA} = \arg \min _{\mathbf {u} \in \Re ^{K}} J(\mathbf {u}) \end{aligned}$$
(5)

with

$$\begin{aligned} J(\mathbf {u})= \Vert \mathbf {u}- \mathbf {u}^{b}_0\Vert _{\mathbf{B}^{-1}}^{2} + \lambda \sum _{k=0}^{M-1} \Vert \mathbf {H}^{(k)}(\mathbf {M}^{\varDelta _M \times \varOmega _{K}} ( \mathbf {u}))-\mathbf {v}_k \Vert _{\mathbf{R}_k^{-1}}^{2} \end{aligned}$$
(6)

where \(\lambda >0\) is the regularization parameter, \(\mathbf{B}\) and \(\mathbf{R}_k\) (for \(k= 0,\ldots ,M-1\)) are the covariance matrices of the errors on the background and on the observations, respectively, while \(\Vert \cdot \Vert _{\mathbf{B}^{-1}}\) and \(\Vert \cdot \Vert _{\mathbf{R}_k^{-1}}\) denote the corresponding weighted Euclidean norms.

The first term of (6) quantifies the departure of the solution \(\mathbf {u}^{DA}\) from the background state \(\mathbf {u}^{b}\). The second term measures the sum of the mismatches between the new trajectory and the observations \(\mathbf {v}_k\), for each time \(t_k\) in the assimilation window. The weighting matrices \(\mathbf{B}\) and \(\mathbf{R}_k\) need to be predefined, and their quality influences the accuracy of the resulting analysis [3].
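To make the structure of (6) concrete, the following minimal numpy sketch evaluates the functional for a candidate initial state. All names (the trajectory callable `model_traj`, the operator lists `H_list` and `R_inv_list`, etc.) are hypothetical stand-ins for the quantities of Definition 1, not part of any specific implementation.

```python
import numpy as np

def fourdvar_cost(u, u_b0, B_inv, model_traj, H_list, R_inv_list, v_list, lam=1.0):
    """Evaluate the 4DVar functional (6) for a candidate initial state u (a sketch).

    u, u_b0    : candidate and background initial states, arrays of shape (K,)
    B_inv      : inverse background-error covariance, shape (K, K)
    model_traj : callable u -> [u_0, ..., u_{M-1}], the discrete model trajectory
    H_list     : linearized observation operators H^(k), each of shape (nobs, K)
    R_inv_list : inverse observation-error covariances R_k^{-1}, each (nobs, nobs)
    v_list     : observations v_k, each of shape (nobs,)
    lam        : regularization parameter lambda
    """
    d0 = u - u_b0
    cost = d0 @ B_inv @ d0                     # background term ||u - u_0^b||^2_{B^-1}
    for u_k, H_k, R_inv_k, v_k in zip(model_traj(u), H_list, R_inv_list, v_list):
        d_k = H_k @ u_k - v_k                  # observation misfit at time t_k
        cost += lam * (d_k @ R_inv_k @ d_k)    # term ||H^(k) M(u) - v_k||^2_{R_k^-1}
    return cost
```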

As K generally exceeds \(10^6\), this problem can be considered a large-scale nonlinear least squares problem. We provide a mathematical formulation of a domain decomposition approach, which starts from the decomposition of the whole domain \(\varDelta \times \varOmega \), namely of both the spatial and the temporal domain; it uses a partitioning of the solution and a modified functional describing the RNL-LS problem on each subdomain of the decomposition. Solution continuity equations across interval boundaries are added as constraints of the assimilation functional. We will first introduce the domain decomposition of \(\varDelta \times \varOmega \); then, restriction and extension operators will be defined on functions given on \(\varDelta \times \varOmega \). These definitions will subsequently be generalized to \(\varDelta _M \times \varOmega _{K}\).

3 The Space-Time Decomposition of the Continuous 4DVar DA Model

In this section we give a precise mathematical setting for the space and function decomposition; then we state some notation used later. In particular, we first introduce the function and domain decomposition; then, by using restriction and extension operators, we associate to the domain decomposition a functional decomposition. Finally, we prove the following result: the minimum of the global functional, defined on the entire domain, can be obtained by collecting the minima of the local functionals.

For simplicity we assume that the spatial and temporal domains of the observations are the same as those of the background state, i.e. \(\varDelta '=\varDelta \) and \(\varOmega '=\varOmega \); furthermore we assume that \(t_k=\tau _k\).

Definition 3

(Domain Decomposition) Let \(P\in \mathbf {N}\) and \(Q\in \mathbf {N}\) be fixed. The set of bounded Lipschitz domains \({\varOmega _i}\), overlapping sub-domains of \(\varOmega \):

$$\begin{aligned} DD(\varOmega )=\left\{ \varOmega _i \right\} _{i=1,P} \end{aligned}$$
(7)

is called a decomposition of \(\varOmega \), if

$$\begin{aligned} \bigcup _{i=1}^{P} \varOmega _i =\varOmega \end{aligned}$$
(8)

with

$$\begin{aligned} \varOmega _{jh}:=\varOmega _j \cap \varOmega _h\ne \emptyset \end{aligned}$$

when two subdomains are adjacent. Similarly, the set of overlapping sub-domains of \(\varDelta \):

$$\begin{aligned} DD(\varDelta )=\left\{ \varDelta _j \right\} _{j=1,Q} \end{aligned}$$
(9)

is a decomposition of \(\varDelta \), if

$$\begin{aligned} \bigcup _{j=1}^{Q} \varDelta _j =\varDelta \end{aligned}$$
(10)

with

$$\begin{aligned} \varDelta _{ik}:=\varDelta _i\cap \varDelta _k \ne \emptyset \end{aligned}$$

when two subdomains are adjacent. We call domain decomposition of \(\varDelta \times \varOmega \), denoted as \(DD(\varDelta \times \varOmega )\), the set of \(P\times Q\) overlapping subdomains of \(\varDelta \times \varOmega \):

$$\begin{aligned} DD(\varDelta \times \varOmega )= \left\{ \varDelta _j \times \varOmega _i \right\} _{j=1,Q;\ i=1,P}\,. \end{aligned}$$
(11)

From (11) it follows that

$$\begin{aligned} \varDelta \times \varOmega = \cup \varDelta _j \times \cup \varOmega _i = \cup (\varDelta _j \times \varOmega _i) . \end{aligned}$$

Associated to the decomposition (11) we define the Restriction Operator of functions belonging to \(\mathcal {K}(\varDelta \times \varOmega )\):

Definition 4

(Restriction of a function) Let

$$\begin{aligned} RO_{ji}: f \in \mathcal {K}(\varDelta \times \varOmega ) \mapsto RO_{ji}[f]\in \mathcal {K}(\varDelta _j \times \varOmega _i) \end{aligned}$$

be the Restriction Operator (RO) of f in \(DD(\varDelta \times \varOmega )\) as in (11), such that:

$$\begin{aligned} RO_{ji}[f(t,x)]\equiv \left\{ \begin{array}{ll} f(t,x), &{} \quad \forall \,\,(t,x) \in \varDelta _j \times \varOmega _i \\ \frac{1}{2}f(t,x), &{} \forall \,(t,x) \,s.t.\,x \in \varOmega _i, \quad \exists \, \bar{k}\ne j: t\in \varDelta _j \cap \varDelta _{\bar{k}},\\ \frac{1}{2}f(t,x), &{} \forall \,(t,x)\,s.t.\,t \in \varDelta _j, \quad \exists \, \bar{h}\ne i: x\in \varOmega _i \cap \varOmega _{\bar{h}},\\ \frac{1}{4}f(t,x), &{} \exists \,(\bar{k},\bar{h}) \ne (j,i): (t,x) \in (\varDelta _j \cap \varDelta _{\bar{k}}) \times (\varOmega _i \cap \varOmega _{\bar{h}}),\\ \end{array} \right. \end{aligned}$$

We pose:

$$\begin{aligned} f^{RO}_{ji}(t,x) \equiv RO_{ji}[f(t,x)] . \end{aligned}$$

For simplicity, if \(i \equiv j\), we denote \(RO_{ii}=RO_{i}\).
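On a discrete grid, the weights prescribed by Definition 4 can be assembled as in the following sketch. The function name and arguments (global index arrays of the subdomain and of the overlap regions with adjacent subdomains) are hypothetical, and the sketch assumes, as in Definition 4, that each overlap region is shared by exactly two adjacent subdomains per direction.

```python
import numpy as np

def restriction_weights(t_idx, x_idx, t_overlap, x_overlap):
    """Weights applied by RO_ji on the grid points of Delta_j x Omega_i (Definition 4):
    1 in the interior, 1/2 on a purely temporal or purely spatial overlap,
    1/4 where the subdomain overlaps its neighbours in both directions."""
    wt = np.where(np.isin(t_idx, t_overlap), 0.5, 1.0)   # temporal overlap factor
    wx = np.where(np.isin(x_idx, x_overlap), 0.5, 1.0)   # spatial overlap factor
    return np.outer(wt, wx)                              # combined weight per grid node

# f_ji = restriction_weights(t_idx, x_idx, t_ov, x_ov) * f[np.ix_(t_idx, x_idx)]
```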

In line with this, given a set of \(Q\times P\) functions \(g_{ji}\), \(j=1,Q\), \(i=1,P\) each belonging to \(\mathcal {K}(\varDelta _j\times \varOmega _i)\), we define the Extension Operator of \(g_{ji}\):

Definition 5

(Extension of a function) Let

$$\begin{aligned} EO: g_{ji} \in \mathcal {K}(\varDelta _j\times \varOmega _i) \mapsto EO[g_{ji}] \in \mathcal {K}(\varDelta \times \varOmega ) \end{aligned}$$

be the Extension Operator (EO) of \(g_{ji}\) in \(DD(\varDelta \times \varOmega )\) as in (11), such that:

$$\begin{aligned} EO[g_{ji}(t,x)]= \left\{ \begin{array}{ll} g_{ji}(t,x) &{} \forall \,\, (t,x)\in \varDelta _j \times \varOmega _i \\ 0 &{} elsewhere \end{array} \right. \end{aligned}$$

We pose:

$$\begin{aligned} g^{EO}_{ji}(t,x) \equiv EO[g_{ji}(t,x)]. \end{aligned}$$

For any function \(u\in \mathcal {K}(\varDelta \times \varOmega )\), associated to the decomposition (8), it holds that

$$\begin{aligned} u(t,x)= \sum _{i=1,P; j=1,Q}EO \left[ u^{RO}_{ji}(t,x) \right] . \end{aligned}$$
(12)

Given \(P\times Q\) functions \(u_{ji}(t,x) \in \mathcal {K}(\varDelta _j\times \varOmega _i)\), the summation

$$\begin{aligned} \sum _{i=1,P;j=1,Q}u^{EO}_{ji}(t,x) \end{aligned}$$
(13)

defines a function \(u \in \mathcal {K}(\varDelta \times \varOmega )\) such that:

$$\begin{aligned} RO_{ji}[u(t,x)]=RO_{ji} \left[ \sum _{i=1,P;j=1,Q}u^{EO}_{ji}(t,x) \right] = u_{ji}(t,x). \end{aligned}$$
(14)
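Identity (12) can be checked numerically: summing the zero-padded extensions of the weighted restrictions over all subdomains recovers the original field. The 2×2 decomposition below, with one overlapping grid line per direction, and all sizes are purely illustrative.

```python
import numpy as np

M_grid, K_grid = 8, 10
u = np.random.rand(M_grid, K_grid)            # a field on Delta_M x Omega_K
t_subs = [np.arange(0, 5), np.arange(4, 8)]   # Delta_1, Delta_2 (overlap at t-index 4)
x_subs = [np.arange(0, 6), np.arange(5, 10)]  # Omega_1, Omega_2 (overlap at x-index 5)

recon = np.zeros_like(u)                      # accumulates the extensions EO[u^RO_ji]
for t_idx in t_subs:
    for x_idx in x_subs:
        wt = np.where(np.isin(t_idx, [4]), 0.5, 1.0)   # 1/2 on the temporal overlap
        wx = np.where(np.isin(x_idx, [5]), 0.5, 1.0)   # 1/2 on the spatial overlap
        w = np.outer(wt, wx)                           # 1/4 where both overlap
        recon[np.ix_(t_idx, x_idx)] += w * u[np.ix_(t_idx, x_idx)]

np.testing.assert_allclose(recon, u)          # identity (12) holds
```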

The main outcome of this framework is the definition of the operator \(RO_{ji}\) for the 4DVar functional defined in (6). This definition originates from the definition of the restriction operator of \(\mathcal {M}^{\varDelta \times \varOmega }\) in (1), given as follows.

Definition 6

(Reduction of \(\mathcal {M}^{\varDelta \times \varOmega }\)) If \(\mathcal {M}^{\varDelta \times \varOmega }\) is defined in (1), we introduce the model \(\mathcal {M}^{\varDelta _j \times \varOmega _i}\) to be the Reduction of \(\mathcal {M}^{\varDelta \times \varOmega }\):

$$\begin{aligned} RO_{ji}: \mathcal {M}^{\varDelta \times \varOmega }(t,x)[u(t_0,x)] \mapsto RO_{ji}[\mathcal {M}^{\varDelta \times \varOmega }[u(t_0,x)]] \end{aligned}$$

defined in \(\varDelta _j \times \varOmega _i\), such that:

$$\begin{aligned} \left\{ \begin{array}{ll} u^R(t,x)= \mathcal {M}^{\varDelta _j \times \varOmega _i}[u^b(t_j,x)] &{} \forall \,\, (t,x)\in \varDelta _j \times \varOmega _i \\ u^b(t_j,x)= u^b_j(x) &{} t_j \in \varDelta _j\, , \quad x \in \varOmega _i \end{array} \right. \end{aligned}$$
(15)

It is worth noting that the initial condition \(u^b_j(x)\) is the value at \(t_j\) of the solution of \(\mathcal {M}^{\varDelta \times \varOmega }[u(t_0,x)]\) defined in (1).

3.1 Space-Time Decomposition of the Discrete Model

Let us assume that \(\varDelta _M\times \varOmega _{K}\) can be decomposed into a sequence of \(P\times Q\) overlapping subdomains \(\varDelta _j \times \varOmega _i\) such that

$$\begin{aligned} \varDelta _M \times \varOmega _{K}=\bigcup _{i=1,P; \ j=1,Q} \varDelta _j \times \varOmega _i \end{aligned}$$

where \(\varOmega _i \subset \Re ^{r_i}\) with \(r_i \le K\) and \(\varDelta _j \subset \Re ^{s_j}\) with \(s_j\le M\). Finally, let us assume that

$$\begin{aligned} \varDelta _j:= [t_j, t_{j+s_j}] \quad . \end{aligned}$$

Hence

$$\begin{aligned} \mathbf {u}_{ji}:=RO_{ji}(\mathbf {u}) \equiv \mathbf {u^{RO_{ji}}} \equiv (u(z_{ji}))_{z_{ji}\in \varDelta _j \times \varOmega _i}, \quad \mathbf {u}_{ji} \in \Re ^{s_j\times r_i }. \end{aligned}$$

In this respect, we also define the Extension Operator (EO). If \(\mathbf {u}=(u(z_{ji}))_{z_{ji} \in \varDelta _j \times \varOmega _i}\), it is

$$\begin{aligned} EO(\mathbf {u})= \left\{ \begin{array}{cc} u(z_{ji}) &{} z_{ji} \in \varDelta _j \times \varOmega _i \\ 0 &{} elsewhere \end{array} \right. \end{aligned}$$

and \( EO(\mathbf {u}) \equiv \mathbf {u^{EO}} \in \Re ^{M\times K}\).

Definition 7

(Restriction of the Covariance Matrix) Let \(\mathbf {C}(\mathbf {w})\in \Re ^{K\times K}\) be the covariance matrix of a random vector \(\mathbf {w}=(w_1, w_2, \ldots ,w_{K}) \in \Re ^{K}\), that is, the coefficient \(c_{i,j}\) of \(\mathbf {C}\) is \(c_{i,j}=\sigma _{ij} \equiv Cov(w_i,w_j)\). Let \(s<K\); we define the Restriction Operator \(RO_{st}\) acting on \(\mathbf {C}(\mathbf {w})\) as follows:

$$\begin{aligned} RO_{st}: \mathbf {C}(\mathbf {w})\in \Re ^{K\times K} \mapsto RO_{st}[\mathbf {C}(\mathbf {w})]\overbrace{=}^{def}\mathbf {C}(\mathbf {w^{RO_{st}}})\in \Re ^{s\times s} \end{aligned}$$

i.e., it is the covariance matrix defined on \(\mathbf {w^{RO_{st}}}\).

Hereafter, we refer to \(\mathbf {C}(\mathbf {w^{RO_{st}}})\) using the notation \(\mathbf {C_{st}}\).
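In the discrete setting this restriction simply extracts the principal submatrix of the covariance associated with the retained components of \(\mathbf {w}\), as in the sketch below; the index array `idx` is a hypothetical stand-in for the components kept by \(RO_{st}\).

```python
import numpy as np

def restrict_covariance(C, idx):
    """Restriction of a covariance matrix (Definition 7): keep the rows and
    columns of C corresponding to the retained components of w."""
    idx = np.asarray(idx)
    return C[np.ix_(idx, idx)]

# Example: restrict a 4x4 covariance to components {0, 2}.
C = np.array([[4., 1., 0., 0.],
              [1., 3., 1., 0.],
              [0., 1., 2., 1.],
              [0., 0., 1., 5.]])
C_st = restrict_covariance(C, [0, 2])   # [[4., 0.], [0., 2.]]
```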

Definition 8

(Restriction of the operator \(\mathbf {H}^{(k)}\)) We define the Restriction Operator \(RO_{ji}\) of \(\mathbf {H}^{(k)}\) in \(DD(\varDelta \times \varOmega )\) as in (11) as the TLM at time \(t_k\) of the restriction of \(\mathcal {H}\) on \(\varDelta _j \times \varOmega _i\).

Definition 9

(Restriction of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}} \)) We let \(\mathbf {M}^{\varDelta _j \times \varOmega _{i}}\) be the Restriction Operator \(RO_{ji}\) of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}} \) in \(\varDelta _j \times \varOmega _i\) where:

$$\begin{aligned} RO_{ji}: \mathbf {M}^{\varDelta _M \times \varOmega _{K}} (\mathbf {u}_0^b) \mapsto \mathbf {M}^{\varDelta _j \times \varOmega _{i}} (\mathbf {u}_0^b)=\mathbf {u}^{b}_{ji} \end{aligned}$$

defined in \(\varDelta _j \times \varOmega _i\).

Definition 10

(Restriction of the operator \(\mathbf {M}^{0,M-1}\)) We define \(\mathbf {M}^{j,j+1}_i \) to be the Restriction Operator \(RO_{ji}\) of \(\mathbf {M}^{0,M-1}\) in \(DD(\varDelta \times \varOmega )\), as in (11). It is the TLM of the Restriction of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}}\) on \(\varDelta _j \times \varOmega _i\).

Finally, we are now able to give the following definition.

Definition 11

(Restriction of 4DVar DA) Let

$$\begin{aligned} RO_{ji}[J]: \mathbf {u}_{{ji}}\mapsto RO_{ji}[J](\mathbf {u}_{{ji}}) \end{aligned}$$

denote the Restriction Operator of the 4DVar DA functional defined in (6). It is defined as

$$\begin{aligned} \begin{array}{ll} RO_{ji}[J](\mathbf {u}_{{ji}})=&{} \Vert \underbrace{RO_{ji}(\mathbf {u})}_{\mathbf {u}_{ji}}- \underbrace{RO_{ji}[\mathbf {M}^{\varDelta _M \times \varOmega _{K}} (\mathbf {u}^{b}_0)]}_{\mathbf {u}^{b}_{ji}}\Vert _{(\mathbf{B}^{-1})_{ji}}\\ &{} + \lambda \sum _{k:t_k \in \varDelta _j} \Vert \underbrace{RO_{ji}[\mathbf {H}^{(k)}]\,RO_{ji}[\mathbf {M}^{\varDelta _M \times \varOmega _{K}} (\mathbf {u})]}_{(\mathbf {H}^{(k)})_{ji}\,RO_{ji}[\mathbf {M}^{\varDelta _M \times \varOmega _{K}}](\mathbf {u}_{ji})}-\underbrace{RO_{ji}[\mathbf {v}_k]}_{\mathbf {v}_{ji}} \Vert _{(\mathbf{R}^{-1})_{ji}}^{2} \quad .\\ \end{array} \end{aligned}$$
(16)

Local 4DVar DA functional \( J_{ji}(\mathbf {u}_{{ji}})\) in (16) becomes:

$$\begin{aligned} J_{ji}(\mathbf {u}_{{ji}})=&\underbrace{\Vert {\mathbf {u}_{ji}}-{\mathbf {u}^b_{ji}}\Vert _{(\mathbf {B}^{-1})_{ji}}}_{local \,\,state\,\,trajectory} + \end{aligned}$$
(17a)
$$\begin{aligned}&\lambda \sum _{k: t_k \in \varDelta _j}\underbrace{\Vert (\mathbf {H}^{(k)})_{ji}[\mathbf {M}^{k,k+1}_i(\mathbf {u}_{ji})]-\mathbf {v}_{ji}\Vert _{(\mathbf {R}^{-1})_{ji}}}_{local \,\,observations}. \end{aligned}$$
(17b)

This means that the approach we are following is to first decompose the 4DVar functional J and then to locally linearize and solve each local functional \(J_{ji}\). For simplicity of notation, we let

$$\begin{aligned} RO_{ji}[J] \equiv J_{ \varDelta _j \times \varOmega _i}\,. \end{aligned}$$

We observe that \(RO_{ji}[J](\mathbf {u}_{{ji}})\) consists of a first term, which quantifies the departure of the state \(\mathbf {u}_{{ji}}\) from the background state \(\mathbf {u}^b_{{ji}}\) at time \(t_j\) and location \(x_i\), and a second term, which measures the mismatch between the state \(\mathbf {u}_{{ji}}\) and the observations \(\mathbf {v}_{{ji}}\).

Definition 12

(Extension of 4DVar DA) Given \(DD(\varDelta \times \varOmega )\) as in (11) let

$$\begin{aligned} EO[J]: J_{\varDelta _j \times \varOmega _i} \mapsto J^{EO}_{\varDelta _j \times \varOmega _i}\,\,, \end{aligned}$$

be the Extension Operator (EO) of the 4DVar functional defined in (6), where

$$\begin{aligned} EO[J](J_{\varDelta _j \times \varOmega _i})= \left\{ \begin{array}{cc} J_{\varDelta _j \times \varOmega _i} &{} (t,x) \in \varDelta _j \times \varOmega _i \\ 0 &{} elsewhere \end{array} \right. \end{aligned}$$
(18)

From (18), the decomposition of J follows:

$$\begin{aligned} J\equiv \sum _{i=1,P;j=1,Q}J^{EO}_{ \varDelta _j\times \varOmega _i}\,\,. \end{aligned}$$
(19)

The main outcome of (19) is the capability of defining local 4DVar problems which all together contribute to the 4DVar problem, as detailed in the following.

3.2 Local 4DVar DA Problem: The Local RNL-LS Problem

Starting from the local 4DVar functional in (17), which is obtained by applying the Restriction Operator to the 4DVar functional defined in (6), we add a local constraint to such restriction. This is a sort of regularization of the local 4DVar functional, introduced in order to enforce the continuity of each local solution on the overlap region between adjacent subdomains. The local constraint consists of the overlapping operator \(\mathcal {O}_{(jh)(ik)}\) defined as

$$\begin{aligned} \mathcal {O}_{(jh)(ik)}:= \mathcal {O}_{jh}\circ \mathcal {O}_{ik} \end{aligned}$$
(20)

where the symbol \(\circ \) denotes operator composition. Each operator in (20) handles the overlap of the solution in the spatial dimension and in the temporal dimension, respectively. More precisely, for \(j = 1,\dots ,Q\) and \(i=1, \ldots ,P\), the operator \(\mathcal {O}_{(jh)(ik)}\) represents the overlap of the temporal subdomains j and h and of the spatial subdomains i and k, where h and k are given as in Definition 4, and

$$\begin{aligned} \mathcal {O}_{ik}:\mathbf {u}_{ji} \in \varDelta _j \times \varOmega _i \mapsto \mathbf {u}_{(j)(ik)}\in \varDelta _j \times (\varOmega _i\cap \varOmega _k) \end{aligned}$$
(21)

and

$$\begin{aligned} \mathcal {O}_{jh}: \mathbf {u}_{(j)(ik)}\mapsto \mathbf {u}_{(jh)(ik)}\in (\varDelta _j \cap \varDelta _h) \times (\varOmega _i \cap \varOmega _k) \end{aligned}$$
(22)

Remark 1

We observe that, in the overlapping domain \(\varDelta _{jh} \times \varOmega _{ik}\), we get two vectors: \(\mathbf {u}_{(jh)(ik)}\), which is obtained as the restriction of \(\mathbf {u}_{(ji)}= argmin \,J_{ji}(\mathbf {u}_{ji})\) to that region, and \(\mathbf {u}_{(hj)(ki)}\), which is the restriction of \(\mathbf {u}_{(hk)}= argmin \,J_{hk}(\mathbf {u}_{hk})\) to the same region. The order of the indexes plays a significant role from the computational point of view.

From (20), three cases arise:

  1. 1.

    decomposition in space, i.e. \(j=Q=1\) and \(P>1\): the time interval is not decomposed, while the spatial domain \(\varOmega \) is decomposed according to the domain decomposition in (11). The overlapping operator is defined as in (21). In particular we assume that

    $$\begin{aligned} \mathcal {O}_{ik}(\mathbf {u}_{ji}):=\Vert \underbrace{RO_{ji}(\mathbf {u}_{jk})}_{\mathbf {u}_{j(ki)}} - \underbrace{RO_{jk}(\mathbf {u}_{ji})}_{\mathbf {u}_{(j)(ik)}}\Vert _{(\mathbf {B}^{-1})_{ik}} \end{aligned}$$
  2. 2.

    decomposition in time, i.e. \(Q>1\) and \(P=1\). We get that \(i=P=1\), i.e. the spatial domain is not decomposed, and \(Q>1\), i.e. the time interval is decomposed according to the domain decomposition in (11). The overlapping operator is defined as in (22). In particular we assume that

    $$\begin{aligned} \mathcal {O}_{jh}(\mathbf {u}_{ji}):=\Vert \underbrace{RO_{ji}(\mathbf {u}_{hi})}_{\mathbf {u}_{(hj)i}} - \underbrace{RO_{hi}(\mathbf {u}_{ji})}_{_{\mathbf {u}_{(jh)i}}}\Vert _{(\mathbf {B}^{-1})_{jh}} \end{aligned}$$
  3. 3.

    decomposition in space-time, i.e. \(Q>1\) and \(P>1\). We assume that \(Q>1\) and \(P>1\) i.e. both the time interval and the spatial domain are decomposed according to the domain decomposition in (11). The overlapping operator is defined as in (20). In particular we assume that

    $$\begin{aligned} \mathcal {O}_{(jh)(ik)}(\mathbf {u}_{ji}):=\Vert \mathbf {u}_{(hj)(ki)} - \underbrace{RO_{hi}(RO_{jk}(\mathbf {u}_{ji}))}_{_{\mathbf {u}_{(jh)(ik)}}}\Vert _{(\mathbf {B}^{-1})_{(jh)(ik)}} \end{aligned}$$

We can now give the definition of the local 4DVar DA functional.

Definition 13

(Local 4DVar DA) Given \(DD(\varDelta \times \varOmega )\) as in (11), let:

$$\begin{aligned} J_{ji}(\mathbf {u}_{{ji}})= & {} RO_{ji}[J](\mathbf {u}_{{ji}})+\mu _{ji} \, \mathcal {O}_{(jh)(ik)}(\mathbf {u}_{{ji}}) \end{aligned}$$
(23)

where \(RO_{ji}[J](\mathbf {u}_{{ji}})\) is given in (16) and \(\mathcal {O}_{(jh)(ik)}\) is suitably defined on \(\varDelta _{jh}\times \varOmega _{ik}\), be the local 4DVar functional; \(\mu _{ji}\) is a regularization parameter. Finally, let

$$\begin{aligned} {\mathbf {u}}_{ji}^{DA} = \arg \min _{\mathbf {u}_{{ji}}}{J}_{ji}(\mathbf {u}_{{ji}}) \end{aligned}$$
(24)

be the global minimum of \({J}_{ji}\) in \(\varDelta _j \times \varOmega _i\).

More precisely, the local 4DVar DA functional \( J_{ji}(\mathbf {u}_{{ji}})\) in (23) becomes:

$$\begin{aligned} J_{ji}(\mathbf {u}_{{ji}})=&\underbrace{\Vert {\mathbf {u}_{ji}}-{\mathbf {u}^b_{ji}}\Vert _{(\mathbf {B}^{-1})_{ji}}}_{local \,\,state\,\,trajectory} + \end{aligned}$$
(25a)
$$\begin{aligned}&\lambda \sum _{k: t_k \in \varDelta _j}\underbrace{\Vert (\mathbf {H}^{(k)})_{ji}[\mathbf {M}^{k,k+1}_i(\mathbf {u}_{ji})]-\mathbf {v}_{ji}\Vert _{(\mathbf {R}^{-1})_{ji}}}_{local \,\,observations}+\end{aligned}$$
(25b)
$$\begin{aligned}&\mu _{ji} \underbrace{\Vert \mathbf {u}_{(hj)(ki)}-\mathbf {u}_{(jh)(ik)}\Vert _{(\mathbf {B}^{-1})_{(jh)(ik)}}}_{overlap} \end{aligned}$$
(25c)

where the three terms contributing to the definition of the local DA functional clearly stand out. We note that in (17) the operator \(\mathbf {M}^{k,k+1}_i\), which is defined in (4), replaces \(\mathcal {M}^{\varDelta _j \times \varOmega _i}\).
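The following sketch collects the three terms of (25a)-(25c) for one subdomain. Every argument is a hypothetical local quantity (restricted state, background, observations, covariances and the two copies of the solution on the overlap region), and `local_model_obs` stands for the composition \((\mathbf {H}^{(k)})_{ji}\,\mathbf {M}^{k,k+1}_i\); weighted norms are written as quadratic forms.

```python
import numpy as np

def local_cost(u_ji, u_b_ji, B_inv_ji, local_model_obs, R_inv_ji, v_ji,
               u_overlap_own, u_overlap_neigh, B_inv_overlap, lam=1.0, mu=1.0):
    """Local 4DVar functional (25a)-(25c) on Delta_j x Omega_i (a sketch).

    local_model_obs : callable u_ji -> [H^(k) M^{k,k+1}_i(u_ji)] for t_k in Delta_j
    u_overlap_own   : restriction of u_ji to the overlap region (u_(jh)(ik))
    u_overlap_neigh : restriction of the adjacent subdomain's solution (u_(hj)(ki))
    """
    d_b = u_ji - u_b_ji
    cost = d_b @ B_inv_ji @ d_b                   # (25a) local state trajectory term
    for y_k, v_k in zip(local_model_obs(u_ji), v_ji):
        d_k = y_k - v_k
        cost += lam * (d_k @ R_inv_ji @ d_k)      # (25b) local observation term
    d_o = u_overlap_neigh - u_overlap_own
    cost += mu * (d_o @ B_inv_overlap @ d_o)      # (25c) overlap (continuity) term
    return cost
```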

Finally, we have to guarantee that the global minimum of the functional J can be searched among the global minima of the local functionals.

3.3 Local 4DVar DA Minimization

Let

$$\begin{aligned} \mathbf {\widetilde{u}_{ji}}:=({\mathbf {u}}_{ji}^{DA})^{EO}\in \Re ^{M \times K} , \quad \forall \, j=1,Q;i=1,P \ \end{aligned}$$
(26)

where \({\mathbf {u}}_{ji}^{DA}\) is defined in (24), be the extensions of the (global) minima of the local functionals \({J}_{ji}\) as in (24). Let

$$\begin{aligned} \mathbf {\widetilde{u}^{DA}}:= \arg \min _{j=1,Q;i=1,P} \left\{ J \left( \mathbf {\widetilde{u}_{ji}}\right) \right\} \end{aligned}$$
(27)

be the one, among them, that minimizes J.

Theorem 1

Let \( DD( \varDelta \times \varOmega )\) be a decomposition of \( \varDelta \times \varOmega \) as defined in (11). It follows that:

$$\begin{aligned} J(EO(\mathbf {{u}^{DA}})) \le J( \mathbf {\widetilde{u}^{DA}}), \end{aligned}$$
(28)

with \( \mathbf {u}^{DA}\) defined in (5). Moreover, the equality in (28) holds if J is convex.

Proof

Let \({\mathbf {u}}_{ji}^{DA}\) be defined as in (24); it is

$$\begin{aligned} \nabla {J}_{ji}[ {\mathbf {u}}_{ji}^{DA}]=\underline{0}\in \Re ^{NP}, \quad \forall (j,i) : \varDelta _j \times \varOmega _i\in DD(\varDelta \times \varOmega ). \end{aligned}$$
(29)

From (29) it follows

$$\begin{aligned} \nabla EO\left[ {J}_{ji}\left( {\mathbf {u}}_{ji}^{DA}\right) \right] =\underline{0}\,, \end{aligned}$$
(30)

which, from (19), gives:

$$\begin{aligned} \nabla {J}\left[ ({\mathbf {u}}_{ji}^{DA})^{EO}\right] =\underline{0} \end{aligned}$$
(31)

then \(({\mathbf {u}}_{ji}^{DA})^{EO}\) is a stationary point for J in \(\Re ^{M \times K}\). As \( \mathbf {u}^{DA}\) in (5) is the global minimum of J in \(\Re ^K\), it follows that:

$$\begin{aligned} J(EO(\mathbf {{u}}^{DA})) \le J\left( (\mathbf {u}_{ji}^{DA})^{EO}\right) , \quad \forall \, j=1,Q;i=1,P \end{aligned}$$
(32)

then, from (27), it follows that

$$\begin{aligned} J( EO(\mathbf {u}^{DA}) )\le J\left( \mathbf {\widetilde{u}^{DA}}\right) . \end{aligned}$$
(33)

Now we prove that if J is convex, then

$$\begin{aligned} J(EO( \mathbf {u}^{DA} ))= J(\mathbf {\widetilde{u}^{DA}}) \end{aligned}$$

by contradiction. Assume that

$$\begin{aligned} J(EO( \mathbf {u}^{DA}) )< J( \mathbf {\widetilde{u}^{DA}} ). \end{aligned}$$
(34)

In particular,

$$\begin{aligned} J(EO(\mathbf {u}^{DA})) <J( RO_{ji}( \widetilde{\mathbf {u}}^{DA})) \quad . \end{aligned}$$

This means that

$$\begin{aligned} RO_{ji} \left[ J(EO(\mathbf {u}^{DA} ))\right] < RO_{ji}\left[ J(\mathbf {\widetilde{u}^{DA} })\right] . \end{aligned}$$
(35)

From (35) and (27), it follows that:

$$\begin{aligned} RO_{ji} \left[ J(EO(\mathbf {u}^{DA}) )\right] < RO_{ji}\left[ \min _{i} \, J\left( (\mathbf {u}_{ji}^{DA})^{EO} \right) \right] \end{aligned}$$

then, from (14):

$$\begin{aligned} J_{ji} \left( RO_{ji} [\mathbf {u}^{DA}]^{EO} \right) < J_{ji}\left( RO_{ji}\left[ \mathbf {u}_{ji}^{DA}\right] ^{EO}\right) = J_{ji}({\mathbf {u}}_{ji}^{DA} )\quad \quad . \end{aligned}$$
(36)

Inequality (36) is a contradiction, as \({\mathbf {u}}_{ji}^{DA}\) is the global minimum of \(J_{ji}\). So (28) is proved.

\(\square \)

4 The Space-Time RNLLS Parallel Algorithm

We introduce the algorithm solving the RNL-LS problem by using the space-time decomposition, i.e. solving the \(QP=q\times p\) local problems in \(\varDelta _j \times \varOmega _i\), where \(j=1,\ldots ,Q\) and \(i=1,\ldots ,P\) (see Fig. 1 for an example of domain decomposition with \(Q=4\) and \(P=2\)).

Definition 14

(DD-RNLLS Algorithm) Let \(\mathcal {A}^{loc}_{RNLLS}(\varDelta _j \times \varOmega _i)\) denote the algorithm solving the local 4DVar DA problem defined in \(\varDelta _j \times \varOmega _i\). The space-time DD-RNLLS parallel algorithm solving the RNL-LS problem in \(DD(\varDelta \times \varOmega )\), is symbolically denoted as

$$\begin{aligned} \mathcal {A}^{DD}_{RNLLS}(\varDelta _M \times \varOmega _{K}), \end{aligned}$$

and it is defined as the merging of the \(QP=Q\times P\) local algorithms \(\mathcal {A}^{loc}_{RNLLS}(\varDelta _j \times \varOmega _i)\), i.e.:

$$\begin{aligned} \mathcal {A}^{DD}_{RNLLS}(\varDelta _M \times \varOmega _{K}):=\bigcup _{j=1,Q;i=1,P} \mathcal {A}^{loc}_{RNLLS}(\varDelta _j \times \varOmega _i). \end{aligned}$$
(37)
Fig. 1 Configurations of the decomposition \(DD(\varDelta _M \times \varOmega _{K})\), if \(\varOmega \subset \Re \), and \(Q=4\), \(P=2\) (figure not reproduced)

The DD-RNLLS algorithm can be sketched as described by Algorithm 1. Similarly, the Local RNLLS algorithm \(\mathcal {A}^{loc}_{RNLLS}\) is described by Algorithm 2.

[Algorithm 1: the space-time DD-RNLLS algorithm; Algorithm 2: the local algorithm \(\mathcal {A}^{loc}_{RNLLS}\) (listings not reproduced)]

Remark 2

We observe that the \(\mathcal {A}^{DD}_{RNLLS}(\varDelta _M \times \varOmega _{K})\) algorithm is based on two main steps, i.e. the domain decomposition step (see line 1) and the model linearization step (see line 6). This means that the algorithm uses a convex approximation of the objective DA functional, so that Theorem 1 holds.

The common approach for solving RNL-LS problems consists in defining a sequence of local approximations of \(\mathbf {J}_{ij}\), where each member of the sequence is minimized by employing Newton's method or one of its variants (such as Gauss-Newton, L-BFGS, Levenberg-Marquardt). Approximations of \(\mathbf {J}_{ij}\) are obtained by expanding \(\mathbf {J}_{ij}\) in a truncated Taylor series, while the minimum is obtained by using second-order sufficient conditions [13, 44]. Let us consider Algorithm 3, solving the RNL-LS problem on \(\varDelta _j \times \varOmega _i\).

[Algorithm 3: the local RNL-LS solver on \(\varDelta _j \times \varOmega _i\) (listing not reproduced)]
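Since the listing of Algorithm 3 is not reproduced here, the following generic sketch only illustrates the incremental strategy just described, not the paper's exact algorithm: an outer loop builds a local approximation of \(\mathbf {J}_{ji}\) around the current iterate, while an inner solver (a hypothetical Newton, Gauss-Newton or L-BFGS step) returns the increment.

```python
import numpy as np

def incremental_minimize(grad_J, inner_step, u0, max_outer=10, tol=1e-6):
    """Generic incremental (outer-loop) minimization of a local functional J_ji.

    grad_J     : callable returning the gradient of J_ji at u
    inner_step : callable returning the increment du that minimizes the local
                 (linear or quadratic) approximation of J_ji built at u
    u0         : starting guess (e.g. the local background u^b_ji)
    """
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(max_outer):
        du = inner_step(u)                    # minimize the local approximation at u^l
        u = u + du                            # incremental update u^{l+1} = u^l + du^l
        if np.linalg.norm(grad_J(u)) < tol:   # stop when (approximately) stationary
            break
    return u
```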

The main computational task occurs at step 5 of Algorithm 3, concerning the minimization of \(\tilde{\mathbf {J}}_{ji}\), which is the local approximation of \(\mathbf {J}_{ij}\). Two approaches could be employed in Algorithm 3:

  1. (a)

    by truncating the Taylor series expansion of \(\mathbf {J}_{ij}\) at the second order, we get

    $$\begin{aligned} \mathbf {J}_{ij}^{QD}(\mathbf {{u}}_{ji}^{l+1})=\mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})+\nabla \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})^T \delta \mathbf {{u}}_{ji}^l+\frac{1}{2}\left( \delta \mathbf {{u}}_{ji}^l\right) ^T \nabla ^2 \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l}) \delta \mathbf {{u}}_{ji}^l \end{aligned}$$
    (38)

    giving a quadratic approximation of \(\mathbf {J}_{ji}\) at \(\mathbf {u}_{ji}^l\). Newton's methods (including L-BFGS and Levenberg-Marquardt) use \(\tilde{\mathbf {J}}_{{ji}}=\mathbf {J}_{ij}^{QD}\).

  2. (b)

    by truncating the Taylor series expansion of \(\mathbf {J}_{ij}\) at the first order, we get the following linear approximation of \(\mathbf {J}_{ij}\) at \(\mathbf {{u}}_{ji}^{l}\):

    $$\begin{aligned} \mathbf {J}_{ij}^{TL}(\mathbf {{u}}_{ji}^{l+1})=\mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})+\nabla \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})^T \delta \mathbf {{u}}_{ji}^l=\frac{1}{2} \Vert \nabla \mathbf {F}_{ji}(\mathbf {{u}}_{ji}^{l})\delta \mathbf {{u}}_{ji}^l+\mathbf {F}_{ji}(\mathbf {{u}}_{ji}^{l})\Vert _2^2 \end{aligned}$$
    (39)

    where we let \(\mathbf {J}_{ij}:= \Vert \mathbf {F}_{ji}\Vert _2^2\), which gives a linear approximation of \(\mathbf {J}_{ji}\) at \(\mathbf {u}_{ji}^l\). Gauss-Newton methods (including Truncated or Approximated Gauss-Newton [21]) use \(\mathbf {J}^{TL}_{ji}=\tilde{\mathbf {J}}_{ji}\).

Observe that from (38) it follows

$$\begin{aligned} \mathbf {J}_{ij}^{QD}(\mathbf {{u}}_{ji}^{l+1})=\mathbf {J}_{ij}^{ TL}(\mathbf {{u}}_{ji}^{l})+\frac{1}{2} \left( \delta \mathbf {{u}}_{ji}^l\right) ^T \nabla ^2 \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l}) \delta \mathbf {{u}}_{ji}^l. \end{aligned}$$
(40)

Algorithm 3 becomes Algorithm 4, described below.

[Algorithm 4 (listing not reproduced)]
  1. (a)

    \(\mathcal {A}_{QN}^{loc}\): computes a local minimum of \(\mathbf {J}^{QD}_{ji}\) following Newton's descent direction. The minimum is computed by solving the linear system involving the Hessian matrix \(\nabla ^2\mathbf {{J}}_{ij}\) and the negative gradient \(-\nabla \mathbf {{J}}_{ij}\) at \(\mathbf {{u}}_{ji}^l\), for each value of l (see Algorithm 5 described below);

  2. (b)

    \(\mathcal {A}_{LLS}^{loc}\): computes a local minimum of \(\mathbf {J}^{TL}_{ji}\) following the steepest descent direction. The minimum is computed by solving the normal equations arising from the local Linear Least Squares (LLS) problem (see Algorithm 6 described below).

[Algorithm 5: \(\mathcal {A}_{QN}^{loc}\); Algorithm 6: \(\mathcal {A}_{LLS}^{loc}\) (listings not reproduced)]

Remark 3

We observe that if, in the \(\mathcal {A}^{loc}_{QN}\) algorithm, the matrix \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\) (see line 6 of Algorithm 5) is neglected, we get the Gauss-Newton method described by the \(\mathcal {A}^{loc}_{LLS}\) algorithm. More generally, concerning the term \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\):

  1. 1.

    in case of Gauss Newton, \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\) is neglected;

  2. 2.

    in case of Levenberg-Marquardt, \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\) equals \(\lambda I\), where the damping term \(\lambda >0\) is updated at each iteration and I is the identity matrix [26, 30];

  3. 3.

    in case of the L-BFGS, the Hessian matrix is Rank-1 updated at every iteration [45].

According to the most common implementations of 4DVar DA [15, 50], we focus our attention on the Gauss-Newton (G-N) method described by \(\mathcal {A}^{loc}_{LLS}\) in Algorithm 6.

For each l, let \(\mathbf {G}_{ji}^l = RO_{ji}[\mathbf {G}^l]\), where \(\mathbf {G}^l \in \Re ^{( M\times nobs)\times ( NP \times M)}\), be the block diagonal matrix such that

$$\begin{aligned} \mathbf {G}^l= \left\{ \begin{array}{ll} diag\, [\mathbf {H}_0,\mathbf {H}_1\mathbf {M}_{0,1}^l, \ldots , \mathbf {H}_{M-1}\mathbf {M}_{M-2,M-1}^l] &{} M>1; \\ \mathbf {H}_0&{} M=1. \end{array} \right. \end{aligned}$$
(41)

while \((\mathbf {G}_{ji}^T)^l=RO_{ji}[(\mathbf {G}^T)^l]\) is the restriction of the transpose of \(\mathbf {G}^l\), and

$$\begin{aligned} \mathbf {M}_{0,1}^l, \ldots , \mathbf {M}_{M-2,M-1}^l \end{aligned}$$

are the TLMs of \(\mathbf {M}_{k,k+1}\), for \(k=0,\ldots ,M-2\), around \( \mathbf {u}^l_{ji}\), respectively. Finally, let

$$\begin{aligned} \mathbf {d}_{ji}^l=\mathbf {v}_{ji}-\mathbf {H}_{ji}\mathbf {u}^l_{ji}\quad \end{aligned}$$

be the restriction of the misfit vector. In line 7 of Algorithm 6, it is

$$\begin{aligned} \nabla \mathbf {F}_{{ji}}^T(\mathbf {{u}}_{ji}^l)\nabla \mathbf {F}_{{ji}}(\mathbf {{u}}_{ji}^l)=\mathbf {B}^{-1}_{ji}+(\mathbf {G}_{ji}^T)^l(\mathbf {R}^{-1})_{ji}\mathbf {G}_{ji}^l, \end{aligned}$$
(42)

and,

$$\begin{aligned} -\nabla \mathbf {F}_{{ji}}^T(\mathbf {{u}}_{ji}^l) \mathbf {F}_{{ji}}(\mathbf {{u}}_{ji}^l)=(\mathbf {G}_{ji}^T)^l\mathbf {R}_{ji}^{-1}\mathbf {d}_{ji}. \end{aligned}$$
(43)

The most popular 4DVar DA software implements the so-called \(\mathbf {B}\)-preconditioned Krylov subspace iterative method [21, 23, 50], obtained by using the background error covariance matrix as a preconditioner of a Krylov subspace iterative method.

Let \(\mathbf{B}_{ji}=\mathbf {V}_{ji}\mathbf {V}_{ji}^T\) be expressed in terms of the deviance matrix \(\mathbf {V}_{ji}\), and let \(\mathbf {w}_{ji}^l\) be such that

$$\begin{aligned} \mathbf {w}_{ji}^l = \mathbf {V}_{ji} ^+ (\mathbf {u}_{ji}^l-\mathbf {u}_{ji}^b) \end{aligned}$$
(44)

with \(\mathbf {V}_{ji} ^+\) the generalised inverse of \(\mathbf {V}_{ji} \), equation (42) becomes

$$\begin{aligned} \mathbf {B}^{-1}_{ji}+(\mathbf {G}_{ji}^T)^l(\mathbf {R}^{-1})_{ji}\mathbf {G}_{ji}^l=\mathbf{I}_{ji}+(\mathbf{G}_{ji}^l\mathbf {V}_{ji})^T(\mathbf{R}^{-1})_{ji}\mathbf{G}_{ji}^l\mathbf {V}_{ji}, \end{aligned}$$
(45)

while (43) becomes

$$\begin{aligned} (\mathbf {G}_{ji}^T)^l(\mathbf {R}^{-1})_{ji}\mathbf {d}_{ji}=(\mathbf{G}_{ji}^l\mathbf {V}_{ji})^T(\mathbf{R}^{-1})_{ji}\mathbf {d}_{ji}. \end{aligned}$$
(46)

The normal equation system (see line 7 of \(\mathcal {A}^{loc}_{LLS}\)), i.e. the linear system

$$\begin{aligned} ((\mathbf {B}^{-1})_{ji}+(\mathbf {G}_{ji}^T)^l(\mathbf {R}^{-1})_{ji}\mathbf {G}_{ji}^l )\delta \mathbf {{u}}_{ji}^l=(\mathbf {G}_{ji}^T)^l(\mathbf {R}^{-1})_{ji}\mathbf {d}_{ji} \end{aligned}$$

becomes

$$\begin{aligned} (\mathbf{I}_{ji}+(\mathbf{G}_{ji}^l\mathbf {V}_{ji})^T(\mathbf{R}^{-1})_{ji}\mathbf{G}_{ji}^l\mathbf {V}_{ji})\delta \mathbf {{u}}_{ji}^l=(\mathbf{G}_{ji}^l\mathbf {V}_{ji})^T(\mathbf{R}^{-1})_{ji}\mathbf {d}_{ji}\quad . \end{aligned}$$
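A minimal sketch of one such preconditioned Gauss-Newton solve is given below: it assembles the system above, solves for the control-space increment, and maps it back to the state space through \(\mathbf {V}_{ji}\). The dense `numpy.linalg.solve` call stands in for the B-preconditioned Krylov iteration used in practice, and all arguments are hypothetical local quantities.

```python
import numpy as np

def preconditioned_gn_increment(G, R_inv, V, d):
    """One local Gauss-Newton increment in the B-preconditioned form (a sketch).

    G     : linearized observation-model operator G_ji^l          (m, K_loc)
    R_inv : local inverse observation-error covariance (R^-1)_ji  (m, m)
    V     : deviance factor with B_ji = V V^T                     (K_loc, r)
    d     : local misfit vector d_ji                               (m,)
    """
    GV = G @ V
    A = np.eye(GV.shape[1]) + GV.T @ R_inv @ GV   # preconditioned Hessian, cf. (45)
    b = GV.T @ R_inv @ d                          # preconditioned right-hand side, cf. (46)
    dw = np.linalg.solve(A, b)                    # increment in the control variable w
    return V @ dw                                 # state-space increment delta u_ji
```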

Definition 15

Let \(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i)\) denote the algorithm solving the local 4DVar DA problem defined in \(\varDelta _j \times \varOmega _i\). The space-time 4DVar DA parallel algorithm solving the 4DVar DA problem in \(DD(\varDelta _{M} \times \varOmega _{K})\), is symbolically denoted as \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\), and it is defined as the union of the \(QP=q\times p\) local algorithms \(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i)\), i.e.:

$$\begin{aligned} \mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K}):=\bigcup _{j=1,q;i=1,p} \mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i). \end{aligned}$$
(47)

Algorithm \( \mathcal {A}^{loc}_{4DVar}\) is algorithm \( \mathcal {A}^{loc}_{LLS}\) (see Algorithm 6) specialized for the 4DVar DA problem, and it is described by Algorithm 7 and Algorithm 8 below [23].

[Algorithm 7 and Algorithm 8: \(\mathcal {A}^{loc}_{4DVar}\) (listings not reproduced)]

In the next section we will show that this formulation leads to local numerical solutions convergent to the numerical solution of the global problem.

5 Convergence Analysis

In the following we set \(\Vert \cdot \Vert =\Vert \cdot \Vert _{\infty }\).

Proposition 1

Let \(u_{j,i}^{ASM,r}\) be the approximation of the increment \(\delta \mathbf {{u}}_{ji}\) to the solution \(\mathbf {{u}}_{ji}\) obtained at step r of the ASM-based inner loop on \(\varDelta _{j}\times \varOmega _{i}\). Let \(u_{j,i}^{n}\) be the approximation of \(\mathbf {u}_{j,i}\) obtained at step n of the outer loop, i.e. of the space-time decomposition approach on \(\varDelta _{j}\times \varOmega _{i}\). Let us assume that the numerical scheme discretizing the model \(\mathbf {M}_i^{j,j+1}\) is convergent. Then, for fixed i and j, it holds that:

$$\begin{aligned} \forall \epsilon>0\ \ \exists M(\epsilon )>0 \ \ : \ \ n>M(\epsilon ) \ \ \Rightarrow \ \ E_{j,i}^{n}:=\Vert \mathbf {u}_{j,i}-u_{j,i}^{n}\Vert \le \epsilon . \end{aligned}$$
(48)

Proof

Let \(u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}\) be the numerical solution of \(\mathbf {M}_i^{j,j+1}\) at step \(n+1\); taking into account that, according to the incremental update of the solution of the 4DVar DA functional (see, for instance, line 10 of Algorithm 7), the approximation \(\mathbf {u}_{j,i}^n\) is computed as

$$\begin{aligned} \mathbf {u}_{j,i}^n= u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}+[u_{j,i}^{ASM,r}-u_{j,i}^{\mathbf {M}_i^{j,j+1},n}] \end{aligned}$$

then, it is

$$\begin{aligned} \begin{array}{ll} E_{j,i}^{n}:=\Vert \mathbf {u}_{j,i}-u_{j,i}^{n}\Vert &{}=\Vert \mathbf {u}_{j,i}-u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}-[u_{j,i}^{ASM,r}-u_{j,i}^{\mathbf {M}_i^{j,j+1},n}]\Vert \\ {} &{}\le \Vert \mathbf {u}_{j,i}-u_{j,i}^{ASM,r}\Vert +\Vert {u}_{j,i}^{\mathbf {M}_i^{j,j+1},n}-u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}\Vert \end{array} \end{aligned}$$
(49)

from the hypothesis, we have

$$\begin{aligned}&\forall \epsilon ^{\mathbf {M}_i^{j,j+1}}>0 \,\exists M^{1}(\epsilon ^{\mathbf {M}_i^{j,j+1}})>0 : n>M^{1}(\epsilon ^{\mathbf {M}_i^{j,j+1}})\nonumber \\&\quad \Rightarrow \Vert u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}-u_{j,i}^{\mathbf {M}_i^{j,j+1},n}\Vert \le \epsilon ^{\mathbf {M}_i^{j,j+1}} \end{aligned}$$
(50)

and (49) can be rewritten as follows

$$\begin{aligned} \begin{array}{ll} \Vert \mathbf {u}_{j,i}-u_{j,i}^{n}\Vert \le \Vert \mathbf {u}_{j,i}-u_{j,i}^{ASM,r}\Vert +\epsilon ^{\mathbf {M}_i^{j,j+1}}. \end{array} \end{aligned}$$
(51)

Convergence of ASM is proved in [5]; similarly, applying ASM to the 4DVar DA problem, it holds that

$$\begin{aligned} \forall \epsilon ^{ASM}>0\ \ \exists M^{2}(\epsilon ^{ASM})>0 \ \ : \ \ n >M^{2}(\epsilon ^{ASM}) \ \ \Rightarrow \ \ \Vert u_{j,i}-u_{j,i}^{ASM,r}\Vert \le \epsilon ^{ASM}, \end{aligned}$$
(52)

and for \(n\ >M^{2}(\epsilon ^{ASM})\), we get

$$\begin{aligned} \begin{array}{ll} \Vert \mathbf {u}_{j,i}-u_{j,i}^{n}\Vert \le \epsilon ^{ASM} +\epsilon ^{\mathbf {M}_i^{j,j+1}}. \end{array} \end{aligned}$$
(53)

Hence, by setting \(\epsilon :={\epsilon ^{ASM}}+\epsilon ^{\mathbf {M}_i^{j,j+1}}\) and \(M(\epsilon ):= \max \{M^{1}(\epsilon ^{\mathbf {M}_i^{j,j+1}}),M^{2}(\epsilon ^{ASM})\}\), we get the thesis in (48). \(\square \)

6 Performance Analysis

Performance metrics we will employ are time complexity and scalability. Our aim is to highlight the benefits arising from using the decomposition approach instead of solving the problem on the whole domain. As we shall discuss later, the performance gain that we get from using the space and time decomposition approach is twofold:

  1. 1.

    Instead of solving one large problem, we solve several smaller problems which are better conditioned than the original one. This leads to a reduction of each local algorithm's time complexity.

  2. 2.

    The subproblems reproduce the whole problem at smaller dimensions and are solved in parallel. This leads to a reduction of the software execution time.

We give the following

Definition 16

A uniform bi-directional decomposition of the space-time domain \(\varDelta _M \times \varOmega _{K}\) is such that, if we let

$$\begin{aligned} size(\varDelta _M \times \varOmega _{K})= M \times K, \end{aligned}$$

be the size of the whole domain, then each subdomain \(\varDelta _j \times \varOmega _i\) is such that

$$\begin{aligned} size(\varDelta _j \times \varOmega _i)=D_t \times D_s, \quad j=1,\ldots ,q; \quad i=1,\ldots ,p \end{aligned}$$

where \(D_t=\frac{M}{q}\ge 1\), and \(D_s=\frac{K}{p}\ge 1\).

In the following we let

$$\begin{aligned} N:= M \times K; \quad N_{loc}:= D_t \times D_s ; \quad QP:= q \times p \quad . \end{aligned}$$

Let \(T(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K}))\) denote time complexity of \(\mathcal {A}^{DD}_{4DVar}(\varDelta _{M} \times \varOmega _{K})\).

We now provide an estimate of the time complexity of each local algorithm, denoted as \(T(\mathcal {A}^{Loc}_{4DVar}(\varDelta _j \times \varOmega _i))\). This algorithm consists of two loops: the outer loop, over the l-index, computes local approximations of \(\mathbf {J}_{ji}\), while the inner loop, over the m-index, performs Newton's or Lanczos' steps. The major computational task performed at each step of the outer loop is the computation of \(\mathbf {J}_{ji}\). The major computational tasks performed at each step of the inner loop, in the case of the G-N method (see algorithm \(\mathcal {A}^{Loc}_{4DVar}\)), involving the predictive model are

  1. 1.

    the computation of the tangent linear model \(RO_{ji}[\mathbf {M}_{k,k+1}]\) (the time complexity of such an operation scales as the square of the problem size),

  2. 2.

    the computation of the adjoint model \(RO_{ji}[\mathbf {M}_{k,k+1}^T]\) (which is at least 4 times more expensive than the computation of \(RO_{ji}[\mathbf {M}_{k,k+1}]\)),

  3. 3.

    the solution of the normal equations, involving at each iteration two matrix-vector products with \(RO_{ji}[\mathbf {M}_{k,k+1}^T]\) and \(RO_{ji}[\mathbf {M}_{k,k+1}]\) (whose time complexity scales as the square of the problem size).

As the most time consuming operation involving the predictive model is the computation of the tangent linear model, we prove that

Proposition 2

Let

$$\begin{aligned} P(N_{loc})= a_{d} N_{loc}^{d} + a_{d-1} N_{loc}^{d-1}+ \ldots + a_0, \quad a_d \ne 0 \end{aligned}$$

be the polynomial of degree \(d=2\) denoting the time complexity of the tangent linear model \(RO_{ji}[\mathbf {M}_{k,k+1}]\). Let \(m_{ji}\) and \(l_{ji}\) be the number of steps of the outer and inner loops of \(\mathcal {A}^{Loc}_{4DVAR}\), respectively. We get

$$\begin{aligned} T(\mathcal {A}^{Loc}_{4DVAR}(\varDelta _j \times \varOmega _i)))= O\left( m_{ji}l_{ji}P(N_{loc})\right) \end{aligned}$$

Proof

It is:

$$\begin{aligned} T(\mathcal {A}^{Loc}_{4DVAR}(\varDelta _j \times \varOmega _i))&= l_{ji} \times \left[ T(RO_{ji}[\mathbf {M}_{k,k+1}]) + m_{ji}\times O\left( T(RO_{ji}[\mathbf {M}_{k,k+1}])+T(RO_{ji}[\mathbf {M}_{k,k+1}^T])\right) \right] \nonumber \\&= O\left( m_{ji}l_{ji}P(N_{loc})\right) \end{aligned}$$
(54)

Let

$$\begin{aligned} m_{max}:= \max _{ji} \,m_{ji};\quad l_{max}:= \max _{ji} \,l_{ji}. \end{aligned}$$

Observe that \(m_{max}\) and \(l_{max}\) actually are the number of steps of the outer and inner loops of \(\mathcal {A}^{DD}(\varDelta _M \times \varOmega _{K})\), respectively. Let \(m_G\) and \(l_G\) denote the number of iterations of the inner and outer loops of the \(\mathcal {A}^{G}(\varDelta _M \times \varOmega _{K})\) algorithm; we give the following:

Definition 17

Let

$$\begin{aligned} \rho ^G:= m_G \times l_G\quad ;\quad \rho ^{ji}:= m_{ji} \times l_{ji}\quad ;\quad \rho ^{DD}:= m_{max} \times l_{max}\quad \end{aligned}$$

denote the total number of iterations of \(\mathcal {A}^{G}_{4DVAR}(\varDelta _M \times \varOmega _{K})\), of \(\mathcal {A}^{Loc}_{4DVAR}(\varDelta _j \times \varOmega _{i})\) and of \(\mathcal {A}^{DD}_{4DVAR}(\varDelta _M \times \varOmega _{K})\), respectively.

If we denote by \(\mu (J)\) the condition number of the DA operator, since it holds that [3]

$$\begin{aligned} \forall \, i,j \quad \mu (J_{4DVAR}^{Loc}) < \mu (J_{4DVAR}) \end{aligned}$$

then it is

$$\begin{aligned} \rho ^{ji}< \rho ^G, \end{aligned}$$

and

$$\begin{aligned} \rho ^{DD} < \rho ^G . \end{aligned}$$

This result says that the number of iterations of the \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\) algorithm is always smaller than the number of iterations of the \(\mathcal {A}^{G}_{4DVar}(\varDelta _M \times \varOmega _{K})\) algorithm. This is one of the benefits of using the space and time decomposition.

Algorithm scalability is measured in terms of strong scaling (which is the measure of the algorithm's capability to exploit the performance of high performance computing architectures in order to minimise the time to solution for a given problem with a fixed dimension) and of weak scaling (which is the measure of the algorithm's capability to use additional computational resources effectively to solve increasingly larger problems). A variety of metrics have been developed to assist in evaluating the scalability of a parallel algorithm: speed-up, model throughput, scale-up and efficiency are the most used. Each one highlights specific needs and limits to be addressed by the parallel algorithm. In our case, as we intend to mainly focus on the benefits arising from the use of hybrid computing architectures, we consider the so-called scale-up factor, first introduced in [7].

The first result derives straightforwardly from the definition of the scale-up factor:

Proposition 3

(DD-4DVar Scale up factor) The (relative) scale-up factor of \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\) related to \(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i)\), denoted as \(Sc_{QP}(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K}))\) is:

$$\begin{aligned} Sc_{QP}(\mathcal {A}^{DD}(\varDelta _M \times \varOmega _{K})):=\frac{1}{QP}\times \frac{T(\mathcal {A}^{G}_{4DVar}(\varDelta _M \times \varOmega _{K}))}{ T(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i))}\,\,\,, \end{aligned}$$

where \(QP:=q \times p\) is the number of subdomains. It is:

$$\begin{aligned} Sc_{QP}(\mathcal {A}^{DD})\ge \frac{\rho ^{G}}{\rho ^{DD}} \alpha (N_{loc},QP)\,(QP)^{d-1} \end{aligned}$$
(55)

where

$$\begin{aligned} \alpha (N_{loc},QP)=\frac{a_d+a_{d-1}\frac{1}{N_{loc}}+\ldots +\frac{a_0}{N_{loc}^{d}}}{a_d+a_{d-1}\frac{QP}{N_{loc}}+\ldots +\frac{a_0(QP)^{d}}{N_{loc}^{d}}} \quad , \end{aligned}$$

and

$$\begin{aligned} \lim _{QP\rightarrow N_{loc}} \alpha (N_{loc},QP)=\beta \in ]0,1] \end{aligned}$$

Corollary 1

If \( a_i= 0 \quad \forall i\in [0,d-1]\), then \(\beta =1\), i.e.

$$\begin{aligned} \lim _{QP\rightarrow N_{loc}} \alpha (N_{loc},QP)=1 \end{aligned}$$

Finally

$$\begin{aligned} \lim _{N_{loc}\rightarrow \infty } \alpha (N_{loc},QP)=1. \end{aligned}$$

Corollary 2

If \(N_{loc}\) is fixed, it is

$$\begin{aligned} \lim _{QP\rightarrow N_{loc} }Sc_{1,QP}(\mathcal {A}^{DD})= \beta \cdot N_{loc}^{d-1} \quad ; \end{aligned}$$

while, if QP is fixed

$$\begin{aligned} \lim _{N_{loc}\rightarrow \infty }Sc_{1,QP}(\mathcal {A}^{DD})= const \ne 0\quad . \end{aligned}$$

From (55) it results that, considering one iteration of the whole parallel algorithm, the growth of the scale-up factor is essentially one order less than the time complexity of the reduced model. In other words, the time complexity of the reduced model mostly impacts the scalability of the parallel algorithm. In particular, as the parameter d is equal to 2, it follows that the asymptotic scaling factor of the parallel algorithm, with respect to QP, is bounded above by two.
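As a small illustration of how (55) can be evaluated, the sketch below computes the lower bound from the coefficients of the complexity polynomial and the iteration counts; the default coefficients correspond to the case \(d=2\) with \(a_0=a_1=0\), and all input values are hypothetical.

```python
def scaleup_lower_bound(N_loc, QP, rho_G, rho_DD, a=(0.0, 0.0, 1.0)):
    """Lower bound (55) on the scale-up factor of the DD algorithm (a sketch).

    a             : coefficients (a_0, ..., a_d) of the complexity polynomial P
    N_loc, QP     : local problem size and number of subdomains
    rho_G, rho_DD : total iteration counts of the global and decomposed algorithms
    """
    d = len(a) - 1
    num = sum(a[d - k] / N_loc**k for k in range(d + 1))          # numerator of alpha
    den = sum(a[d - k] * QP**k / N_loc**k for k in range(d + 1))  # denominator of alpha
    alpha = num / den
    return (rho_G / rho_DD) * alpha * QP**(d - 1)

# e.g. scaleup_lower_bound(N_loc=10_000, QP=16, rho_G=100, rho_DD=60) ~ (100/60)*16
```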

Besides the time complexity, scalability is also affected by the communication overhead of the parallel algorithm. The surface-to-volume ratio is a measure of the amount of data exchange (proportional to the surface area of the subdomain) per unit operation (proportional to the volume of the subdomain). We prove that

Theorem 2

The surface to volume ratio of a uniform bi-dimensional decomposition of space-time domain \(\varDelta _M \times \varOmega _{K}\), is

$$\begin{aligned} \frac{\mathcal {S}}{\mathcal {V}}=2\left( \frac{1}{D_t}+\frac{1}{D_s} \right) \quad . \end{aligned}$$
(56)

Let \(\mathcal {S}\) denote the surface of each subdomain, then

$$\begin{aligned} \mathcal {S}= 2\left( \frac{M}{q}+\frac{K}{p}\right) \end{aligned}$$

and \(\mathcal {V}\) denote its volume, then

$$\begin{aligned} \mathcal {V}= \frac{M}{q}\times \frac{K}{p} \quad . \end{aligned}$$

It holds that

$$\begin{aligned} \frac{\mathcal {S}}{\mathcal {V}}=\frac{2\left( \frac{M}{q}+ \frac{K}{p}\right) }{ \frac{M}{q}\times \frac{K}{p}}=2\left( \frac{1}{D_t}+\frac{1}{D_s}\right) \end{aligned}$$

and (56) follows. \(\square \)
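The ratio (56) is straightforward to evaluate; the following sketch (with placeholder problem sizes) computes it for a uniform \(q\times p\) decomposition of the \(M\times K\) space-time domain, using \(D_t=M/q\) and \(D_s=K/p\) as in the proof above.

```python
# Sketch: surface-to-volume ratio (56) of a uniform 2D decomposition of the
# M x K space-time domain into q x p subdomains.

def surface_to_volume(M, K, q, p):
    D_t, D_s = M / q, K / p          # local sizes in time and in space
    surface = 2 * (D_t + D_s)        # boundary data exchanged per subdomain
    volume = D_t * D_s               # local work per subdomain
    return surface / volume          # equals 2 * (1/D_t + 1/D_s)

print(surface_to_volume(M=1000, K=1000, q=10, p=10))   # 2*(1/100 + 1/100) = 0.04
```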

Definition 18

(Measured Software Scale-up) Let

$$\begin{aligned} Sc_{1,QP}^{meas}(\mathcal {A^{DD}}):= \frac{T_{flop}(N_{loc})}{QP \cdot (T_{flop}(N_{loc})+T_{oh}(N_{loc}))}. \end{aligned}$$
(57)

be the measured software scale-up in going from 1 to QP.

Proposition 4

If

$$\begin{aligned} 0 \le \frac{S}{V}(\mathcal {A}^{loc}_{4DVar})<1-\frac{1}{s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar})}\quad , \end{aligned}$$

then, it holds that

$$\begin{aligned} Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar})= \alpha (N_{loc},QP) Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) \end{aligned}$$
(58)

with

$$\begin{aligned} \alpha (N_{loc},QP)^{meas}(\mathcal {A}^{DD}_{4DVar})= \frac{T_{flop}(N_{loc})}{\frac{QP \,T_{flop}(N_{loc})}{s_{nproc}^{loc}}+QP\,T_{oh}(N_{loc})}= \frac{s_{nproc}^{loc}\frac{T_{flop}(N_{loc})}{QP\,T_{flop}(N_{loc})}}{1+\frac{s_{nproc}^{loc}T_{oh}(N_{loc})}{T_{flop}(N_{loc})}}.\nonumber \\ \end{aligned}$$
(59)

Proof

Setting

$$\begin{aligned} \alpha (N_{loc},QP):= \frac{s_{nproc}^{loc}}{1+\frac{s_{nproc}^{loc}T_{oh}(N_{loc})}{T_{flop}(N_{loc})}}= \frac{s_{nproc}^{loc}}{1+s_{nproc}^{loc}\frac{S}{V}} \end{aligned}$$

in (59), the thesis in (58) follows. \(\square \)

In the following we denote the measured scale-up by \(Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar})\) or, equivalently, by \(Sc_{1,QP}^{meas}(N)\).

Finally, the last proposition allows us to examine the benefit to the measured scale-up arising from the speed-up of the local parallel algorithm, mainly in the presence of a multilevel decomposition, where \(s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) >1\).

Proposition 5

It holds that

$$\begin{aligned} s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) \in [1,QP]\Rightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) \in ]Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) , QP \,Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) [. \end{aligned}$$

Proof

  • if \(s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) =1\) then

    $$\begin{aligned} \alpha (N,QP)<1 \Leftrightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) < Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) \end{aligned}$$
  • if \(s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) >1\) then

    $$\begin{aligned} \alpha (N,QP)>1 \Leftrightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) > Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) ; \end{aligned}$$
  • if \(s_{nproc}^{loc} (\mathcal {A}^{loc}_{4DVar})=QP\) then

    $$\begin{aligned} 1< \alpha (N,QP)<QP \Rightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) < QP \cdot Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) ; \end{aligned}$$

\(\square \)
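As an illustration of Definition 18 and of Propositions 4-5, the following sketch evaluates the measured scale-up when the local solver is accelerated by a factor \(s_{nproc}^{loc}\), as in (59); the timing values, and the speed-up of 18 (the average GPU gain reported in Table 2 of the next section), are placeholders rather than measurements.

```python
# Sketch: measured software scale-up of Definition 18, with a local solver
# accelerated by a factor s_loc as in (59).  t_flop and t_oh are the local
# computation and overhead times; all numbers below are illustrative.

def measured_scaleup(t_flop, t_oh, qp, s_loc=1.0):
    """Sc^meas_{1,QP} = T_flop / (QP * (T_flop/s_loc + T_oh)), cf. (57) and (59)."""
    return t_flop / (qp * (t_flop / s_loc + t_oh))

t_flop, t_oh = 1.0, 0.05                                 # hypothetical local times
for qp in (4, 16, 64):
    print(qp,
          measured_scaleup(t_flop, t_oh, qp, s_loc=1.0),    # CPU-only local solver
          measured_scaleup(t_flop, t_oh, qp, s_loc=18.0))   # GPU-accelerated local solver
```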

We may conclude that

  1. strong scaling: if QP increases while \(M\times K\) is fixed, the scale-up factor increases, but the surface-to-volume ratio increases as well;

  2. weak scaling: if QP is fixed and \(M \times K\) increases, the scale-up factor stagnates while the surface-to-volume ratio decreases.

This means that one needs to find the appropriate number of subdomains, QP, giving the right trade-off between the scale-up and the overhead of the algorithm.
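The trade-off can be made visible with a short parameter sweep; the sketch below (with illustrative sizes and a square decomposition \(q=p=\sqrt{QP}\)) shows, for a fixed \(M\times K\), how the leading term \(QP^{d-1}\) of the scale-up bound and the surface-to-volume ratio of Theorem 2 both grow with QP.

```python
# Sketch: strong-scaling trade-off for a fixed M x K space-time problem.
# Increasing QP raises the leading term of the scale-up bound (here QP**(d-1)
# with d = 2) but also the surface-to-volume ratio, i.e. the communication
# overhead per unit of local work.

import math

M = K = 4096                               # fixed (illustrative) problem size
for qp in (4, 16, 64, 256):
    q = p = int(math.sqrt(qp))             # square decomposition
    d_t, d_s = M / q, K / p
    sc_leading = qp                        # QP**(d-1) with d = 2
    s_over_v = 2 * (1 / d_t + 1 / d_s)     # Theorem 2
    print(f"QP={qp:4d}  scale-up ~ {sc_leading:4d}  S/V = {s_over_v:.4f}")
```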

7 Scalability Results

The results presented here are just a starting point towards the assessment of the software scalability. More precisely, we introduce simplifications and assumptions appropriate for a proof-of-concept study in order to obtain values of the measured scale-up of one iteration of the parallel algorithm.

As the main outcome of the decomposition is that the parallel algorithm is oriented to better exploiting the high performance of new architectures, where concurrency is implemented both at the coarsest and the finest levels of granularity, such as distributed-memory multiprocessors (MIMD) and Graphics Processing Units (GPUs) [56], we consider a distributed computing environment located on the University of Naples Federico II campus, connected by a local-area network and made of:

  • \(PE_1\) (for the coarsest level of granularity): a Multiple-Instruction, Multiple-Data (MIMD) architecture made of 8 nodes, consisting of distributed-memory DELL M600 blades connected by 10 Gigabit Ethernet. Each blade consists of 2 Intel Xeon@2.33GHz quad-core processors sharing the same local 16 GB of RAM, for a total of 8 cores per blade and of 64 cores overall.

  • \(PE_2\) (for the finest level of granularity): a Kepler architecture of the GK110 GPU [46], which consists of a set of 13 programmable Single-Instruction, Multiple-Data (SIMD) Streaming Multiprocessors (SMXs), connected to a quad-core Intel i7 CPU running at 3.07GHz with 12 GB of RAM. For host(CPU)-to-device(GPU) memory transfers, CUDA-enabled graphics cards are connected to the PC motherboard via a PCI-Express (PCIe) bus [48]. For this architecture the maximum number of active threads per multiprocessor is 2048, which means that the maximum number of active warps per SMX is 64.

Our implementation uses the matrix and vector functions in the Basic Linear Algebra Subprograms (BLAS) for \(PE_1\) and the CUDA Basic Linear Algebra Subroutines (CUBLAS) library for \(PE_2\). The routines used for computing the minimum of J on \(PE_1\) and \(PE_2\) are described in [28] and [10], respectively.

The case study is based on the Shallow Water Equations (SWEs) on the sphere. The SWEs have been used extensively as a simple model of the atmosphere or ocean circulation, since they contain the essential wave-propagation mechanisms found in general circulation models [52].

The SWEs in spherical coordinates are:

$$\begin{aligned} \frac{\partial u}{\partial t}= & {} - \frac{1}{a \cos {\theta }} \left( u \frac{\partial u}{\partial \lambda } + v \cos {\theta }\frac{\partial u}{\partial \theta }\right) + \left( f + \frac{u\tan {\theta }}{a}\right) v - \frac{g}{a \cos {\theta }} \frac{\partial h}{\partial \lambda } \end{aligned}$$
(60)
$$\begin{aligned} \frac{\partial v}{\partial t}= & {} - \frac{1}{a \cos {\theta }} \left( u \frac{\partial v}{\partial \lambda } + v \cos {\theta }\frac{\partial v}{\partial \theta }\right) - \left( f + \frac{u\tan {\theta }}{a}\right) u - \frac{g}{a} \frac{\partial h}{\partial \theta } \end{aligned}$$
(61)
$$\begin{aligned} \frac{\partial h}{\partial t}= & {} - \frac{1}{a \cos {\theta }} \left( \frac{\partial \left( hu\right) }{\partial \lambda } + \frac{\partial \left( hv\cos {\theta }\right) }{\partial \theta }\right) \end{aligned}$$
(62)

Here f is the Coriolis parameter, given by \(f = 2\varOmega \sin {\theta }\), where \(\varOmega \) is the angular speed of the rotation of the Earth; h is the height of the homogeneous atmosphere (or of the free ocean surface); u and v are the zonal and meridional wind (or ocean velocity) components, respectively; \(\theta \) and \(\lambda \) are the latitudinal and longitudinal directions, respectively; a is the radius of the Earth and g is the acceleration due to gravity.

We express the system of equations (60)–(62) in compact form, i.e.:

$$\begin{aligned} \frac{\partial {\mathbf {Z} }}{\partial t} = \mathcal {M}_{t-\varDelta t\rightarrow t}\left( {\mathbf {Z}} \right) \end{aligned}$$
(63)

where

$$\begin{aligned} {\mathbf {Z}}=\left( \begin{array}{c} u \\ v \\ h \end{array} \right) \end{aligned}$$
(64)

and

$$\begin{aligned} \mathcal {M}_{t-\varDelta t\rightarrow t}\left( {\mathbf {Z}} \right)= & {} \left( \begin{array}{l} - \frac{1}{a \cos {\theta }} \left( u \frac{\partial u}{\partial \lambda } + v \cos {\theta }\frac{\partial u}{\partial \theta }\right) + \left( f + \frac{u\tan {\theta }}{a}\right) v - \frac{g}{a \cos {\theta }} \frac{\partial h}{\partial \lambda } \\ - \frac{1}{a \cos {\theta }} \left( u \frac{\partial v}{\partial \lambda } + v \cos {\theta }\frac{\partial v}{\partial \theta }\right) - \left( f + \frac{u\tan {\theta }}{a}\right) u - \frac{g}{a} \frac{\partial h}{\partial \theta } \\ - \frac{1}{a \cos {\theta }} \left( \frac{\partial \left( hu\right) }{\partial \lambda } + \frac{\partial \left( hv\cos {\theta }\right) }{\partial \theta }\right) \end{array} \right) \nonumber \\= & {} \left( \begin{array}{l} F_1 \\ F_2 \\ F_3 \end{array} \right) \end{aligned}$$
(65)

We discretize (63) in space only, using an un-staggered Turkel–Zwas scheme [37, 38], and we obtain:

$$\begin{aligned} \frac{\partial {\mathbf {Z}_{disc}}}{\partial t} = \mathcal {M}^{t-\varDelta t\rightarrow t}_{disc}\left( {\mathbf {Z}_{disc}} \right) \end{aligned}$$
(66)

where

$$\begin{aligned} { \mathbf {Z}_{disc}}=\left( \begin{array}{c} \left( u_{i,j}\right) _{i=0,\ldots ,nlon-1;j=0,\ldots ,nlat-1} \\ \left( v_{i,j}\right) _{i=0,\ldots ,nlon-1;j=0,\ldots ,nlat-1} \\ \left( h_{i,j}\right) _{i=0,\ldots ,nlon-1;j=0,\ldots ,nlat-1} \\ \end{array} \right) \end{aligned}$$
(67)

and

$$\begin{aligned} \mathcal {M}^{t-\varDelta t\rightarrow t}_{disc}\left( {\mathbf {Z}_{disc}} \right) = \left( \begin{array}{l} \left( U_{i,j}\right) _{i=0,\ldots ,nlon-1;j=0,\ldots ,nlat-1} \\ \left( V_{i,j}\right) _{i=0,\ldots ,nlon-1;j=0,\ldots ,nlat-1} \\ \left( H_{i,j}\right) _{i=0,\ldots ,nlon-1;j=0,\ldots ,nlat-1} \\ \end{array} \right) \end{aligned}$$
(68)

so that

$$\begin{aligned} U_{i,j}= & {} - \sigma _{lon} \frac{u_{i,j}}{\cos {\theta _j}} \left( u_{i+1,j} - u_{i-1,j}\right) \\&- \sigma _{lat} \ {v_{i,j}} \left( u_{i,j+1} - u_{i,j-1}\right) \\&- \sigma _{lon} \frac{g}{p \cos {\theta _j}} \left( h_{i+p,j} - h_{i-p,j}\right) \\&+ 2 \left[ \left( 1-\alpha \right) \left( 2\varOmega \sin {\theta _j}+ \frac{u_{i,j}}{a}\tan \theta _j\right) v_{i,j} \right. \\&+ \left. \frac{\alpha }{2} \left( 2\varOmega \sin {\theta _j}+ \frac{u_{i+p,j}}{a}\tan \theta _j\right) v_{i+p,j} \right. \\&+ \left. \frac{\alpha }{2} \left( 2\varOmega \sin {\theta _j}+ \frac{u_{i-p,j}}{a}\tan \theta _j\right) v_{i-p,j} \right] \\ V_{i,j}= & {} - \sigma _{lon} \frac{u_{i,j}}{\cos {\theta _j}} \left( v_{i+1,j} - v_{i-1,j}\right) \\&- \sigma _{lat} \ {v_{i,j}} \left( u_{i,j+1} - u_{i,j-1}\right) \\&- \sigma _{lat} \frac{g}{q} \left( h_{i,j+q} - h_{i,j-q}\right) \\&- 2 \left[ \left( 1-\alpha \right) \left( 2\varOmega \sin {\theta _j}+ \frac{u_{i,j}}{a}\tan \theta _j\right) u_{i,j} \right. \\&+ \left. \frac{\alpha }{2} \left( 2\varOmega \sin {\theta _{j+q}}+ \frac{u_{i,j+q}}{a}\tan \theta _{j+q}\right) u_{i,j+q} \right. \\&+ \left. \frac{\alpha }{2} \left( 2\varOmega \sin {\theta _{j-q}}+ \frac{u_{i,j-q}}{a}\tan \theta _{j-q}\right) u_{i,j-q} \right] \\ H_{i,j}= & {} - \alpha \left\{ \frac{u_{i,j}}{\cos {\theta _j}} \left( h_{i+1,j} - h_{i-1,j}\right) \right. \\&+ \ {v_{i,j}} \left( h_{i,j+1} - h_{i,j-1}\right) \\&+ \frac{h_{i,j}}{\cos {\theta _j}} \left[ \left( 1-\alpha \right) \left( u_{i+p,j} - u_{i-p,j}\right) \right. \\&+ \left. \frac{\alpha }{2} \left( u_{i+p,j+q} - u_{i-p,j+q} + u_{i+p,j-q} - u_{i-p,j-q}\right) \right] \frac{1}{p} \\&+ \left[ \left( 1-\alpha \right) \left( v_{i,j+q}\cos {\theta _{j+q}} - v_{i,j-q}\cos {\theta _{j-q}}\right) \right. \\&+ \frac{\alpha }{2} \left( v_{i+p,j+q}\cos {\theta _{j+q}} - v_{i+p,j-q}\cos {\theta _{j-q}} \right) \\&\left. \left. +\frac{\alpha }{2} \left( v_{i-p,j+q}\cos {\theta _{j+q}} - v_{i-p,j-q}\cos {\theta _{j-q}} \right) \right] \frac{1}{q} \right\} \end{aligned}$$
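The semi-discrete system (66) is then advanced in time. The following sketch (ours, for illustration only; the integrator actually employed in the experiments may differ) shows the structure of such an advance with a simple explicit Euler step, assuming a user-supplied routine rhs that evaluates the Turkel–Zwas right-hand side \((U_{i,j},V_{i,j},H_{i,j})\) defined above.

```python
# Sketch: explicit time advance of the semi-discrete SWE state (66).
# Z = (u, v, h), each field of shape (nlon, nlat); rhs(u, v, h) is assumed to
# return the arrays (U, V, H) of the Turkel-Zwas discretization above.

import numpy as np

def advance(Z, rhs, dt, nsteps):
    u, v, h = (np.array(x, dtype=float) for x in Z)
    for _ in range(nsteps):
        U, V, H = rhs(u, v, h)
        u = u + dt * U
        v = v + dt * V
        h = h + dt * H
    return u, v, h
```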

The numerical model depends on a combination of physical parameters, including the number of state variables in the model and the number of observations in an assimilation cycle, as well as on numerical parameters such as the discretization steps in the time and space domains; these are defined on the basis of the discretization grid of the available data, taken from the Ocean Synthesis/Reanalysis Directory repository of Hamburg University (see [15]).

To begin our data assimilation, an initial value of the model state is created by choosing a snapshot from a run prior to the start of the assimilation experiment and treating it as a realization valid at the nominal time. Then, the model state is advanced to the next time using the forecast model, and the observations are combined with the forecasts (i.e., the background) to produce the analysis. This process is iterated. As it proceeds, the process fills gaps in sparsely observed regions, converts observations into improved estimates of model variables, and filters observation noise. All this is done in a manner that is physically consistent with the dynamics of the ocean as represented by the model. In our experiments, the simulated observations are created by sampling the model states and adding random errors to those values. A detailed description of the simulation, together with the results and the software implemented, is presented in [11]. In the following, we mainly focus on performance results.
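A minimal sketch of how such simulated observations can be generated is given below; the subsampling observation operator, the noise level and the field size are illustrative assumptions and do not reproduce the actual setup of [11].

```python
# Sketch: "identical twin" observations obtained by sampling a model field at
# observation locations and adding Gaussian random errors.

import numpy as np

rng = np.random.default_rng(0)

def simulate_observations(field, stride=4, sigma_obs=0.05):
    """Sample the 2D field every `stride` grid points and add N(0, sigma^2) noise."""
    sampled = field[::stride, ::stride]
    return sampled + sigma_obs * rng.standard_normal(sampled.shape)

h_true = rng.standard_normal((64, 64))     # placeholder model field
y_obs = simulate_observations(h_true)
```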

The reference domain decomposition strategy uses the following correspondence between QP and nproc:

$$\begin{aligned} QP\leftrightarrow nproc, \end{aligned}$$

which means that the number of subdomains coincides with the number of available processors.

According to the characteristics of the physical domain in SWEs, the total number of grid points in space is

$$\begin{aligned} M=nlon \times nlat\times n_z \quad . \end{aligned}$$

Let us assume that

$$\begin{aligned} nlon=nlat=n \end{aligned}$$

while \(n_z=3\). Since the unknowns are the fluid height (or depth) and the two components of the fluid velocity field, the problem size in space is

$$\begin{aligned} M=n^2 \times 3\quad . \end{aligned}$$

We assume a 2D uniform domain decomposition along the latitude-longitude directions, such that

$$\begin{aligned} D_s:= \frac{M}{p}= nloc_x \times nloc_y \times 3 \end{aligned}$$
(69)

with

$$\begin{aligned} nloc_x := \frac{n }{p_1} +2o_x\,\,,\, nloc_y := \frac{n}{p_2} +2o_y\,\,\,,\,n_z := 3\,\,, \end{aligned}$$
(70)

where \(p_1 \times p_2= p\).
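For illustration, the local sizes in (69)–(70) can be computed as in the following sketch; the overlap widths \(o_x,o_y\) and the decomposition \(p_1\times p_2\) are placeholder values.

```python
# Sketch: local problem sizes of the 2D uniform latitude-longitude decomposition
# in (69)-(70); each subdomain stores nloc_x * nloc_y * 3 unknowns (u, v, h),
# including the overlap regions of widths o_x and o_y.

def local_sizes(n, p1, p2, o_x=1, o_y=1):
    nloc_x = n // p1 + 2 * o_x
    nloc_y = n // p2 + 2 * o_y
    D_s = nloc_x * nloc_y * 3              # local size in space, cf. (69)
    return nloc_x, nloc_y, D_s

n = 256                                    # nlon = nlat = n (illustrative)
print(local_sizes(n, p1=4, p2=4))          # 16 subdomains in space
```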

Since the GPU (\(PE_2\)) can only process data residing in its global memory [55], in a generic parallel algorithm execution the host acquires the input data and sends them to the device memory, and the device then carries out the minimization of the 4DVar functional. To avoid continuous, relatively slow data transfers from host to device, and in order to reduce the overhead, we decided to load the entire working data set onto the device prior to any processing. Namely, the maximum value of \(D_s\) in (69) is chosen such that the amount of data related to each subdomain (which we denote by \(Data_{mem}\), in Mbyte) can be completely stored in the device memory.

If we assume that \(nloc_x=nloc_y\) and let \(n_{loc}=nloc_x=nloc_y\), since the global GPU memory is 5 Gbyte, we obtain the usable values of \(n_{loc}\) reported in Table 1. Values of the speed-up \(s_{nproc}^{loc}\), in terms of the gain obtained by using the GPU versus the CPU, are reported in Table 2. We note that the CUBLAS routines reduce, on average by a factor of 18, the execution time required by a single CPU for the minimization part (Table 3).

Table 1 The amount of memory required to store data related to each subdomain on \(PE_2\) expressed in Mbyte
Table 2 Values of the speed-up \(s_{nproc}^{loc}\) in terms of gain obtained by using the GPU versus the CPU
Table 3 Weak scalability of one iteration of the parallel algorithm \(\mathcal {A}^{DD}_{4DVar}\) with \(n_{loc}=32\) computed using the measured software Scale-up \(Sc^{meas}_{1,QP}\) defined in (57)

The outcome we get from these experiments is that the algorithm scales up according to the performance analysis (see Fig. 2). Indeed, as expected, as QP increases the scale-up factor increases, but the surface-to-volume ratio increases too, so that the performance gain tends to become stationary. This is the inherent trade-off between speed-up and efficiency of any software architecture.

Fig. 2 Weak scalability of one iteration of the parallel algorithm \(\mathcal {A}^{DD}_{4DVar}\) with \(n_{loc}=32\), computed using the measured software scale-up \(Sc^{meas}_{1,QP}\) defined in (57)

8 Conclusions

We provided the whole computational framework of a space-time decomposition approach, including the mathematical framework, the numerical algorithm and, finally, its performance validation. We measured the performance of the algorithm using a simulation case study based on the SWEs on the sphere. The results presented here are just a starting point towards the assessment of the software scalability; more precisely, we introduced simplifications and assumptions appropriate for a proof-of-concept study in order to obtain values of the measured scale-up of one iteration of the parallel algorithm. The overall insight we get from these experiments is that the algorithm scales up according to the performance analysis. We are currently working on the development of a flexible framework ensuring efficiency and code readability, exploiting future technologies and equipped with a quantitative assessment of scalability [57]. In this regard, we could combine the proposed approach with the PFASST algorithm: indeed, PFASST could be concurrently employed as the local solver of each reduced-space PDE-constrained optimization subproblem, exposing even more temporal parallelism.

This framework will allow designing, planning and running simulations to identify and overcome the limits of this approach.