Abstract
We address the development of innovative algorithms designed to solve the strong-constraint Four-Dimensional Variational Data Assimilation (4DVar DA) problem in large-scale applications. We present a space-time decomposition approach which employs a decomposition of the whole domain, i.e. along both the spatial and the temporal direction in the overlapping case, together with a partitioning of both the solution and the operator. Starting from the global functional defined on the entire domain, we derive a set of regularized local functionals on the subdomains, providing an order reduction of both the predictive and the Data Assimilation models. Convergence of the algorithm is proved. Performance, in terms of reduction of time complexity and algorithmic scalability, is discussed for the Shallow Water Equations on the sphere. The number of state variables in the model and the number of observations in an assimilation cycle, as well as numerical parameters such as the discretization steps in time and space, are defined on the basis of the discretization grid used by the data available at the Ocean Synthesis/Reanalysis Directory repository of Hamburg University.
1 Introduction and Motivations
Assimilation of observations into models is a well-established practice in the meteorological community. Given the number of model variables (on the order of \(10^7\) or \(10^8\)) and of observations (on the order of \(10^6\)) in use in operational models, a variety of approaches have been proposed for reducing the complexity of an assimilation method to a computationally affordable version while retaining its advantages. Ensemble approaches and reduced-order models are the most significant approximations. Other approaches take full advantage of existing Partial Differential Equation (PDE)-based solvers built on spatial Domain Decomposition (DD) methods, where the DD solver is suitably modified to also handle the adjoint system. A different approach is the combination of DD methods in space and DA, in which a spatially domain-decomposed uncertainty quantification approach performs DA at the local level by using Monte Carlo sampling [1, 2, 27, 54]. The Parallel Data Assimilation Framework [41] implements parallel ensemble-based Kalman Filter algorithms coupled with the PDE model solver.
The above-mentioned methods reduce the spatial dimensionality of the predictive model; the resulting reduced-order model is then integrated in time, typically with the same time integrator and time step employed for the high-fidelity model, which imposes tight time synchronization. In recent decades, parallel-in-time methods have been investigated for reducing the temporal dimensionality of evolutionary problems. Since Nievergelt, who in 1964 proposed the first time-decomposition algorithm for the parallel solution of evolutionary ordinary differential equations, and Hackbusch, who in 1984 noted that relaxation operators in multigrid can be employed on multiple time steps simultaneously, time-parallel integration methods have been extensively expanded and several relevant works can be found in the literature. An extensive and updated literature list can be found at the website [47], collecting information about people, methods and software in the field of parallel-in-time integration methods. Among them, we mention the Parallel Full Approximation Scheme in Space and Time (PFASST), introduced in [14]. PFASST is based on a simultaneous approach reducing the optimization overhead by integrating the PDE-based model directly into the optimization process, thus solving the PDE, the adjoint equations and the optimization problem simultaneously. Recently, a nonintrusive framework for integrating existing unsteady partial differential equation (PDE) solvers into a parallel-in-time simultaneous optimization algorithm, using PFASST, was provided in [22]. Finally, we cite the parallel PDE solvers based on Schwarz preconditioners in space-time, proposed in [19, 29, 53].
We propose the design of an innovative mathematical model, and the development and analysis of the related numerical algorithms, based on the simultaneous introduction of space-time decomposition, in the overlapping case, into the PDEs governing the physical model and into the DA model. The core of our approach is that the DA model acts as a coarse/predictor operator solving the local PDE model, providing the background values as initial conditions of the local PDE models. Moreover, in contrast to other decomposition-in-time approaches, in our approach the local solvers (i.e. both the coarse and the fine solvers) run concurrently from the beginning. As a consequence, the resulting algorithm only requires the exchange of boundary conditions between adjacent subdomains. It is worth mentioning that the proposed method belongs to the so-called reduced-space optimization techniques, in contrast to full-space approaches such as the PFASST method, reducing the runtime of the forward and the backward integration time loops. As a consequence, we could combine the proposed approach with the PFASST algorithm. Indeed, PFASST could be concurrently employed as the local solver of each reduced-space PDE-constrained optimization subproblem, exposing even more temporal parallelism.
The article first describes the general DA problem setup used to introduce the domain decomposition approach; it then focuses on the parallel algorithm solving the reduced-order model, analysing the impact of the space and time decomposition on the performance of the algorithm, and finally provides an analysis of the algorithm's scalability. The results presented here should be regarded as a starting point for the software development, supporting decisions about computer architecture, future estimates of the problem size (e.g., the resolution of the model and the number of observations to be assimilated), and the performance and parallel scalability of the algorithms.
In conclusion, specific contributions of this work include:

a novel space-time decomposition approach leading to a reduced-order model of the coupled PDE-based 4DVar DA problem;

strategies for computing the ‘kernels’ of the resulting Regularized Nonlinear Least Squares (RNLLS) computational problem;

a priori performance analysis that enables a suitable implementation of the algorithm in advanced computing environments.
The article is organized as follows. Sect. 2 gives a brief introduction to the Data Assimilation framework, where we follow the discretize-then-optimize approach. The main result is the 4DVar functional decomposition, which is given in Sect. 3. In Sect. 4 we describe the whole parallel algorithm, whose convergence is analysed in Sect. 5. Its performance is discussed in Sect. 6 on the Shallow Water Equations on the sphere, including a scalability prediction for this case study. The number of state variables in the model and the number of observations in an assimilation cycle, as well as numerical parameters such as the discretization steps in time and space, are defined on the basis of the discretization grid used by the data available at the Ocean Synthesis/Reanalysis Directory repository of Hamburg University (see [15]). Finally, conclusions are provided in Sect. 7.
2 The Data Assimilation framework
We start with the general DA problem setup; later, for simplicity of exposition of the domain decomposition approach, we will consider a more convenient setup.
Let \(\mathcal {M}^{\varDelta \times \varOmega }\) denote a forecast model described by nonlinear Navier–Stokes equations^{Footnote 1}, where \(\varDelta \subset \Re \) is the time interval and \(\varOmega \subset \Re ^N\) is the spatial domain. If \(t \in \varDelta \) denotes the time variable and \(x \in \varOmega \) the spatial variable, let^{Footnote 2}
be the function, which we assume belongs to the Hilbert space \(\mathcal {K}(\varDelta \times \varOmega )\) equipped with the standard Euclidean norm, representing the solution of \(\mathcal {M}^{\varDelta \times \varOmega }\). Following [8], we assume that \(\mathcal {M}^{\varDelta \times \varOmega }\) is symbolically described by the following initial value problem:
The function \(u^b(t,x)\) is called the background state in \(\varDelta \times \varOmega \). The function \(u^b_0(x)\) is the initial condition of \(\mathcal {M}^{\varDelta \times \varOmega }\), i.e. the value of the background state on \(\{t_0\}\times \varOmega \). Let:
where \(\varDelta '\subset \varDelta \) is the observation time interval and \(\varOmega ' \subset \Re ^{nobs}\), with \(\varOmega '\subset \varOmega \), is the observation spatial domain. Finally,
denotes the observation mapping, where \( \mathcal {H}\) is a nonlinear operator which includes transformations and grid interpolations. According to the practical applications of model-based assimilation of observations, we will use the following definition of a Data Assimilation (DA) problem associated with \(\mathcal {M}^{\varDelta \times \varOmega }\).
Definition 1
(The DA problem setup) Let^{Footnote 3}

\(\{t_k\}_{k=0,M-1}\), where \(t_k=t_0+k \varDelta t\), be a discretization of \(\varDelta \), such that \(\varDelta _M:=[t_0,t_{M-1}] \subseteq \varDelta .\)

\(D_{K}(\varOmega ):=\{x_j\}_{j=1,K}\in \Re ^{K}\), be a discretization of \(\varOmega \), such that \(D_{K}(\varOmega ) \subseteq \varOmega \quad .\)

\(\varDelta _M\times \varOmega _{K}=\{\mathbf {z}_{ji}:=(t_j,x_i)\}_{i=1,K;j=1,M}\);

\(\mathbf {u}_0^{b} :=\{u_0^{b,j}\}_{j=1,K} \equiv \{ u^{b}(t_0,x_j) \}_{j=1,K}\in \Re ^{K}\) be the discretization of the initial value in (1);

\(\mathbf {u}_k^b:= \{ u^b(t_k,x_j) \}_{j=1,K}\in \Re ^K\) be the numerical solution of (1) at \(t_k\);

\(\mathbf {u}^b= \{\mathbf {u}_k^b\}_{k=0,M-1}\);

\(nobs \ll K\);

\(\varDelta '_M=[\tau _0, \tau _{M-1}]\subseteq \varDelta _M\);

\(D'_{nobs}(\varOmega '):=\{x_j\}_{j=1,nobs}\in \Re ^{nobs}\), be a discretization of \(\varOmega '\), such that
$$\begin{aligned} D'_{nobs}(\varOmega ') \subseteq \varOmega '\quad . \end{aligned}$$ 
\(\mathbf {v}_k:=\{v(\tau _k,x_j)\}_{j=1,nobs}\in \Re ^{nobs}\) be the values of the observations on \(x_j\) at \(\tau _k\);

\(\mathbf {v}= \{\mathbf {v}_k\}_{k=0,M-1}\in \Re ^K\);

\(\{\mathbf {H}^{(k)}\}_{k=0,M-1}\), the Tangent Linear Model (TLM) of \(\mathcal {H}(u(t_k,x))\) at time \(t_k\);

\(\mathbf {M}^{\varDelta _M \times \varOmega _{K}}\) be a discretization of \(\mathcal {M}^{\varDelta \times \varOmega }\).

\(\mathbf {M}^{0,M-1}\), is the TLM of \(\mathcal {M}^{\varDelta \times \varOmega }\), i.e. it is the first order linearization^{Footnote 4} of \(\mathcal {M}^{\varDelta \times \varOmega }\) in \(\varDelta _M \times \varOmega _{K}\) [24];

\(\mathbf {M}^T\) is the Adjoint Model (ADM)^{Footnote 5} of \(\mathbf {M}^{0,M-1}\) [20]^{Footnote 6}.
The aim of DA is to produce the optimal combination of the background and the observations throughout the assimilation window \(\varDelta '_M\), i.e. to find an optimal trade-off between the estimate of the system state \(\mathbf {u}^b\) and the observations \(\mathbf {v}\). The best estimate that optimally fuses all this information is called the analysis, denoted \(\mathbf {u}^{DA}\); it is then used as the initial condition for the next forecast.
Definition 2
(The 4DVar DA problem: a regularized nonlinear least square problem (RNLLS)) Given the DA problem setup, the 4DVar DA problem consists in computing the vector \(\mathbf {u}^{DA}\in \Re ^{K}\) such that
with
where \(\lambda >0\) is the regularization parameter, \(\mathbf{B}\) and \(\mathbf{R}_k\) (\(\forall k= 0,M-1\)) are the covariance matrices of the errors on the background and the observations, respectively, while \(\Vert \cdot \Vert _{\mathbf{B}^{-1}}\) and \(\Vert \cdot \Vert _{\mathbf{R}_k^{-1}}\) denote the corresponding weighted Euclidean norms.
The first term of (6) quantifies the departure of the solution \(\mathbf {u}^{DA}\) from the background state \(\mathbf {u}^{b}\). The second term measures the sum of the mismatches between the new trajectory and the observations \(\mathbf {v}_k\), for each time \(t_k\) in the assimilation window. The weighting matrices \(\mathbf{B}\) and \(\mathbf{R}_k\) need to be predefined, and their quality influences the accuracy of the resulting analysis [3].
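To make the structure of (6) concrete, the following sketch evaluates a 4DVar-type cost on a toy problem; the trivial model step, the selection observation operator and the identity covariances are our own illustrative choices, not those of the paper:

```python
import numpy as np

def fourdvar_cost(u0, ub0, obs, M_step, H, B_inv, R_inv, lam=1.0):
    """Toy strong-constraint 4DVar cost (sketch).

    u0     : candidate initial state (K,)
    ub0    : background initial state (K,)
    obs    : list of observation vectors v_k, one per time t_k
    M_step : function advancing the state by one time step
    H      : observation operator, state space -> observation space
    B_inv  : inverse background-error covariance (weights the first term)
    R_inv  : inverse observation-error covariance (weights the mismatches)
    """
    d = u0 - ub0                       # departure from the background
    J = lam * d @ B_inv @ d
    u = u0.copy()
    for v_k in obs:                    # accumulate observation mismatches
        r = H(u) - v_k
        J += r @ R_inv @ r
        u = M_step(u)                  # propagate along the window
    return J

# toy setup: identity dynamics, observe the first nobs grid points
K, nobs = 4, 2
B_inv, R_inv = np.eye(K), np.eye(nobs)
M_step = lambda u: u
H = lambda u: u[:nobs]
ub0 = np.zeros(K)
obs = [np.zeros(nobs)]
J_at_background = fourdvar_cost(ub0, ub0, obs, M_step, H, B_inv, R_inv)
```

The analysis \(\mathbf {u}^{DA}\) would then be the minimiser of such a functional over the initial state.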
As K generally exceeds \(10^6\), this problem can be considered a large-scale Nonlinear Least Squares problem. We provide a mathematical formulation of a domain decomposition approach, which starts from a decomposition of the whole domain \(\varDelta \times \varOmega \), namely of both the spatial and the temporal domain; it uses a partitioning of the solution and a modified functional describing the RNLLS problem on each subdomain of the decomposition. Solution continuity equations across interval boundaries are added as constraints of the assimilation functional. We will first introduce the domain decomposition of \(\varDelta \times \varOmega \); then restriction and extension operators will be defined on functions given on \(\varDelta \times \varOmega \). These definitions will subsequently be generalized to \(\varDelta _M \times \varOmega _{K}\).
3 The Space-Time Decomposition of the Continuous 4DVar DA Model
In this section we give a precise mathematical setting for the space and function decomposition, and we fix some notation used later. In particular, we first introduce the function and domain decomposition; then, by using restriction and extension operators, we associate a functional decomposition with the domain decomposition. Then we prove the following result: the minimum of the global functional, defined on the entire domain, can be obtained by collecting the minima of the local functionals.
For simplicity we assume that the spatial and temporal domains of the observations are the same as those of the background state, i.e. \(\varDelta '=\varDelta \) and \(\varOmega '=\varOmega \); furthermore we assume that \(t_k=\tau _k\).
Definition 3
(Domain Decomposition) Let \(P\in \mathbf {N}\) and \(Q\in \mathbf {N}\) be fixed. The set \(\{\varOmega _i\}\) of bounded Lipschitz domains, overlapping subdomains of \(\varOmega \):
is called a decomposition of \(\varOmega \), if
with
when two subdomains are adjacent. Similarly, the set of overlapping subdomains of \(\varDelta \):
is a decomposition of \(\varDelta \), if
with
when two subdomains are adjacent. We call domain decomposition of \(\varDelta \times \varOmega \) and we denote it as \(DD(\varDelta \times \varOmega )\), the set of \(P\times Q\) overlapping subdomains of \(\varDelta \times \varOmega \):
From (11) it follows that
Associated with the decomposition (11), we define the Restriction Operator of functions belonging to \(\mathcal {K}(\varDelta \times \varOmega )\):
Definition 4
(Restriction of a function) Let
be the Restriction Operator (RO) of f in \(DD(\varDelta \times \varOmega )\), as in (11), such that:
We set:
For simplicity, if \(i \equiv j\), we denote \(RO_{ii}=RO_{i}\).
In line with this, given a set of \(Q\times P\) functions \(g_{ji}\), \(j=1,Q\), \(i=1,P\), each belonging to \(\mathcal {K}(\varDelta _j\times \varOmega _i)\), we define the Extension Operator of \(g_{ji}\):
Definition 5
(Extension of a function) Let
be the Extension Operator (EO) of \(g_{ji}\) in \(DD(\varDelta \times \varOmega )\), as in (11), such that:
We set:
For any function \(u\in \mathcal {K}(\varDelta \times \varOmega )\), associated with the decomposition (8), it holds that
Given \(P\times Q\) functions \(u_{ji}(t,x) \in \mathcal {K}(\varDelta _j\times \varOmega _i)\), the summation
defines a function \(u \in \mathcal {K}(\varDelta \times \varOmega )\) such that:
The main outcome of this framework is the definition of the operator \(RO_{ji}\) for the 4DVar functional defined in (6). This definition originates from the definition of the restriction operator of \(\mathcal {M}^{\varDelta \times \varOmega }\) in (1), given as follows.
Definition 6
(Reduction of \(\mathcal {M}^{\varDelta \times \varOmega }\)) If \(\mathcal {M}^{\varDelta \times \varOmega }\) is defined in (1), we introduce the model \(\mathcal {M}^{\varDelta _j \times \varOmega _i}\) to be the Reduction of \(\mathcal {M}^{\varDelta \times \varOmega }\):
defined in \(\varDelta _j \times \varOmega _i\), such that:
It is worth noting that the initial condition \(u^b_j(x)\) is the value at \(t_j\) of the solution of \(\mathcal {M}^{\varDelta \times \varOmega }[u(t_0,x)]\) defined in (1).
3.1 Space-Time Decomposition of the Discrete Model
Let us assume that \(\varDelta _M\times \varOmega _{K}\) can be decomposed into a sequence of \(P\times Q\) overlapping subdomains \(\varDelta _j \times \varOmega _i\) such that
where \(\varOmega _i \subset \Re ^{r_i}\) with \(r_i \le K\) and \(\varDelta _j \subset \Re ^{s_j}\) with \(s_j\le M\). Finally, let us assume that
Hence
In this respect, we define the Extension Operator (EO) also. If \(\mathbf {u}=(u(z_{ji}))_{z_{ji} \in \varDelta _j \times \varOmega _i}\), it is
and \( EO(\mathbf {u}) \equiv \mathbf {u^{EO}} \in \Re ^{M\times K}\).
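In the discrete setting, the restriction operator reduces to selecting the grid values with indices in the subdomain, and the extension operator to scattering local values back into a zero vector. A toy 1-D sketch, where the grid size, number of subdomains and overlap width are our own illustrative choices:

```python
import numpy as np

def make_subdomains(K, P, overlap):
    """Overlapping decomposition of a K-point grid into P index sets;
    adjacent subdomains share `overlap` extra points on each side."""
    size = K // P
    return [np.arange(max(0, i * size - overlap),
                      min(K, (i + 1) * size + overlap)) for i in range(P)]

def RO(u, idx):
    """Restriction: keep only the values on the subdomain."""
    return u[idx]

def EO(u_loc, idx, K):
    """Extension: local values in place, zero elsewhere."""
    u = np.zeros(K)
    u[idx] = u_loc
    return u

K, P, overlap = 10, 2, 1
subs = make_subdomains(K, P, overlap)
u = np.arange(K, dtype=float)
# extensions of the restrictions agree with u on every subdomain, and in
# particular coincide on the overlap (maximum-reduce recovers u here
# because u is nonnegative)
recomposed = np.maximum.reduce([EO(RO(u, s), s, K) for s in subs])
```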
Definition 7
(Restriction of the Covariance Matrix) Let \(\mathbf {C}(\mathbf {w})\in \Re ^{K\times K}\) be the covariance matrix of a random vector \(\mathbf {w}=(w_1, w_2, \ldots ,w_{K}) \in \Re ^{K}\), i.e. the coefficient \(c_{i,j}\) of \(\mathbf {C}\) is \(c_{i,j}=\sigma _{ij} \equiv Cov(w_i,w_j)\). Let \(s<K\); we define the Restriction Operator \(RO_{st}\) onto \(\mathbf {C}(\mathbf {w})\) as follows:
i.e., it is the covariance matrix defined on \(\mathbf {w^{RO_{st}}}\).
Hereafter, we refer to \(\mathbf {C}(\mathbf {w^{RO_s}})\) using the notation \(\mathbf {C_{st}}\).
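The restriction of a covariance matrix thus amounts to extracting the principal submatrix of the retained components, which is exactly the covariance matrix of the restricted vector. A quick numerical check, where the sample size and the retained indices are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
W = rng.standard_normal((100, K))           # 100 samples of a K-variate vector
C = np.cov(W, rowvar=False)                 # full K x K covariance matrix

idx = np.array([0, 2, 4])                   # components kept by the restriction
C_restricted = C[np.ix_(idx, idx)]          # principal submatrix
C_direct = np.cov(W[:, idx], rowvar=False)  # covariance of the restricted vector
```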
Definition 8
(Restriction of the operator \(\mathbf {H}^{(k)}\)) We define the Restriction Operator \(RO_{ji}\) of \(\mathbf {H}^{(k)}\) in \(DD(\varDelta \times \varOmega )\) as in (11) as the TLM at time \(t_k\) of the restriction of \(\mathcal {H}\) on \(\varDelta _j \times \varOmega _i\).
Definition 9
(Restriction of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}} \)) We let \(\mathbf {M}^{\varDelta _j \times \varOmega _{i}}\) be the Restriction Operator \(RO_{ji}\) of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}} \) in \(\varDelta _j \times \varOmega _i\) where:
defined in \(\varDelta _j \times \varOmega _i\).
Definition 10
(Restriction of the operator \(\mathbf {M}^{0,M-1}\)) We define \(\mathbf {M}^{j,j+1}_i \) to be the Restriction Operator \(RO_{ji}\) of \(\mathbf {M}^{0,M-1}\) in \(DD(\varDelta \times \varOmega )\), as in (11). It is the TLM of the restriction of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}}\) on \(\varDelta _j \times \varOmega _i\).
Finally, we are now able to give the following definition.
Definition 11
(Restriction of 4DVar DA) Let
denote the Restriction Operator of the 4DVar DA functional in (6). It is defined as
The local 4DVar DA functional \( J_{ji}(\mathbf {u}_{{ji}})\) in (16) becomes:
This means that the approach we are following is to first decompose the 4DVar functional J and then locally linearize and solve each local functional \(J_{ji}\). For simplicity of notation, we let
We observe that \(RO_{ji}[J](\mathbf {u}_{{ji}})\) consists of a first term, which quantifies the departure of the state \(\mathbf {u}_{{ji}}\) from the background state \(\mathbf {u}^b_{{ji}}\) at time \(t_j\) and space \(x_i\), and a second term, which measures the mismatch between the state \(\mathbf {u}_{{ji}}\) and the observation \(\mathbf {v}_{{ji}}\).
Definition 12
(Extension of 4DVar DA) Given \(DD(\varDelta \times \varOmega )\) as in (11) let
be the Extension Operator (EO) of the 4DVar functional defined in (6), where
From (19), the following decomposition of J follows.
The main outcome of (19) is the possibility of defining local 4DVar problems which all together contribute to the global 4DVar problem, as detailed in the following.
3.2 Local 4DVar DA Problem: The Local RNLLS Problem
Starting from the local 4DVar functional in (17), obtained by applying the Restriction Operator to the 4DVar functional defined in (6), we add a local constraint to this restriction. This is a sort of regularization of the local 4DVar functional, introduced in order to enforce the continuity of each local solution on the overlap region between adjacent subdomains. The local constraint consists of the overlapping operator \(\mathcal {O}_{(jh)(ik)}\), defined as
where the symbol \(\circ \) denotes operator composition. The two operators in (20) handle the overlapping of the solution in the spatial dimension and in the temporal dimension, respectively. More precisely, for \(j = 1\dots Q; \, i=1 \ldots P\), the operator \(\mathcal {O}_{(jh)(ik)}\) represents the overlap of the temporal subdomains j and h and of the spatial subdomains i and k, where h and k are given as in Definition 4 and
and
Remark 1
We observe that, in the overlapping domain \(\varDelta _{jh} \times \varOmega _{ik}\), we get two vectors: \(\mathbf {u}_{(jh)(ik)}\), obtained as the restriction of \(\mathbf {u}_{(ji)}= \arg \min J_{ji}(\mathbf {u}_{ji})\) to that region, and \(\mathbf {u}_{(hj)(ki)}\), the restriction of \(\mathbf {u}_{(hk)}= \arg \min J_{hk}(\mathbf {u}_{hk})\) to the same region. The order of the indexes plays a significant role from the computational perspective.
Three cases derive from (20):

1.
decomposition in space only, i.e. \(Q=1\) and \(P>1\): the time interval is not decomposed, while the spatial domain \(\varOmega \) is decomposed according to the domain decomposition in (11). The overlapping operator is defined as in (21). In particular we assume that
$$\begin{aligned} \mathcal {O}_{ik}(\mathbf {u}_{ji}):=\Vert \underbrace{RO_{ji}(\mathbf {u}_{jk})}_{\mathbf {u}_{j(ki)}} - \underbrace{RO_{jk}(\mathbf {u}_{ji})}_{\mathbf {u}_{(j)(ik)}}\Vert _{(\mathbf {B}^{-1})_{ik}} \end{aligned}$$ 
2.
decomposition in time only, i.e. \(Q>1\) and \(P=1\): the spatial domain is not decomposed, while the time interval is decomposed according to the domain decomposition in (11). The overlapping operator is defined as in (22). In particular we assume that
$$\begin{aligned} \mathcal {O}_{jh}(\mathbf {u}_{ji}):=\Vert \underbrace{RO_{ji}(\mathbf {u}_{hi})}_{\mathbf {u}_{(hj)i}} - \underbrace{RO_{hi}(\mathbf {u}_{ji})}_{\mathbf {u}_{(jh)i}}\Vert _{(\mathbf {B}^{-1})_{jh}} \end{aligned}$$ 
3.
decomposition in space-time, i.e. \(Q>1\) and \(P>1\): both the time interval and the spatial domain are decomposed according to the domain decomposition in (11). The overlapping operator is defined as in (20). In particular we assume that
$$\begin{aligned} \mathcal {O}_{(jh)(ik)}(\mathbf {u}_{ji}):=\Vert \mathbf {u}_{(hj)(ki)} - \underbrace{RO_{hi}(RO_{jk}(\mathbf {u}_{ji}))}_{\mathbf {u}_{(jh)(ik)}}\Vert _{(\mathbf {B}^{-1})_{(jh)(ik)}} \end{aligned}$$
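In all three cases the overlapping operator measures, in a weighted norm, the discrepancy between two local solutions on the region they share. A minimal sketch of the purely spatial case, with index sets standing for the subdomains and an optional diagonal weight standing for the restricted \(\mathbf {B}^{-1}\) (all names are illustrative, not from the paper):

```python
import numpy as np

def overlap_penalty(u_i, idx_i, u_k, idx_k, w=None):
    """Weighted mismatch of two local solutions on their shared region.

    u_i, u_k : local solutions on subdomains with sorted global indices
               idx_i, idx_k
    w        : optional diagonal of the restricted inverse covariance
    """
    shared = np.intersect1d(idx_i, idx_k)        # the overlap region
    pos_i = np.searchsorted(idx_i, shared)       # shared points inside u_i
    pos_k = np.searchsorted(idx_k, shared)       # shared points inside u_k
    diff = u_i[pos_i] - u_k[pos_k]
    if w is None:
        w = np.ones_like(diff)
    return float(np.sqrt(np.sum(w * diff ** 2)))

# two local solutions that agree with the same global field u(x) = x
idx_i, idx_k = np.arange(0, 6), np.arange(4, 10)
u_i = idx_i.astype(float)
u_k = idx_k.astype(float)
zero_mismatch = overlap_penalty(u_i, idx_i, u_k, idx_k)
```

A nonzero value signals that the two local minimisers disagree on the overlap, which is exactly what the \(\mu _{ji}\)-weighted term of the local functional penalises.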
We can now give the definition of the local 4DVar DA functional.
Definition 13
(Local 4DVar DA) Given \(DD(\varDelta \times \varOmega )\) as in (11), let:
where \(RO_{ji}[J](\mathbf {u}_{{ji}})\) is given in (16) and \(\mathcal {O}_{(jh)(ik)}\) is suitably defined on \(\varDelta _{jh}\times \varOmega _{ik}\), be the local 4DVar functional; \(\mu _{ji}\) is a regularization parameter. Finally, let
be the global minimum of \({J}_{ji}\) in \(\varDelta _j \times \varOmega _i\).
More precisely, the local 4DVar DA functional \( J_{ji}(\mathbf {u}_{{ji}})\) in (23) becomes:
where the three terms contributing to the definition of the local DA functional clearly appear. We note that in (17) the operator \(\mathbf {M}^{k,k+1}_i\), defined in (4), replaces \(\mathcal {M}^{\varDelta _j \times \varOmega _i}\).
Finally, we have to guarantee that the global minimum of the functional J can be sought among the global minima of the local functionals.
3.3 Local 4DVar DA Minimization
Let
where \({\mathbf {u}}_{ji}^{DA}\) is defined in (24), be (the extension of) the minimum of the (global) minima of the local functionals \({J}_{ji}\) as in (24). Let
be its minimum.
Theorem 1
Let \( DD( \varDelta \times \varOmega )\) be a decomposition of \( \varDelta \times \varOmega \) as defined in (11). It follows that:
with \( \mathbf {u}^{DA}\) defined in (5). Moreover, the equality in (28) holds if J is convex.
Proof
Let \({\mathbf {u}}_{ji}^{DA}\) be defined in (24); it is
From (29) it follows
which, from (19), gives:
then \(({\mathbf {u}}_{ji}^{DA})^{EO}\) is a stationary point for J in \(\Re ^{M \times K}\). As \( \mathbf {u}^{DA}\) in (5) is the global minimum of J in \(\Re ^K\), it follows that:
then, from (27), it follows that
Now we prove that if J is convex, then
by contradiction. Assume that
In particular,
This means that
From (35) and (27), it is:
then, from (14):
Relation (36) is a contradiction, as \({\mathbf {u}}_{ji}^{DA}\) is the global minimum of \(J_{ji}\). So (28) is proved.
\(\square \)
4 The Space-Time RNLLS Parallel Algorithm
We introduce the algorithm solving the RNLLS problem by using the space-time decomposition, i.e. solving the \(Q\times P\) local problems in \(\varDelta _j \times \varOmega _i\), where \(j=1,Q\) and \(i=1,P\) (see Fig. 1 for an example of domain decomposition with \(Q=4\) and \(P=2\)).
Definition 14
(DD-RNLLS Algorithm) Let \(\mathcal {A}^{loc}_{RNLLS}(\varDelta _j \times \varOmega _i)\) denote the algorithm solving the local 4DVar DA problem defined in \(\varDelta _j \times \varOmega _i\). The space-time DD-RNLLS parallel algorithm solving the RNLLS problem in \(DD(\varDelta \times \varOmega )\) is symbolically denoted as
and it is defined as the merging of the \(QP=Q\times P\) local algorithms \(\mathcal {A}^{loc}_{RNLLS}(\varDelta _j \times \varOmega _i)\), i.e.:
The DD-RNLLS algorithm can be sketched as described by Algorithm 1. Similarly, the local RNLLS algorithm \(\mathcal {A}^{loc}_{RNLLS}\) is described by Algorithm 2.
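To illustrate the communication pattern only (not Algorithm 1 itself), the sketch below iterates local solves that run concurrently and exchange values on the overlap; the closed-form local minimiser of a quadratic surrogate, the two 1-D subdomains and the thread pool standing in for distributed processes are all our own simplifications:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def local_solver(b_loc, neighbor_trace, trace_pos, mu=1.0):
    """Closed-form minimiser of the toy local functional
    J_loc(u) = ||u - b_loc||^2 + mu * ||u[trace_pos] - neighbor_trace||^2."""
    u = b_loc.copy()
    u[trace_pos] = (b_loc[trace_pos] + mu * neighbor_trace) / (1.0 + mu)
    return u

K = 8
idx = [np.arange(0, 5), np.arange(3, 8)]       # two subdomains, overlap {3, 4}
b = np.arange(K, dtype=float)                  # stands for the background data
u = [b[s].copy() for s in idx]                 # local first guesses

shared = np.intersect1d(idx[0], idx[1])
pos = [np.searchsorted(idx[0], shared), np.searchsorted(idx[1], shared)]

for outer_it in range(10):                     # outer DD iterations
    traces = [u[1][pos[1]], u[0][pos[0]]]      # exchange overlap values only
    with ThreadPoolExecutor() as ex:           # local solves run concurrently
        u = list(ex.map(local_solver, [b[s] for s in idx], traces, pos))
```

The key point mirrored here is that each local problem is solved independently at every outer iteration, with communication limited to the overlap regions.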
Remark 2
We observe that the \(\mathcal {A}^{DD}_{RNLLS}(\varDelta _M \times \varOmega _{K})\) algorithm is based on two main steps: the domain decomposition step (see line 1) and the model linearization step (see line 6). This means that the algorithm uses a convex approximation of the objective DA functional, so that Theorem 1 holds.
The common approach for solving RNLLS problems consists in defining a sequence of local approximations of \(\mathbf {J}_{ij}\), where each member of the sequence is minimized by employing Newton's method or one of its variants (such as Gauss–Newton, L-BFGS, Levenberg–Marquardt). Approximations of \(\mathbf {J}_{ij}\) are obtained by expanding \(\mathbf {J}_{ij}\) in a truncated Taylor series, while the minimum is obtained by using second-order sufficient conditions [13, 44]. Let us consider Algorithm 3, solving the RNLLS problem on \(\varDelta _j \times \varOmega _i\).
The main computational task occurs at step 5 of Algorithm 3, namely the minimization of \(\tilde{\mathbf {J}}_{ji}\), the local approximation of \(\mathbf {J}_{ij}\). Two approaches could be employed in Algorithm 3:

(a)
by truncating the Taylor series expansion of \(\mathbf {J}_{ij}\) at the second order, we get
$$\begin{aligned} \mathbf {J}_{ij}^{QD}(\mathbf {{u}}_{ji}^{l+1})=\mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})+\nabla \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})^T \delta \mathbf {{u}}_{ji}^l+\frac{1}{2}\left( \delta \mathbf {{u}}_{ji}^l\right) ^T \nabla ^2 \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l}) \delta \mathbf {{u}}_{ji}^l \end{aligned}$$ (38)

giving a quadratic approximation of \(\mathbf {J}_{ji}\) at \(\mathbf {u}_{ji}^l\). Newton-type methods (including L-BFGS and Levenberg–Marquardt) use \(\tilde{\mathbf {J}}_{{ji}}=\mathbf {J}_{ij}^{QD}\).

(b)
by truncating the Taylor series expansion of \(\mathbf {J}_{ij}\) at the first order, we get the following linear approximation of \(\mathbf {J}_{ij}\) at \(\mathbf {{u}}_{ji}^{l}\):
$$\begin{aligned} \mathbf {J}_{ij}^{TL}(\mathbf {{u}}_{ji}^{l+1})=\mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})+\nabla \mathbf {J}_{ij}(\mathbf {{u}}_{ji}^{l})^T \delta \mathbf {{u}}_{ji}^l=\frac{1}{2} \Vert \nabla \mathbf {F}_{ji}(\mathbf {{u}}_{ji}^{l})\delta \mathbf {{u}}_{ji}^l+\mathbf {F}_{ji}(\mathbf {{u}}_{ji}^{l})\Vert _2^2 \end{aligned}$$ (39)

where we let^{Footnote 7} \(\mathbf {J}_{ij}:= \Vert \mathbf {F}_{ji}\Vert _2^2\), which gives a linear approximation of \(\mathbf {J}_{ji}\) at \(\mathbf {u}_{ji}^l\). Gauss–Newton methods (including the Truncated or Approximated Gauss–Newton [21]) use \(\tilde{\mathbf {J}}_{ji}=\mathbf {J}^{TL}_{ji}\).
Observe that from (38) it follows
Algorithm 3 becomes Algorithm 4, described below.

(a)
\(\mathcal {A}_{QN}^{loc}\): computes a local minimum of \(\mathbf {J}^{QN}_{ji}\) following Newton's descent direction. The minimum is computed by solving the linear system involving the Hessian matrix \(\nabla ^2\mathbf {{J}}_{ij}\) and the negative gradient \(-\nabla \mathbf {{J}}_{ij}\) at \(\mathbf {{u}}_{ji}^l\), for each value of l (see Algorithm 5, described below);

(b)
\(\mathcal {A}_{LLS}^{loc}\): computes a local minimum of \(\mathbf {J}^{TL}_{ji}\) following the steepest descent direction. The minimum is computed by solving the normal equations arising from the local Linear Least Squares (LLS) problem (see Algorithm 6, described below).
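As an executable illustration of approach (b), the sketch below runs the Gauss–Newton iteration, at each step solving the normal equations of the linearised residual; the two-variable residual \(\mathbf {F}\) is a toy of our own choosing, not the DA functional:

```python
import numpy as np

def gauss_newton(F, Jac, u0, iters=20):
    """Gauss-Newton for min ||F(u)||_2^2: at each step solve the normal
    equations of the linearised problem ||Jac(u) du + F(u)||_2^2."""
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(iters):
        f, J = F(u), Jac(u)
        du = np.linalg.solve(J.T @ J, -J.T @ f)   # normal equations
        u = u + du                                 # incremental update
    return u

# toy residual with zero at (sqrt(2), 1)
F = lambda u: np.array([u[0] ** 2 - 2.0, u[1] - 1.0])
Jac = lambda u: np.array([[2.0 * u[0], 0.0], [0.0, 1.0]])
u_star = gauss_newton(F, Jac, [1.0, 0.0])
```

Replacing `J.T @ J` with the full Hessian (i.e. keeping the second-order term) would give the Newton variant of approach (a).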
Remark 3
We observe that if, in the \(\mathcal {A}^{loc}_{QN}\) algorithm, the matrix \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\) (see line 6 of Algorithm 5) is neglected, we get the Gauss–Newton method described by the \(\mathcal {A}^{loc}_{LLS}\) algorithm. More generally, for the term \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\):

1.
in the case of Gauss–Newton, \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\) is neglected;

2.
in the case of Levenberg–Marquardt, \(\mathbf {Q}(\mathbf {{u}}_{ij}^l)\) equals \(\lambda I\), where the damping term \(\lambda >0\) is updated at each iteration and I is the identity matrix [26, 30];

3.
in the case of L-BFGS, the Hessian matrix is updated by a rank-one correction at every iteration [45].
According to the most common implementations of the 4DVar DA [15, 50], we focus on the Gauss–Newton (GN) method described by \(\mathcal {A}^{loc}_{LLS}\) in Algorithm 6.
For each l, let \(\mathbf {G}_{ji}^l = RO_{ji}[\mathbf {G}^l]\), where \(\mathbf {G}^l \in \Re ^{( M\times nobs)\times ( NP \times M)}\), be the block diagonal matrix such that
while \((\mathbf {G}_{ji}^T)^l=RO_{ji}[(\mathbf {G}^T)^l]\) is the restriction of the transpose of \(\mathbf {G}^l\), and
are the TLMs of \(\mathbf {M}_{k,k+1}\), for \(k=0,M-1\), around \( \mathbf {u}^l_{ji}\), respectively. Finally, let
be the restriction of the misfit vector. In line 7 of Algorithm 6, it is
and,
The most popular 4DVar DA software implements the so-called \(\mathbf {B}\)-preconditioned Krylov subspace iterative method [21, 23, 50], arising from the use of the background error covariance matrix as a preconditioner of a Krylov subspace iterative method.
Let \(\mathbf{B}_{ji}=\mathbf {V}_{ji}\mathbf {V}_{ji}^T\) be expressed in terms of the deviance matrix \(\mathbf {V}_{ji}\), and let \(\mathbf {w}_{ji}\) be such that
with \(\mathbf {V}_{ji}^+\) the generalized inverse of \(\mathbf {V}_{ji}\), (42) becomes
while (43) becomes
The normal equation system (see line 7 of \(\mathcal {A}^{loc}_{LLS}\)), i.e. the linear system
becomes
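The effect of this \(\mathbf {B}\)-preconditioning can be checked numerically: substituting \(\delta \mathbf {u}=\mathbf {V}\mathbf {w}\) into the normal equations and multiplying on the left by \(\mathbf {V}^T\) turns the background term into the identity. In the sketch below, the square invertible \(\mathbf {V}\), the operator \(\mathbf {G}\) and the misfit \(\mathbf {d}\) are random illustrative choices of our own (so that \(\mathbf {B}=\mathbf {V}\mathbf {V}^T\) is symmetric positive definite):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4
V = rng.standard_normal((n, n))      # deviance matrix, B = V V^T
G = rng.standard_normal((m, n))      # linearised observation operator
R_inv = np.eye(m)                    # inverse observation-error covariance
d = rng.standard_normal(m)           # misfit vector

# original normal equations: (B^{-1} + G^T R^{-1} G) du = G^T R^{-1} d
B = V @ V.T
A = np.linalg.inv(B) + G.T @ R_inv @ G
du_direct = np.linalg.solve(A, G.T @ R_inv @ d)

# B-preconditioned form in the variable w, with du = V w:
# (I + V^T G^T R^{-1} G V) w = V^T G^T R^{-1} d
A_w = np.eye(n) + V.T @ G.T @ R_inv @ G @ V
w = np.linalg.solve(A_w, V.T @ G.T @ R_inv @ d)
du_precond = V @ w
```

In practice the preconditioned system is solved by a Krylov iteration rather than a direct solve; here a direct solve suffices to verify the equivalence of the two formulations.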
Definition 15
Let \(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i)\) denote the algorithm solving the local 4DVar DA problem defined in \(\varDelta _j \times \varOmega _i\). The space-time 4DVar DA parallel algorithm solving the 4DVar DA problem in \(DD(\varDelta _{M} \times \varOmega _{K})\) is symbolically denoted as \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\), and it is defined as the union of the \(Q\times P\) local algorithms \(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i)\), i.e.:
Algorithm \( \mathcal {A}^{loc}_{4DVar}\) is algorithm \( \mathcal {A}^{loc}_{LLS}\) (see Algorithm 6) specialized for the 4DVar DA problem; it is described by Algorithms 7 and 8 below [23].
In the next section we will show that this formulation leads to local numerical solutions convergent to the numerical solution of the global problem.
5 Convergence Analysis
In the following we set \(\Vert \cdot \Vert =\Vert \cdot \Vert _{\infty }\).
Proposition 1
Let \(u_{j,i}^{ASM,r}\) be the approximation of the increment \(\delta \mathbf {{u}}_{ji}\) to the solution \(\mathbf {{u}}_{ji}\) obtained at step r of the ASM-based inner loop on \(\varOmega _{j}\times \varDelta _{i}\). Let \(u_{j,i}^{n}\) be the approximation of \(\mathbf {u}_{j,i}\) obtained at step n of the outer loop, i.e. of the space-time decomposition approach, on \(\varOmega _{j}\times \varDelta _{i}\). Let us assume that the numerical scheme discretizing the model \(\mathbf {M}_i^{j,j+1}\) is convergent. Then, for i and j fixed, it holds that:
Proof
Let \(u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}\) be the numerical solution of \(\mathbf {M}_i^{j,j+1}\) at step \(n+1\); taking into account that, according to the incremental update of the solution of the 4DVar DA functional (see, for instance, line 10 of Algorithm 7), the approximation \(\mathbf {u}_{j,i}^n\) is computed as
then, it is
from the hypothesis, we have
and (49) can be rewritten as follows
Convergence of ASM is proved in [5]; similarly, applying ASM to the 4DVar DA problem, it holds that
and for \(n>M^{1}(\epsilon ^{ASM})\), we get
Hence, setting \(\epsilon :={\epsilon ^{ASM}}+\epsilon ^{\mathbf {M}_i^{j,j+1}}\) and \(M(\epsilon ):= \max \{M^{1}(\epsilon ^{ASM}),M^{2}(\epsilon ^{\mathbf {M}_i^{j,j+1}})\}\), we get the thesis in (52). \(\square \)
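Since the intermediate estimates (49)–(52) are displayed as equations not reproduced here, the backbone of the argument above can be summarized, in the notation of the proposition, as the triangle-inequality splitting

```latex
$$\begin{aligned}
\Vert u_{j,i}^{n}-\mathbf {u}_{j,i}\Vert
\le \Vert u_{j,i}^{n}-u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}\Vert
+ \Vert u_{j,i}^{\mathbf {M}_i^{j,j+1},n+1}-\mathbf {u}_{j,i}\Vert
\le \epsilon ^{ASM}+\epsilon ^{\mathbf {M}_i^{j,j+1}}=:\epsilon ,
\end{aligned}$$
```

for every \(n>M(\epsilon )\), where the first term is controlled by the convergence of ASM [5] and the second by the convergence of the discretization scheme.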
6 Performance Analysis
The performance metrics we employ are time complexity and scalability. Our aim is to highlight the benefits arising from using the decomposition approach instead of solving the problem on the whole domain. As we shall discuss later, the performance gain obtained from the space and time decomposition approach is twofold:

1.
Instead of solving one large problem, we solve several smaller problems that are better conditioned than the original one. This leads to a reduction of each local algorithm's time complexity.

2.
Subproblems reproduce the whole problem at smaller dimensions and are solved in parallel, which reduces the software execution time.
We give the following
Definition 16
A uniform bidirectional decomposition of the space and time domain \(\varDelta _M \times \varOmega _{K}\) is such that, if we let
be the size of the whole domain, then each subdomain \(\varDelta _j \times \varOmega _i\) is such that
where \(D_t=\frac{M}{q}\ge 1\), and \(D_s=\frac{K}{p}\ge 1\).
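The subdomain sizes of Definition 16 follow directly from \(D_t = M/q\) and \(D_s = K/p\); a tiny sketch (function and argument names are ours, not from the paper's software):

```python
# Sizes of a uniform decomposition of the M x K space-time domain into
# q temporal and p spatial subdomains (Definition 16).
def subdomain_size(M, K, q, p):
    assert M % q == 0 and K % p == 0, "uniform decomposition assumed"
    D_t, D_s = M // q, K // p          # D_t = M/q >= 1, D_s = K/p >= 1
    return D_t, D_s

print(subdomain_size(M=64, K=1024, q=4, p=16))   # -> (16, 64)
```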
In the following we let
Let \(T(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K}))\) denote the time complexity of \(\mathcal {A}^{DD}_{4DVar}(\varDelta _{M} \times \varOmega _{K})\).
We now provide an estimate of the time complexity of each local algorithm, denoted as \(T(\mathcal {A}^{Loc}_{4DVar}(\varDelta _j \times \varOmega _i))\). This algorithm consists of two loops: the outer loop, over the index l, computing local approximations of \(\mathbf {J}_{ji}\), and the inner loop, over the index m, performing Newton's or Lanczos' steps. The major computational task to be performed at each step of the outer loop is the computation of \(\mathbf {J}_{ji}\). The major computational tasks to be performed at each step of the inner loop, in the case of the GN method (see algorithm \(\mathcal {A}^{Loc}_{4DVar}\)), involving the predictive model are:

1.
the computation of the tangent linear model \(RO_{ji}[\mathbf {M}_{k,k+1}]\) (whose time complexity scales as the square of the problem size),

2.
the computation of the adjoint model \(RO_{ji}[\mathbf {M}_{k,k+1}^T]\) (which is at least 4 times more expensive than the computation of \(RO_{ji}[\mathbf {M}_{k,k+1}]\)),

3.
the solution of the normal equations, involving at each iteration two matrix-vector products with \(RO_{ji}[\mathbf {M}_{k,k+1}^T]\) and \(RO_{ji}[\mathbf {M}_{k,k+1}]\) (whose time complexity scales as the square of the problem size).
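The per-step cost breakdown above can be turned into a back-of-the-envelope model. The constants and names below are illustrative assumptions (a quadratic cost for the TLM, an adjoint at least 4 times more expensive, two matrix-vector products per normal-equation iteration), not measurements from the paper:

```python
# Rough cost model for one inner-loop (Gauss-Newton) step on a subdomain
# of size n_loc, under the assumptions listed in the text.
def gn_step_cost(n_loc, n_cg_iters, c=1.0):
    tlm = c * n_loc**2                         # tangent linear model: ~ c*N^2
    adj = 4.0 * tlm                            # adjoint: at least 4x the TLM
    normal_eq = n_cg_iters * 2 * c * n_loc**2  # two matvecs per iteration
    return tlm + adj + normal_eq

# Halving the subdomain size quarters every quadratic term:
print(gn_step_cost(1000, 10) / gn_step_cost(500, 10))   # -> 4.0
```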
As the most time-consuming operation involving the predictive model is the computation of the tangent linear model, we prove that
Proposition 2
Let
be the polynomial of degree \(d=2\) denoting the time complexity of the tangent linear model \(RO_{ji}[\mathbf {M}_{s,s+1}]\). Let \(m_{ji}\) and \(l_{ji}\) be the number of steps of the outer and inner loops of \(\mathcal {A}^{Loc}_{4DVar}\), respectively. We get
Proof
It is:
Let
Observe that \(m_{max}\) and \(l_{max}\) actually are the number of steps of the outer and inner loops of \(\mathcal {A}^{DD}(\varDelta _M \times \varOmega _{K})\), respectively. Let \(m_G\) and \(l_G\) denote the number of iterations of the outer and inner loops of the \(\mathcal {A}^{G}(\varDelta _M \times \varOmega _{K})\) algorithm; we give the following:
Definition 17
Let
denote the total number of iterations of \(\mathcal {A}^{G}_{4DVar}(\varDelta _M \times \varOmega _{K})\), of \(\mathcal {A}^{Loc}_{4DVar}(\varDelta _j \times \varOmega _{i})\) and of \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\), respectively.
If we denote by \(\mu (\mathbf {J})\) the condition number of the DA operator, as it holds that [3]
then it is
and
This result says that the number of iterations of \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\) algorithm is always smaller than the number of iterations of \(\mathcal {A}^{G}_{4DVar}(\varDelta _M \times \varOmega _{K})\) algorithm. This is one of the benefits of using the space and time decomposition.
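As an illustration of why the local problems need fewer iterations (this is not a reproduction of the argument in [3]), one can check numerically that a principal submatrix of a symmetric positive definite operator, which here plays the role of a local restriction, has a condition number no larger than the global one, by the Cauchy interlacing theorem:

```python
import numpy as np

rng = np.random.default_rng(1)

# SPD "global" Hessian-like operator and its restriction to a subdomain.
A = rng.standard_normal((40, 40))
J = A @ A.T + np.eye(40)             # symmetric positive definite
J_loc = J[:10, :10]                  # principal submatrix: "local" operator

# Cauchy interlacing: lambda_max shrinks and lambda_min grows under
# restriction, so the local condition number cannot exceed the global one.
mu = np.linalg.cond(J)
mu_loc = np.linalg.cond(J_loc)
print(mu_loc <= mu)                  # -> True
```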
Algorithm scalability is measured in terms of strong scaling (the measure of the algorithm's capability to exploit the performance of high-performance computing architectures in order to minimise the time to solution for a problem of fixed dimension) and of weak scaling (the measure of the algorithm's capability to use additional computational resources effectively to solve increasingly larger problems). A variety of metrics have been developed to assist in evaluating the scalability of a parallel algorithm; speed-up, model throughput, scale-up and efficiency are the most used. Each one highlights specific needs and limits to be addressed by the parallel algorithm. In our case, as we intend to focus mainly on the benefits arising from the use of hybrid computing architectures, we consider the so-called scale-up factor, first introduced in [7].
The first result derives straightforwardly from the definition of the scale-up factor:
Proposition 3
(DD-4DVar scale-up factor) The (relative) scale-up factor of \(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K})\) related to \(\mathcal {A}^{loc}_{4DVar}(\varDelta _j \times \varOmega _i)\), denoted as \(Sc_{QP}(\mathcal {A}^{DD}_{4DVar}(\varDelta _M \times \varOmega _{K}))\), is:
where \(QP:=q \times p\) is the number of subdomains. It is:
where
and
Corollary 1
If \( a_i= 0 \quad \forall i\in [0,d-1]\), then \(\beta =1\), i.e.
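A small sketch of the scale-up factor \(Sc_{QP} = T(N)/(QP\,T(N/QP))\) for a polynomial time complexity, with illustrative coefficients of our choosing, reproduces the behaviour stated in Proposition 3 and Corollary 1:

```python
# Scale-up factor for a polynomial complexity T(N) = sum_i a_i * N^i.
def T(N, coeffs):                     # coeffs = [a_0, a_1, ..., a_d]
    return sum(a * N**i for i, a in enumerate(coeffs))

def scale_up(N, QP, coeffs):
    return T(N, coeffs) / (QP * T(N / QP, coeffs))

# Corollary 1: keeping only the leading term (a_i = 0 for i < d), with
# d = 2 the scale-up factor equals QP^(d-1) = QP exactly.
print(scale_up(N=10_000, QP=16, coeffs=[0, 0, 3.0]))    # -> 16.0

# With nonzero lower-order terms the factor stays below QP.
print(scale_up(N=10_000, QP=16, coeffs=[5.0, 2.0, 3.0]) < 16.0)   # -> True
```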
Finally
Corollary 2
If \(N_{loc}\) is fixed, it is
while, if QP is fixed
From (55) it results that, considering one iteration of the whole parallel algorithm, the growth of the scale-up factor is essentially one order less than the time complexity of the reduced model. In other words, the time complexity of the reduced model mostly impacts the scalability of the parallel algorithm. In particular, as the parameter d equals 2, it follows that the asymptotic scaling factor of the parallel algorithm, with respect to QP, is bounded above by two.
Besides the time complexity, scalability is also affected by the communication overhead of the parallel algorithm. The surface-to-volume ratio is a measure of the amount of data exchange (proportional to the surface area of the domain) per unit operation (proportional to the volume of the domain). We prove that
Theorem 2
The surface-to-volume ratio of a uniform bidimensional decomposition of the space-time domain \(\varDelta _M \times \varOmega _{K}\) is
Let \(\mathcal {S}\) denote the surface of each subdomain, then
and let \(\mathcal {V}\) denote its volume; then
It holds that
and (56) follows.
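The scaling behaviour behind Theorem 2 can be sketched for a rectangular \(D_t \times D_s\) subdomain; the exact constant in (56) is not reproduced here, and the function names are ours:

```python
# Surface-to-volume ratio of one D_t x D_s rectangular subdomain of the
# M x K space-time domain: perimeter (data exchanged) over area (work done).
def surface_to_volume(M, K, q, p):
    D_t, D_s = M / q, K / p
    surface = 2 * (D_t + D_s)
    volume = D_t * D_s
    return surface / volume

# Strong scaling: with M x K fixed, more subdomains means a larger ratio,
# i.e. relatively more communication overhead.
print(surface_to_volume(64, 1024, 2, 2) < surface_to_volume(64, 1024, 8, 8))  # -> True
```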
Definition 18
(Measured software scale-up) Let
be the measured software scale-up in going from 1 to QP.
Proposition 4
If
then, it holds that
with
If
then the thesis in (58) follows from (59).
In the following we will denote the measured scale-up as \(Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar})\) or as \(Sc_{1,QP}^{meas}(N)\), respectively.
Finally, the last proposition allows us to examine the benefit on the measured scale-up arising from the speed-up of the local parallel algorithm, mainly in the presence of a multilevel decomposition, where \(s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) >1\).
Proposition 5
It holds that
Proof

if \(s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) =1\) then
$$\begin{aligned} \alpha (N,QP)<1 \Leftrightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) < Sc_{1,QP}(\mathcal {A}^{DD}_{4DVar}) \end{aligned}$$ 
if \(s_{nproc}^{loc}(\mathcal {A}^{loc}_{4DVar}) >1\) then
$$\begin{aligned} \alpha (N,QP)>1 \Leftrightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) > Sc^f_{1,QP}(\mathcal {A}^{DD}_{4DVar}) ; \end{aligned}$$ 
if \(s_{nproc}^{loc} (\mathcal {A}^{loc}_{4DVar})=QP\) then
$$\begin{aligned} 1< \alpha (N,QP)<QP \Rightarrow Sc_{1,QP}^{meas}(\mathcal {A}^{DD}_{4DVar}) < QP \cdot Sc^f_{1,QP}(\mathcal {A}^{DD}_{4DVar}) ; \end{aligned}$$
\(\square \)
We may conclude that

1.
strong scaling: if QP increases and \(M\times K\) is fixed, the scale-up factor increases, but the surface-to-volume ratio increases too.

2.
weak scaling: if QP is fixed and \(M \times K\) increases, the scale-up factor stagnates and the surface-to-volume ratio decreases.
This means that one needs to find the appropriate number of subdomains QP, giving the right trade-off between the scale-up and the overhead of the algorithm.
7 Scalability Results
Results presented here are just a starting point towards the assessment of the software scalability. More precisely, we introduce simplifications and assumptions appropriate for a proof-of-concept study, in order to get values of the measured scale-up of one iteration of the parallel algorithm.
As the main outcome of the decomposition is that the parallel algorithm is oriented to better exploit the high performance of new architectures, where concurrency is implemented both at the coarsest and finest levels of granularity, such as a distributed-memory multiprocessor (MIMD) and Graphics Processing Units (GPUs) [56], we consider a distributed computing environment located on the University of Naples Federico II campus, connected by a local-area network and made of:

\(PE_1\) (for the coarsest level of granularity): a Multiple-Instruction, Multiple-Data (MIMD) architecture made of 8 nodes, consisting of distributed-memory DELL M600 blades connected by 10 Gigabit Ethernet technology. Each blade consists of 2 Intel Xeon@2.33 GHz quad-core processors sharing the same local 16 GB RAM memory, for a total of 8 cores per blade and of 64 total cores.

\(PE_2\) (for the finest level of granularity): a Kepler architecture GK110 GPU [46], which consists of a set of 13 programmable Single-Instruction, Multiple-Data (SIMD) Streaming Multiprocessors (SMXs), connected to a quad-core Intel i7 CPU running at 3.07 GHz with 12 GB of RAM. For host (CPU) to device (GPU) memory transfers, CUDA-enabled graphics cards are connected to the PC motherboard via a PCI-Express (PCIe) bus [48]. For this architecture the maximum number of active threads per multiprocessor is 2048, which means that the maximum number of active warps per SMX is 64.
Our implementation uses the matrix and vector functions of the Basic Linear Algebra Subroutines (BLAS) for \(PE_1\), and the CUDA Basic Linear Algebra Subroutines (CUBLAS) library for \(PE_2\). The routines used for computing the minimum of J on \(PE_1\) and \(PE_2\) are described in [28] and [10], respectively.
The case study is based on the Shallow Water Equations (SWEs) on the sphere. The SWEs have been used extensively as a simple model of the atmosphere or the ocean circulation, since they contain the essential wave propagation mechanisms found in general circulation models [52].
The SWEs in spherical coordinates are:
Here f is the Coriolis parameter given by \(f = 2\varOmega \sin {\theta }\), where \(\varOmega \) is the angular speed of the rotation of the Earth, h is the height of the homogeneous atmosphere (or of the free ocean surface), u and v are the zonal and meridional wind (or the ocean velocity) components, respectively, \(\theta \) and \(\lambda \) are the latitudinal and longitudinal directions, respectively, a is the radius of the Earth and g is the gravitational constant.
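The displayed system (60)–(62) is not reproduced above; for the reader's convenience, a standard advective form of the SWEs in spherical coordinates, consistent with the symbols just defined and with the Turkel–Zwas literature [37, 38], reads:

```latex
$$\begin{aligned}
\frac{\partial u}{\partial t}&= -\frac{u}{a\cos \theta }\frac{\partial u}{\partial \lambda }
 - \frac{v}{a}\frac{\partial u}{\partial \theta }
 + \left( f + \frac{u\tan \theta }{a}\right) v
 - \frac{g}{a\cos \theta }\frac{\partial h}{\partial \lambda },\\
\frac{\partial v}{\partial t}&= -\frac{u}{a\cos \theta }\frac{\partial v}{\partial \lambda }
 - \frac{v}{a}\frac{\partial v}{\partial \theta }
 - \left( f + \frac{u\tan \theta }{a}\right) u
 - \frac{g}{a}\frac{\partial h}{\partial \theta },\\
\frac{\partial h}{\partial t}&= -\frac{1}{a\cos \theta }
 \left[ \frac{\partial (hu)}{\partial \lambda }
 + \frac{\partial (hv\cos \theta )}{\partial \theta }\right] .
\end{aligned}$$
```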
We express the system of equations (60)–(62) using a compact form, i.e.:
where
and
We discretize (63) in space only, using an unstaggered Turkel–Zwas scheme [37, 38], and we obtain:
where
and
so that
The numerical model depends on a combination of physical parameters, including the number of state variables in the model and the number of observations in an assimilation cycle, as well as numerical parameters, such as the discretization steps in time and space, which are defined on the basis of the discretization grid of the data available at the Ocean Synthesis/Reanalysis Directory of Hamburg University (see [15]).
To begin our data assimilation, an initial value of the model state is created by choosing a snapshot from the run prior to the start of the assimilation experiment and treating it as a realization valid at the nominal time. Then, the model state is advanced to the next time using the forecast model, and the observations are combined with the forecasts (i.e., the background) to produce the analysis. This process is iterated. As it proceeds, the process fills gaps in sparsely observed regions, converts observations into improved estimates of model variables, and filters observation noise. All this is done in a manner that is physically consistent with the dynamics of the ocean as represented by the model. In our experiments, the simulated observations are created by sampling the model states and adding random errors to those values. A detailed description of the simulation, together with the results and the software implemented, is presented in [11]. In the following, we mainly focus on performance results.
The reference domain decomposition strategy uses the following correspondence between QP and nproc:
which means that the number of subdomains coincides with the number of available processors.
According to the characteristics of the physical domain in SWEs, the total number of grid points in space is
Let us assume that
while \(n_z=3\). Since the unknown vectors are the fluid height or depth, and the two-dimensional fluid velocity fields, the problem size in space is
We assume a 2D uniform domain decomposition along the latitude–longitude directions, such that
with
where \(p_1 \times p_2= p\).
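Such a 2D uniform latitude–longitude partition can be sketched as follows; function and variable names are ours for illustration, not from the paper's code:

```python
# Uniform partition of an n_x x n_y latitude-longitude grid among
# p = p1 * p2 processes; each process owns an nloc_x x nloc_y block.
def local_grid(n_x, n_y, p1, p2, rank):
    assert n_x % p1 == 0 and n_y % p2 == 0, "uniform decomposition assumed"
    nloc_x, nloc_y = n_x // p1, n_y // p2
    i, j = divmod(rank, p2)            # process coordinates in the p1 x p2 grid
    return (i * nloc_x, (i + 1) * nloc_x), (j * nloc_y, (j + 1) * nloc_y)

# e.g. a 128 x 256 grid on p = 2 x 4 = 8 processes: each owns 64 x 64 points
print(local_grid(128, 256, 2, 4, rank=5))   # -> ((64, 128), (64, 128))
```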
Since the GPU (\(PE_2\)) can only process data residing in its global memory [55], in a generic parallel execution the host acquires the input data and sends it to the device memory, where the minimization of the 4DVar functional is computed concurrently. To avoid continuous, relatively slow data transfers from the host to the device, and in order to reduce the overhead, we decided to store the entire working data set on the device prior to any processing. Namely, the maximum value of \(D_s\) in (69) is chosen such that the amount of data related to each subdomain (denoted by \(Data_{mem}\) (Mbyte)) can be completely stored in the device memory.
If we assume that \(nloc_x=nloc_y\) and we let \(n_{loc}=nloc_x=nloc_y\), as the global GPU memory is 5 Gbyte, the usable values of \(n_{loc}\) are reported in Table 1. Values of the speed-up \(s_{nproc}^{loc}\), in terms of the gain obtained by using the GPU versus the CPU, are reported in Table 2. We note that the CUBLAS routines reduce, on average by a factor of 18, the execution time needed by a single CPU for the minimization part (Table 3).
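The memory-driven choice of the local grid size can be sketched as follows. The per-grid-point byte count is an assumption of ours for illustration (e.g. a handful of double-precision state and work arrays); the paper's Table 1 uses its own accounting:

```python
from math import isqrt

# Largest square local grid n_loc x n_loc whose working set fits in the
# GPU global memory, i.e. the largest n with n*n*bytes_per_point <= mem.
def max_nloc(mem_bytes, bytes_per_point):
    return isqrt(mem_bytes // bytes_per_point)

gpu_mem = 5 * 1024**3                              # 5 Gbyte of global memory
print(max_nloc(gpu_mem, bytes_per_point=10 * 8))   # -> 8192
```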
The outcome we get from these experiments is that the algorithm scales up according to the performance analysis (see Fig. 2). Indeed, as expected, as QP increases, the scale-up factor increases and the surface-to-volume ratio increases too, so that the performance gain tends to become stationary. This is the inherent trade-off between speed-up and efficiency of any software architecture.
8 Conclusions
We provided the whole computational framework of a space-time decomposition approach, including the mathematical framework, the numerical algorithm and, finally, its performance validation. We measured the performance of the algorithm using a simulation case study based on the SWEs on the sphere. Results presented here are just a starting point towards the assessment of the software scalability. More precisely, we introduced simplifications and assumptions appropriate for a proof-of-concept study, in order to get values of the measured scale-up of one iteration of the parallel algorithm. Anyway, the overall insight we get from these experiments is that the algorithm scales up according to the performance analysis. We are currently working on the development of a flexible framework ensuring efficiency and code readability, exploiting future technologies and equipped with a quantitative assessment of scalability [57]. In this regard, we could combine the proposed approach with the PFASST algorithm. Indeed, PFASST could be concurrently employed as the local solver of each reduced-space PDE-constrained optimization subproblem, exposing even more temporal parallelism.
This framework will allow designing, planning and running simulations to identify and overcome the limits of this approach.
Data Availability
Enquiries about data availability should be directed to the authors.
Change history
21 July 2022
The Missing Open Access funding information has been added in the Funding Note.
Notes
Although typical prognostic variables are temperature, salinity, horizontal velocity, and sea surface displacement, here, for simplicity of notation, we assume that \(u^b(t,x) \in \Re \).
In the following and throughout the paper, for simplicity, we use the notation \(j=1,K\) to indicate \(j=1, \ldots , K\).
For the nonlinear Navier–Stokes equations which we are considering here, the first-order linearization of \(\mathcal {M}^{\varDelta \times \varOmega }\) is formed by truncating at first order the Taylor series expansion of \(\mathbf {M}^{\varDelta _M \times \varOmega _{K}}\) about \(\mathbf {u}^b\) over the interval \(\varDelta _M\). In some cases this approach performs equally to the approach based on first linearizing the continuous model \(\mathcal {M}^{\varDelta \times \varOmega }\) and then discretizing [24].
Let \(\mathbf {A}:\mathbf {x}\rightarrow \mathbf {y}=\mathbf {A}\mathbf {x} \) be a linear operator on \(\Re ^{N}\) equipped with the standard euclidean norm. The operator \(\mathbf {A}^T:\mathbf {y}\rightarrow \mathbf {x}=\mathbf {A}^T\mathbf {y} \) such that
$$\begin{aligned} \langle \mathbf {y}, \mathbf {A}\mathbf {x}\rangle =\langle \mathbf {A}^T\mathbf {y},\mathbf {x}\rangle , \quad \forall \mathbf {x},\forall \mathbf {y} \end{aligned}$$(3)where \(\langle \cdot ,\cdot \rangle \) denotes the scalar product in \(\Re ^N\), is the adjoint of \(\mathbf {A}\).
If \(\mathbf {M}^{i-1,i}\) is the TLM of \(\mathcal {M}^{\varDelta \times \varOmega }\) in \([t_{i-1},t_i]\times \varOmega _{K}\), then it holds that:
$$\begin{aligned} (\mathbf {M}^{0,M-1})^T=(\mathbf {M}^{0,1}\cdot \mathbf {M}^{1,2}\cdots \mathbf {M}^{M-2,M-1})^T=(\mathbf {M}^{M-2,M-1})^T\cdots (\mathbf {M}^{1,2})^T (\mathbf {M}^{0,1})^T. \end{aligned}$$(4)If \(\mathbf {{C}}_{ji}=diag((\mathbf {B}^{-1})_{ji},(\mathbf {R}^{-1})_{ji})\), and \(\varvec{\tilde{d}}_{ji}^{l}=(\mathbf {{u}}_{ji}^{l}-\mathbf {{u}}_{0}^{b},\,\mathbf {H}_{ji}^{0}(\mathbf {{u}}_{ji}^{l})-\mathbf {{v}}_{ji}^{l}, \ldots , (\mathbf {{H}}^{M-1})_{ji}[(\mathbf {{M}}_{M-2,M-1}^{l})_{ji}(\mathbf {{u}}_{ji}^{l})]-\mathbf {{v}}_{ji}^{l})\), then \(\mathbf {J}_{ji}:=\frac{1}{2}((\mathbf {{C}}^{1/2})_{ji}\varvec{\tilde{d}}_{ji}^{l})^T((\mathbf {{C}}^{1/2})_{ji}\varvec{\tilde{d}}_{ji}^{l}) = \Vert \mathbf {F}_{ji}\Vert _2^2\), where \(\mathbf {F}_{ji}=(\mathbf {{C}}^{1/2})_{ji}\varvec{\tilde{d}}_{ji}^{l}\).
These assumptions hold true for the so-called local discretization schemes, i.e. those schemes where each grid point receives contribution from a neighborhood (for instance, using finite difference and finite volume discretization schemes as in [51]).
References
Antil, H., Heinkenschloss, M., Hoppe, R.H., Sorensen, D.C.: Domain decomposition and model reduction for the numerical solution of PDE constrained optimization problems with localized optimization variables. Comput. Vis. Sci. 13(6), 249–264 (2010)
Amaral, S., Allaire, D., Willcox, K.: A decompositionbased approach to uncertainty analysis of feedforward multicomponent systems. Int. J. Numer. Methods Eng. 100(13), 982–1005 (2014)
Arcucci, R., D’Amore, L., Pistoia, J., Toumi, R., Murli, A.: On the variational data assimilation problem solving and sensitivity analysis. J. Comput. Phys. 335, 311–326 (2017)
Arcucci, R., D’Amore, L., Carracciuolo, L., Scotti, G., Laccetti, G.: A decomposition of the Tikhonov regularization functional oriented to exploit hybrid multilevel parallelism. Int. J. Parallel Program. 45(5), 1214–1235 (2017)
Clerc, S.: Etude de schemas decentres implicites pour le calcul numerique en mecanique des fluides, resolution par decomposition de domaine. Ph.D. thesis, University Paris VI (1997)
Constantinescu, E., D’Amore, L.: A mathematical framework for domain decomposition approaches in 4DVar DA problems. H2020-MSCA-RISE-2015 NASDAC project, Report 12-2016. https://doi.org/10.13140/RG.2.2.34627.20002
D’Amore, L., Arcucci, R., Carracciuolo, L., Murli, A.: A scalable approach to three dimensional variational data assimilation. J. Sci. Comput. (2014). https://doi.org/10.1007/s10915-014-9824-2
Daget, N., Weaver, A.T., Balmaseda, M.A.: Ensemble estimation of backgrounderror variances in a threedimensional variational data assimilation system for the global ocean. Q. J. R. Meteorol. Soc. 135, 1071–1094 (2009)
D’Amore, L., Arcucci, R., Carracciuolo, L., Murli, A.: A scalable variational data assimilation. J. Sci. Comput. 61, 239–257 (2014)
D’Amore, L., Laccetti, G., Romano, D., Scotti, G.: Towards a parallel component in a GPU–CUDA environment: a case study with the L-BFGS Harwell routine. Int. J. Comput. Math. 93(1), 59–76 (2015)
D’Amore, L., Carracciuolo, L., Constantinescu, E.: Validation of a PETSc-based software implementing a 4DVar data assimilation algorithm: a case study related with an oceanic model based on shallow water equation. arXiv:1810.01361v2, Oct 2018
Dennis, J.E., Jr., Moré, J.J.: QuasiNewton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)
Dennis, J.E., Jr., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equation. SIAM, Philadelphia (1996)
Emmett, M., Minion, M.L.: Toward an efficient parallel in time method for partial differential equations. Commun. Appl. Math. Comput. Sci. 7, 105–132 (2012)
ECMWF Ocean ReAnalysis ORAS3. Available at: http://icdc.cen.uni-hamburg.de/projekte/easy-init/easy-init-ocean.html
Fisher, M., Gürol, S.: Parallelization in the time dimension of four-dimensional variational data assimilation. Q. J. R. Meteorol. Soc. https://doi.org/10.1002/qj.2996
Flatt, H.P., Kennedy, K.: Performance of parallel processors. Parallel Comput. 12, 1–20 (1989)
Gander, M.J.: 50 years of time parallel time integration. In: Carraro, T., Geiger, M., Körkel, S., Rannacher, R. (eds.) Multiple Shooting and Time Domain Decomposition Methods: MuSTDD, pp. 69–113. Springer International Publishing, Heidelberg (2015)
Gander, M.J., Kwok, F.: Schwarz methods for the timeparallel solution of parabolic control problems. Lect. Notes Comput. Sci. Eng. 104, 207–216 (2016)
Giering, R., Kaminski, T.: Recipes for adjoint code construction. ACM Trans. Math. Softw. 24(4), 437–474 (1998)
Gratton, S., Lawless, A.S., Nichols, N.K.: Approximate Gauss–Newton methods for nonlinear least square problems. SIAM J. Optim. 18(1), 106–132 (2007)
Gunther, S., Gauger, N.R., Schroder, J.B.: A nonintrusive parallelintime approach for simultaneous optimization with unsteady PDEs. arXiv:1801.06356v2 [math.OC] 28 Feb (2018)
Gurol, S., Weaver, A.T., Moore, A.M., Piacentini, A., Arango, H.G., Gratton, S.: B-preconditioned minimization algorithms for variational data assimilation with the dual formulation. Q. J. R. Meteorol. Soc. 140, 539–556 (2014)
Lawless, A.S., Gratton, S., Nichols, N.K.: On the convergence of incremental 4DVar using non tangent linear models. Q. J. R. Meteorol. Soc. 131, 459–476 (2005)
Le Dimet, F.X., Talagrand, O.: Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects. Tellus 38A, 97–110 (1986)
Levenberg, K.: A method for the solution of certain nonlinear problems in least squares. Q. Appl. Math. 2(2), 164–168 (1944)
Liao, Q., Willcox, K.: A domain decomposition approach for uncertainty analysis. SIAM J. Sci. Comput. 37(1), A103–A133 (2015)
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989)
Liu, J., Wang, Z.: Efficient time domain decomposition algorithms for parabolic PDEconstrained optimization problems. Comput. Math. Appl. 75(6), 2115–2133 (2018)
Marquardt, D.W.: An algorithm for the leastsquares estimation of nonlinear parameters. SIAM J. Appl. Math. 11(2), 431–441 (1963)
Miyoshi, T.: Computational challenges in big data assimilation with extremescale simulations, talk at BDEC workshop. Charleston, SC (2013)
Moore, A.M., Arango, H.G., Broquet, G., Powell, B.S., Weaver, A.T., ZavalaGaray, J.: The regional ocean modeling system (ROMS) 4dimensional variational data assimilation systems: Isystem overview and formulation. Prog. Oceanogr. 91, 34–49 (2011)
Moore, A.M., Arango, H.G., Broquet, G., Edwards, C.A., Veneziani, M., Powell, B.S., Foley, D., Doyle, J.D., Costa, D., Robinson, P.: The regional ocean modeling system (ROMS) 4dimensional variational data assimilation systems: II performance and application to the California current system. Prog. Oceanogr. 91, 50–73 (2011)
Moore, A.M., Arango, H.G., Broquet, G., Edwards, C.A., Veneziani, M., Powell, B.S., Foley, D., Doyle, J.D., Costa, D., Robinson, P.: The regional ocean modeling system (ROMS) 4dimensional variational data assimilation systems: III observation impact and observation sensitivity in the California current system. Prog. Oceanogr. 91, 74–94 (2011)
Moore, A.M., Arango, H.G., Di Lorenzo, E., Cornuelle, B.D., Miller, A.J., Neilson, D.J.: A comprehensive ocean prediction and analysis system based on the tangent linear and adjoint of a regional ocean model. Ocean Model. 7, 227–258 (2004)
Murli, A., D’Amore, L., Laccetti, G., Gregoretti, F., Oliva, G.: A multigrained distributed implementation of the parallel block conjugate gradient algorithm. Concurr. Comput. Pract. Exp. 22(15), 2053–2072 (2010)
Navon, I.M., De Villiers, R.: The application of the Turkel–Zwas explicit large time-step scheme to a hemispheric barotropic model with constraint restoration. Mon. Weather Rev. 115(5), 1036–1052 (1987)
Navon, I.M., Yu, J.: EXSHALL: a Turkel–Zwas explicit large time-step FORTRAN program for solving the shallow-water equations in spherical coordinates. Comput. Geosci. 17(9), 1311–1343 (1991)
Nerger, L., Hiller, W.: Software for ensemblebased data assimilation systems: implementation strategies and scalability. Comput. Geosci. 55, 110–118 (2013)
Neta, B., Giraldo, F.X., Navon, I.M.: Analysis of the Turkel–Zwas scheme for the two-dimensional shallow water equations in spherical coordinates. J. Comput. Phys. 133(1), 102–112 (1997). https://doi.org/10.1006/jcph.1997.5657
PDAF https://pdaf.awi.de
NEMO Web page www.nemoocean.eu
Nichols, N.K.: Mathematical concepts of data assimilation. In: Lahoz, W., Khattatov, B., Menard, R. (eds.) Data Assimilation: Making Sense of Observations, pp. 13–40. Springer, Cham (2010)
Nocedal, J., Wright, S.J.: Numerical Optimization. SpringerVerlag, Cham (1999)
Nocedal, J., Byrd, R.H., Lu, P., Zhu, C.: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23(4), 550–560 (1997)
Nvidia: TESLA K20 GPU active accelerator. Board spec. (2012). Available at http://www.nvidia.in/content/PDF/kepler/TeslaK20ActiveBD06499001v02.pdf
PCI-SIG, technology specifications at https://pcisig.com/specifications/pciexpress/
Rao, V., Sandu, A.: A timeparallel approach to strong constraint four dimensional variational data assimilation. J. Comput. Phys. 313, 583–593 (2016)
ROMS Web page www.myroms.org
Shchepetkin, A.F., McWilliams, J.C.: The regional oceanic modeling system (ROMS): a splitexplicit, freesurface, topographyfollowingcoordinate oceanic model. Ocean Model. 9, 347–404 (2005)
StCyr, A., Jablonowski, C., Dennis, J.M., Tufo, H.M., Thomas, S.J.: A comparison of two shallow water models with nonconforming adaptive grids. Mon. Weather Rev. 136, 1898–1922 (2008)
Ulbrich, S.: Generalized SQP methods with “Parareal” time-domain decomposition for time-dependent PDE-constrained optimization. In: Biegler, L.T., Ghattas, O., Heinkenschloss, M., Keyes, D., van Bloemen Waanders, B. (eds.) Real-Time PDE-Constrained Optimization. SIAM, Philadelphia (2017)
Arcucci, R., Carracciuolo, L., D’Amore, L.: On the problem-decomposition of scalable 4DVar data assimilation models. In: Proceedings of the 2015 International Conference on High Performance Computing and Simulation (HPCS 2015), Amsterdam, 20–24 July 2015, pp. 589–594, Article number 7237097
D’Amore, L., Marcellino, L., Mele, V., Romano, D.: Deconvolution of 3D fluorescence microscopy images using graphics processing units. In: Lecture Notes in Computer Science, vol. 7203, Part 1, pp. 690–699 (2012). 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011), 11–14 September 2011
D’Amore, L., Casaburi, D., Galletti, A., Marcellino, L., Murli, A.: Integration of emerging computer technologies for an efficient image sequences analysis. Integr. Comput.-Aided Eng. 18(4), 365–378 (2011). https://doi.org/10.3233/ICA-2011-0382
Murli, A., Boccia, V., Carracciuolo, L., D’Amore, L., Laccetti, G., Lapegna, M.: Monitoring and migration of a PETScbased parallel application for medical imaging in a grid computing PSE. IFIP International Federation for Information Processing, vol. 239, pp. 421–432 (2007)
Acknowledgements
This work was developed within the research activity of the H2020-MSCA-RISE-2016 NASDAC Project N. 691184. This work has been realized thanks to the use of the S.Co.P.E. computing infrastructure at the University of Naples.
Funding
Open access funding provided by Università degli Studi di Napoli Federico II within the CRUICARE Agreement.
Ethics declarations
Conflict of interest
The authors have not disclosed any competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
D’Amore, L., Constantinescu, E. & Carracciuolo, L. A Scalable Space-Time Domain Decomposition Approach for Solving Large Scale Nonlinear Regularized Inverse Ill-Posed Problems in 4D Variational Data Assimilation. J Sci Comput 91, 59 (2022). https://doi.org/10.1007/s10915-022-01826-7
DOI: https://doi.org/10.1007/s10915-022-01826-7
Keywords
 Data assimilation
 Space and time decomposition
 Scalable algorithm
 Inverse problems
 Nonlinear least squares problems