A scalable space-time domain decomposition approach for solving large-scale nonlinear regularized inverse ill-posed problems in 4D variational data assimilation

We develop innovative algorithms for solving the strong-constraint formulation of four-dimensional variational data assimilation in large-scale applications. We present a space-time decomposition approach that employs domain decomposition along both the spatial and temporal directions in the overlapping case and involves partitioning of both the solution and the operators. Starting from the global functional defined on the entire domain, we obtain a set of regularized local functionals on the subdomains, providing an order reduction of both the predictive and the data assimilation models. We analyze the convergence of the algorithm and its performance in terms of reduction of time complexity and algorithmic scalability. The numerical experiments are carried out on the shallow water equations on the sphere, according to the setup available at the Ocean Synthesis/Reanalysis Directory provided by Hamburg University.

State-of-the-art operational systems involve on the order of 10^8 model variables and the capacity to assimilate on the order of 10^6 observations. Various approaches have been proposed for reducing the complexity of assimilation methods to make them more computationally affordable while retaining their original accuracy. Ensemble approaches and reduced-order models are the most significant approximations. Other approaches take full advantage of existing solvers for partial differential equations (PDEs), based on spatial domain decomposition (DD) methods, where the DD solver is suitably modified to also solve the adjoint associated with the forward model. A different approach is the combination of DD methods in space and data assimilation (DA), where a spatial domain-decomposed uncertainty quantification approach performs DA at the local level by using Monte Carlo sampling [1,2,28]. The parallel data assimilation framework [42] implements parallel ensemble-based Kalman filters coupled with the PDE-model solver. These methods reduce the spatial dimensionality of the predictive model, and the resulting reduced-order model is then resolved in time via numerical integration, typically with the same time integrator and time step employed for the high-fidelity model, leading to high-precision time synchronization. In the past decades, parallel-in-time methods have been investigated for reducing the temporal dimensionality of evolutionary problems. Pioneering work includes that of Nievergelt (1964), who proposed the first time decomposition algorithm for finding parallel solutions of evolutionary ordinary differential equations, and that of Hackbusch (1984), who noted that relaxation operators in multigrid can be employed on multiple time steps simultaneously. Since then, time parallel time integration methods have been extensively expanded. A large literature list can be found at [48], which collects information about the community, methods, and software in the field of parallel-in-time integration methods. Recent efforts include the parallel full approximation scheme in space and time (PFASST), introduced by [15]. PFASST reduces the optimization overhead by integrating the PDE-based model directly into the optimization process, thus solving the PDE, the adjoint equations, and the optimization problem simultaneously. A nonintrusive framework for integrating existing unsteady PDE solvers into a parallel-in-time simultaneous optimization algorithm, using PFASST, is provided in [23]. Related parallel PDE solvers based on a Schwarz preconditioner in space-time are proposed in [20,30,54].
In this study we present the design of an innovative mathematical model and the development and analysis of the related numerical algorithms, based on the simultaneous introduction of space-time decomposition, in the overlapping case, on the PDEs governing the physical model and on the DA model. The core of our approach is that the DA model acts as a coarse predictor for the local PDE model by providing the background values as initial conditions of the local PDE models. Moreover, in contrast to other decomposition-in-time approaches, in our approach the local solvers (i.e., both the coarse and the fine solvers) run concurrently from the beginning. Consequently, the resulting algorithm requires only the exchange of boundary conditions between adjacent subdomains. The proposed method belongs to the so-called reduced-space optimization techniques, in contrast to full-space approaches such as the PFASST method, and reduces the runtime of the forward and backward integration time loops. Consequently, we could combine the proposed approach with the PFASST algorithm. Indeed, PFASST could be employed concurrently as the local solver of each reduced-space PDE-constrained optimization subproblem, exposing even more temporal parallelism.
Specific contributions of this work include (1) a novel decomposition approach in space-time leading to a reduced-order model of the coupled PDE-based 4D-Var DA problem; (2) strategies for computing the "kernels" of the resulting regularized nonlinear least squares computational problem; and (3) an a priori performance analysis that enables a suitable implementation of the algorithm in advanced computing environments. Results presented here are intended as the starting point for the software development, supporting decisions about computer architecture, future estimates of the problem size (e.g., the resolution of the model and the number of observations to be assimilated), and the performance and parallel scalability of the algorithms. The article is organized as follows. Section 2 gives a brief introduction to the data assimilation framework, where we follow the discretize-then-optimize approach. The main result is the 4D-Var functional decomposition, which is given in Section 3. In Section 4 we review the whole parallel algorithm; its performance analysis is discussed in Section 5 on the shallow water equations on the sphere. The number of state variables in the model, the number of observations in an assimilation cycle, and numerical parameters such as the discretization steps in the time and space domains are defined on the basis of a discretization grid using data from the Ocean Synthesis/Reanalysis Directory of Hamburg University [16]. A scalability prediction for the case study based on the shallow water equations is presented in Section 6. Our conclusions are provided in Section 7.

Data assimilation framework
We begin with a general DA problem setup and then consider a more convenient setup for describing the domain decomposition approach.
Let M_∆×Ω denote a forecast model described by nonlinear Navier-Stokes equations, where ∆ ⊂ ℝ is the time interval and Ω ⊂ ℝ^N is the spatial domain. If t ∈ ∆ denotes the time variable and x ∈ Ω the spatial variable, let u(t, x) be the function representing the solution of M_∆×Ω, which we assume belongs to the Hilbert space K(∆ × Ω) equipped with the standard Euclidean norm. Following [9], we assume that M_∆×Ω is symbolically described as an initial value problem whose initial condition is the value of the background at the initial time, u(t_0, x) = u^b(t_0, x). The function u^b(t, x) is referred to as the background state in ∆ × Ω. Let ∆_obs ⊂ ∆ denote the observation time interval and Ω_obs ⊂ ℝ^nobs, with Ω_obs ⊂ Ω, the observation spatial domain.
Let H denote the observation mapping, a nonlinear operator that includes transformations and grid interpolations. In line with the practical applications of model-based assimilation of observations, we use the following definition of a data assimilation problem associated with M_∆×Ω.
Definition 1 (DA problem setup) We consider the following setup. Throughout the paper, for simplicity, we use the notation j = 1, K to indicate j = 1, . . . , K. Let A : x → y = Ax be a linear operator on ℝ^N equipped with the standard Euclidean norm. The operator A^T : y → x = A^T y such that

⟨Ax, y⟩ = ⟨x, A^T y⟩ for all x, y,

where ⟨·, ·⟩ denotes the scalar product in ℝ^N, is the adjoint of A.
The aim of DA is to produce the optimal combination of the background and the observations throughout the assimilation window ∆_M; in other words, to find an optimal tradeoff between the background estimate u^b of the system state and the observations v. The best estimate that optimally fuses all this information is called the analysis, denoted by u^DA. It is then used as an initial condition for the next forecast.
Definition 2 (The 4D-Var DA problem: a regularized nonlinear least squares (RNL-LS) problem) Given the DA problem setup, the 4D-Var DA problem consists of computing the vector u^DA ∈ K such that

u^DA = argmin_{u ∈ K} J(u),    (5)

with

J(u) = λ ||u − u^b||²_{B^{-1}} + Σ_{k=0}^{M−1} ||H_k(M_{0,k}(u)) − v_k||²_{R_k^{-1}},    (6)

where λ > 0 is the regularization parameter; B and R_k (∀k = 0, . . . , M − 1) are the covariance matrices of the errors on the background and on the observations, respectively; M_{0,k} denotes the model propagation from t_0 to t_k; and ||·||_{B^{-1}} and ||·||_{R_k^{-1}} denote the corresponding weighted Euclidean norms.

♠
The first term in (6) quantifies the departure of the solution u^DA from the background state u^b. The second term measures the mismatch between the new trajectory and the observations v_k at each time t_k in the assimilation window. The weighting matrices B and R_k need to be predefined, and their quality influences the accuracy of the resulting analysis [3].
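For concreteness, a minimal numpy sketch of evaluating the functional (6) is given below; the function name fourdvar_cost and the callables model_step (one-step propagation of the model) and H are illustrative assumptions, not part of the paper's formulation.

```python
import numpy as np

def fourdvar_cost(u, u_b, obs, B_inv, R_inv, model_step, H, lam=1.0):
    """Evaluate the 4D-Var functional (6): lambda-weighted background
    mismatch plus observation mismatches along the assimilation window."""
    d_b = u - u_b
    J = lam * d_b @ B_inv @ d_b          # background term, B^{-1}-weighted
    x = u.copy()
    for v_k in obs:                      # loop over times t_0 .. t_{M-1}
        d_o = H(x) - v_k                 # innovation at time t_k
        J += d_o @ R_inv @ d_o           # observation term, R_k^{-1}-weighted
        x = model_step(x)                # advance the state to t_{k+1}
    return J

# Toy usage with identity model and observation operator:
n = 4
u_b = np.zeros(n)
obs = [0.1 * np.ones(n)] * 3
J = fourdvar_cost(np.zeros(n), u_b, obs, np.eye(n), np.eye(n),
                  model_step=lambda x: x, H=lambda x: x)
```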
This nonlinear least-squares problem is typically considered large scale, with K larger than 10^6. We next provide a mathematical formulation of a domain decomposition approach that starts from the decomposition of the whole domain ∆ × Ω (i.e., in both space and time); it uses a partitioning of the solution and a modified functional describing the RNL-LS problem on each subdomain of the decomposition. Equations enforcing solution continuity across subdomain boundaries are added as constraints of the assimilation functional. We first introduce the domain decomposition of ∆ × Ω and then define the restriction and extension operators on functions given on ∆ × Ω. These definitions are then generalized to ∆_M × Ω_K.

The space-time decomposition
In this section we give a precise mathematical setting for the space and operator decomposition. In particular, we introduce the functional and domain decomposition. Then, by using restriction and extension operators, we associate a functional decomposition with the domain decomposition. To this end, we prove the following result: the minimum of the global functional, defined on the entire domain, can be obtained by collecting the minima of the local functionals.

The space-time decomposition of the continuous 4D-Var DA model
For simplicity we assume that the spatial and temporal domains of the observations are the same as those of the background state, namely, ∆_obs = ∆ and Ω_obs = Ω; furthermore, we assume that t_k = τ_k.
Definition 3 (Domain decomposition) Let P ∈ ℕ and Q ∈ ℕ be fixed. The set of bounded Lipschitz domains Ω_i, i = 1, P, overlapping subdomains of Ω, is called a decomposition of Ω if

Ω = ∪_{i=1,P} Ω_i,  with Ω_i ∩ Ω_k =: Ω_ik ≠ ∅

when the two subdomains are adjacent. Similarly, the set of overlapping subdomains ∆_j, j = 1, Q, is a decomposition of ∆ if

∆ = ∪_{j=1,Q} ∆_j,  with ∆_j ∩ ∆_h =: ∆_jh ≠ ∅

when the two subdomains are adjacent. We denote by DD(∆ × Ω) the domain decomposition of ∆ × Ω given by the set of Q × P overlapping subdomains ∆_j × Ω_i:

DD(∆ × Ω) = {∆_j × Ω_i}_{j=1,Q; i=1,P}.    (11)

♠

From (11) it follows that ∆ × Ω = ∪_{j=1,Q; i=1,P} (∆_j × Ω_i).
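The index bookkeeping behind Definition 3 can be illustrated in a few lines of Python; the uniform split and fixed overlap width used in overlapping_ranges below are illustrative assumptions, not the paper's prescription.

```python
def overlapping_ranges(n, parts, overlap):
    """Split the index range 0..n-1 into `parts` overlapping pieces,
    each extended by `overlap` points into its neighbours (Definition 3)."""
    base = n // parts
    ranges = []
    for i in range(parts):
        lo = max(0, i * base - overlap)
        hi = min(n, (i + 1) * base + overlap)
        ranges.append((lo, hi))
    return ranges

# A DD(Delta x Omega) with Q temporal and P spatial subdomains is the
# Cartesian product of the two sets of ranges:
time_ranges  = overlapping_ranges(40, parts=4, overlap=2)    # Q = 4
space_ranges = overlapping_ranges(100, parts=2, overlap=5)   # P = 2
dd = [(tr, sr) for tr in time_ranges for sr in space_ranges] # Q*P subdomains
```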

Next we define the restriction operator on functions in K(∆ × Ω) associated with the decomposition (11).

Definition 4 (Restriction of a function) For f ∈ K(∆ × Ω), j = 1, Q, and i = 1, P, let

RO_ji : f ∈ K(∆ × Ω) → f_ji := f|_{∆_j × Ω_i} ∈ K(∆_j × Ω_i)

be the restriction operator (RO) of f in DD(∆ × Ω) as in (11). In line with this, given a set of Q × P functions g_ji, j = 1, . . . , Q, i = 1, . . . , P, each in K(∆_j × Ω_i), we define the extension operator of g_ji.

Definition 5 (Extension of a function) Let

EO_ji : g_ji ∈ K(∆_j × Ω_i) → g̃_ji ∈ K(∆ × Ω)

be the extension operator (EO) of g_ji in DD(∆ × Ω) as in (11), where g̃_ji coincides with g_ji on ∆_j × Ω_i and is extended by zero outside it. ♠

For any function u ∈ K(∆ × Ω), the set of restrictions u_ji = RO_ji(u) associated with the decomposition (11) defines, through the extension operators, a function ũ ∈ K(∆ × Ω) that coincides with u on ∆ × Ω.
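A minimal sketch of RO_ji and EO_ji on a gridded state, assuming functions in K(∆ × Ω) are stored as 2-D arrays indexed by (time, space); all names are illustrative.

```python
import numpy as np

def RO(f, t_rng, x_rng):
    """Restriction operator RO_ji (Definition 4): slice f on Delta_j x Omega_i."""
    (t0, t1), (x0, x1) = t_rng, x_rng
    return f[t0:t1, x0:x1]

def EO(g, t_rng, x_rng, shape):
    """Extension operator EO_ji (Definition 5): embed g into Delta x Omega,
    extending by zero outside Delta_j x Omega_i."""
    out = np.zeros(shape)
    (t0, t1), (x0, x1) = t_rng, x_rng
    out[t0:t1, x0:x1] = g
    return out

# EO is a right inverse of RO on the subdomain:
f = np.random.rand(40, 100)
g = RO(f, (0, 12), (0, 55))
assert np.allclose(RO(EO(g, (0, 12), (0, 55), f.shape), (0, 12), (0, 55)), g)
```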
The main outcome of this framework is the definition of the operator RO_ji for the 4D-Var functional defined in (6). This definition originates from the restriction of M_∆×Ω in (1), given as follows.

Definition 6 (Restriction of M_∆×Ω) If M_∆×Ω is defined in (1), we introduce the model M_{∆_j×Ω_i}, defined on the subdomain ∆_j × Ω_i, to be the restriction of M_∆×Ω to ∆_j × Ω_i.

Space-time decomposition of the discrete model
Definition 7 (Restriction of a covariance matrix) Let C(w) denote the covariance matrix of a random vector w of size K. With s < K, we define the restriction operator RO_st on C(w) as

RO_st[C(w)] := C(w_{RO_st}),

in other words, the covariance matrix defined on the restriction w_{RO_st} of w. ♠ Hereafter we refer to C(w_{RO_st}) using the notation C_st.
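In matrix terms, the restriction of Definition 7 amounts to extracting the rows and columns of C associated with the retained entries of w; a minimal sketch, with illustrative names:

```python
import numpy as np

def restrict_covariance(C, idx):
    """Restriction of a covariance matrix (Definition 7): keep the rows and
    columns of C associated with the restricted state entries idx, i.e.,
    C_st = C(w_{RO_st})."""
    return C[np.ix_(idx, idx)]

# Example: a 6x6 covariance restricted to entries {1, 2, 3}.
C = np.eye(6) + 0.1 * np.ones((6, 6))
C_st = restrict_covariance(C, np.arange(1, 4))   # 3x3 block of C
```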

Definition 8 (Restriction of the operator H) The restriction H_ji of the observation operator H to ∆_j × Ω_i is defined analogously by means of the operator RO_ji.
With these definitions, we are now able to construct the restriction of the entire cost functional. Let RO_ji[J] denote the restriction to ∆_j × Ω_i of the 4D-Var DA functional defined in (6). In other words, the approach we follow is first to decompose the 4D-Var functional J and then to locally linearize and solve each local functional J_ji.

For simplicity of notation we let J_ji := RO_ji[J]. We note that in (16), i.e., in RO_ji[J](u_ji), the first term quantifies the departure of the state u_ji from the background state u^b_ji at time t_j and location x_i. The second term measures the mismatch between the state u_ji and the observation v_ji.
♠ From (19) it follows that the 4D-Var problem can be defined as a set of local 4D-Var problems, as detailed in the following section.

Local 4D-Var DA problem: the local RNL-LS problem
Starting from the local 4D-Var functional in (17), which is obtained by applying the restriction operator to the 4D-Var functional defined in (6), we add a local constraint to the restriction. This is a form of regularization of the local 4D-Var functional, introduced in order to enforce the continuity of each solution of the local problems on the overlap region between adjacent subdomains. The local constraint consists of the overlapping operator O_(jh)(ik), defined in (20) as the composition (denoted by ∘) of two operators that tackle the overlapping of the solution in the spatial dimension and in the temporal dimension, respectively. More precisely, for j = 1, . . . , Q and i = 1, . . . , P, the operator O_(jh)(ik) represents the overlap of the temporal subdomains j and h and of the spatial subdomains i and k, where h and k are given as in Definition 4.

Remark 1 We observe that in the overlapping domain ∆_jh × Ω_ik we get two vectors: u_(jh)(ik), obtained as the restriction to that region of u_ji = argmin J_ji(u_ji), and u_(hj)(ki), which is the restriction to the same region of u_hk = argmin J_hk(u_hk). The order of the indexes plays a significant role from the computing perspective.
There are three basic cases to consider in (20); a sketch of the corresponding overlap penalty is given after this list.

1. Decomposition in space, namely, Q = 1 and P > 1. Here j = Q = 1 (i.e., the time interval is not decomposed) and P > 1 (i.e., the spatial domain Ω is decomposed according to the domain decomposition in (11)). The overlapping operator is defined as in (21).
2. Decomposition in time, namely, Q > 1 and P = 1. Here i = P = 1 (i.e., the spatial domain is not decomposed) and Q > 1 (i.e., the time interval is decomposed according to the domain decomposition in (11)). The overlapping operator is defined as in (22).
3. Decomposition in space-time, namely, Q > 1 and P > 1. Both the time interval and the spatial domain are decomposed according to the domain decomposition in (11), and the overlapping operator is defined as in (20).
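A minimal sketch of the overlap penalty induced by O_(jh)(ik), assuming local solutions are stored as 2-D arrays addressed by their global (time, space) index ranges; the helper names and the plain squared-mismatch form are illustrative assumptions.

```python
import numpy as np

def intersect(r1, r2):
    """Intersection of two (lo, hi) index ranges; None if the subdomains
    do not overlap (i.e., they are not adjacent)."""
    lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (lo, hi) if lo < hi else None

def overlap_penalty(u_ji, u_hk, t_ji, x_ji, t_hk, x_hk, mu):
    """Continuity term on Delta_jh x Omega_ik: mu times the squared mismatch
    of the two local solutions restricted to the shared region."""
    t_ov, x_ov = intersect(t_ji, t_hk), intersect(x_ji, x_hk)
    if t_ov is None or x_ov is None:
        return 0.0
    # Convert global overlap indices to local indices in each array.
    a = u_ji[t_ov[0]-t_ji[0]:t_ov[1]-t_ji[0], x_ov[0]-x_ji[0]:x_ov[1]-x_ji[0]]
    b = u_hk[t_ov[0]-t_hk[0]:t_ov[1]-t_hk[0], x_ov[0]-x_hk[0]:x_ov[1]-x_hk[0]]
    return mu * float(np.sum((a - b) ** 2))
```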
We now give the new definition of the local 4D-Var DA functional.

Definition 13 (Local 4D-Var DA) Given DD(∆ × Ω) as in (11), let J_ji, suitably defined on ∆_jh × Ω_ik, be the local 4D-Var functional in (24), obtained by adding to the restriction in (17) the overlap term weighted by the regularization parameter µ_ji; the three terms contributing to the definition of the local DA functional (background mismatch, observation mismatch, and overlap constraint) then come out clearly. ♠

Next we show that the absolute minimum of the operator J is found among the absolute minima of the local functionals.

Local 4D-Var DA minimization
Let ũ^DA, where u^DA_ji is defined in (24), be (the extension of) the minimum of the (global) minima of the local functionals J_ji as in (24), and let u^DA be its minimum. If DD(∆ × Ω) is given as in (11), then ũ^DA coincides with u^DA defined in (5).

Proof: Let u^DA_ji be defined in (24); it satisfies (29). From (29), and hence from (19) and (27), it follows that ũ^DA is a minimum of J on each subdomain. We now prove that, if J is convex, ũ^DA is its global minimum; suppose by contradiction that it is not.

From (35) and (27), and then from (14), we obtain (36). Equation (36) is a contradiction, because u^DA_ji is the global minimum of J_ji; therefore (28) is proved. □

We now introduce the algorithm solving the RNL-LS problem by using the space-time decomposition, in other words, solving the QP = Q × P local problems in ∆_j × Ω_i, where j = 1, Q and i = 1, P (see Figure 1 for an example of domain decomposition with Q = 4 and P = 2).
Definition 14 (DD-RNL-LS algorithm) The space-time RNL-LS parallel algorithm is defined as the merging of the QP = Q × P local algorithms A^loc_RNL-LS(∆_j × Ω_i). The DD-RNL-LS algorithm can be sketched as described by Algorithm 1; similarly, the local RNL-LS algorithm A^loc_RNL-LS is described by Algorithm 2.

Algorithm 1: A^DD_RNL-LS, the DD-RNL-LS algorithm
  ...
  6:  Compute the local model linearization
  7:  for j = 1, Q; i = 1, P do
  8:      l := 0, u^0_ji = u^b_ji
  9:      repeat
 10:          l := l + 1
 11:          Call Loc-RNLLS(in: ...; out: u^l_ji)
 12:          Exchange u^l_ji between adjacent subdomains
 13:      until ||u^l_ji − u^{l−1}_ji|| < eps

 14:      end for                  % End the domain decomposition step
 15:  Gather the u^l_ji: u^DA = argmin_ji J(u^l_ji)

The common approach for solving RNL-LS problems involves defining a sequence of local approximations of J_ji, where each member of the sequence is minimized by employing Newton's method or one of its variants (such as Gauss-Newton, L-BFGS, or Levenberg-Marquardt). Approximations of J_ji are obtained by expanding J_ji in a truncated Taylor series, while the minimum is obtained by using second-order sufficient conditions [14,45]. Let us consider Algorithm 2, solving the RNL-LS problem on ∆_j × Ω_i.

Algorithm 2: A^loc_RNL-LS on ∆_j × Ω_i
  2:  Initialize u^0_ji := u^b_ji;
  3:  Initialize l := 0;
  4:  repeat               % at each step l, a local approximation J̃_ji is minimized
  5:      Compute δu^l_ji = argmin J̃_ji
  6:      Update u^{l+1}_ji = u^l_ji + δu^l_ji
  7:      Update l := l + 1
  8:  until (convergence is reached)

The main computational task occurs at step 5 of Algorithm 2, the minimization of J̃_ji, the local approximation of J_ji. Two approaches can be employed in Algorithm 2:

(a) By truncating the Taylor series expansion of J_ji at the second order, we get the quadratic approximation of J_ji at u^l_ji,

J^QD_ji(u^l_ji + δu) = J_ji(u^l_ji) + ∇J_ji(u^l_ji)^T δu + (1/2) δu^T ∇²J_ji(u^l_ji) δu,

and Newton-based methods (including L-BFGS and Levenberg-Marquardt) use J̃_ji = J^QD_ji.

(b) By truncating the Taylor series expansion of J_ji at the first order, where F_ji denotes the residual operator of the local nonlinear least squares problem, we get the linear approximation J^TL_ji of J_ji at u^l_ji. Gauss-Newton methods (including truncated or approximated Gauss-Newton [22]) use J̃_ji = J^TL_ji.
Observe that from (38) it follows that Algorithm 2 can be updated to Algorithm 3, as described below, with the minimization at step 5 implemented in one of two ways:

(a) A^loc_QN: computes a local minimum of J^QD_ji following the Newton descent direction. The minimum is computed by solving the linear system involving the Hessian matrix ∇²J_ji and the negative gradient −∇J_ji at u^l_ji, for each value of l (see Algorithm 4 described below).
(b) A^loc_LLS: computes a local minimum of J^TL_ji following the steepest descent direction. The minimum is computed by solving the normal equations arising from the local linear least squares (LLS) problem (see Algorithm 5 described below; a minimal numerical sketch follows Remark 3).

Algorithm 4: A^loc_QN on ∆_j × Ω_i
  2:  Initialize u^0_ji := u^b_ji;
  3:  Initialize l := 0;
  4:  repeat
  5:      % Compute δu^l_ji = argmin J^QD_ji by Newton's method:
  6:      1.1 Compute ∇J_ji(u^l_ji) = ∇F_ji^T(u^l_ji) F_ji(u^l_ji)
  7:      1.2 Compute ∇²J_ji(u^l_ji) = ∇F_ji^T(u^l_ji) ∇F_ji(u^l_ji) + Q(u^l_ji)
  8:      1.3 Solve ∇²J_ji(u^l_ji) δu^l_ji = −∇J_ji(u^l_ji)
  9:      Update u^{l+1}_ji = u^l_ji + δu^l_ji
 10:      Update l := l + 1
 11:  until (convergence is reached)

Algorithm 5: A^loc_LLS on ∆_j × Ω_i
  2:  Initialize u^0_ji := u^b_ji;
  3:  Initialize l := 0;
  4:  repeat
  5:      Compute ∇J_ji(u^l_ji) = ∇F_ji^T(u^l_ji) F_ji(u^l_ji)
  6:      % Compute δu^l_ji = argmin J^TL_ji by solving the normal equations system:
  7:      Solve ∇F_ji^T(u^l_ji) ∇F_ji(u^l_ji) δu^l_ji = −∇F_ji^T(u^l_ji) F_ji(u^l_ji)
  8:      Update u^{l+1}_ji = u^l_ji + δu^l_ji
  9:      Update l := l + 1
 10:  until (convergence is reached)

Remark 3 We observe that if, in the A^loc_QN algorithm, the matrix Q(u^l_ji) (see step 1.2 of Algorithm 4) is neglected, we get the Gauss-Newton method described by the A^loc_LLS algorithm. More generally, the term Q(u^l_ji):
1. in the case of Gauss-Newton, is neglected;
2. in the case of Levenberg-Marquardt, equals λI, where the damping term λ > 0 is updated at each iteration and I is the identity matrix [27,31]; and
3. in the case of L-BFGS, is not formed explicitly: the Hessian matrix is rank-1 updated at every iteration [46].
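A minimal numerical sketch of the Gauss-Newton loop of Algorithm 5, for a generic residual F_ji supplied as a callable; a dense direct solve stands in for the iterative solvers used in practice, and all names are illustrative.

```python
import numpy as np

def gauss_newton(u0, F, jac_F, tol=1e-8, max_iter=50):
    """Local Gauss-Newton loop (Algorithm 5 sketch): at each step solve the
    normal equations  J^T J du = -J^T F  and update u <- u + du."""
    u = u0.astype(float)
    for _ in range(max_iter):
        r = F(u)                                   # residual F_ji(u^l)
        J = jac_F(u)                               # Jacobian  grad F_ji(u^l)
        du = np.linalg.solve(J.T @ J, -J.T @ r)    # normal equations (step 7)
        u = u + du
        if np.linalg.norm(du, np.inf) < tol:
            break
    return u

# Toy usage: fit exp(-c*t) to data; the single parameter is u = [c].
t = np.linspace(0.0, 1.0, 20)
data = np.exp(-1.3 * t)
F = lambda u: np.exp(-u[0] * t) - data
jac = lambda u: (-t * np.exp(-u[0] * t)).reshape(-1, 1)
c_hat = gauss_newton(np.array([0.5]), F, jac)      # converges near 1.3
```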
In accordance with the most common implementations of 4D-Var DA [16,51], we focus attention on the Gauss-Newton (G-N) method, described by A^loc_LLS and employed in Algorithm 6.
For each l, let G^l_ji = RO_ji[G^l], where G^l ∈ ℝ^{(M·nobs)×(NP·M)} is a block diagonal matrix. With this notation, line 7 of Algorithm 5 takes the form of (42) and (43), where B_ji and R_ji are the restrictions of the matrices B and R, respectively. The most popular 4D-Var DA software implements the so-called B-preconditioned Krylov subspace iterative method [22,24,51], which arises by using the background error covariance matrix as a preconditioner of a Krylov subspace iterative method.
Let B_ji = V_ji V_ji^T be expressed in terms of the deviance matrix V_ji, and let w_ji = V_ji^+ δu_ji be the transformed variable, with V_ji^+ the generalized inverse of V_ji. With this change of variables, (42) becomes (44) and (43) becomes (45), and the normal equation system (see line 7 of A^loc_LLS) is the corresponding linear system in w_ji; a minimal sketch of this preconditioned solve is given after Algorithm 6.

Definition 15 (DD-4D-Var algorithm) Let A^loc_4DVar(∆_j × Ω_i) denote the algorithm solving the local 4D-Var DA problem defined in ∆_j × Ω_i. The space-time 4D-Var DA parallel algorithm solving the 4D-Var DA problem in DD(∆_M × Ω_K) is symbolically denoted by A^DD_4DVar(∆_M × Ω_K), and it is defined as the union of the QP = Q × P local algorithms A^loc_4DVar(∆_j × Ω_i).

Algorithm 6: A^loc_4DVar solves the local 4D-Var DA problem in ∆_j × Ω_i
  1:  procedure Loc-4DVar(in: M_{∆_M×Ω_NP}, R, B, H, v, u^b, ∆_M, Ω_K; out: u^l_ji)
  2:      Initialize u^0_ji := u^b_ji;
  3:      Initialize l := 0;
  4:      repeat
  5:          Call TLM(in: M_∆×Ω, u^l_ji; out: M^l_{0,M−1})
  7:          Call ADJ(in: ...; out: ...)
  ...
              Update u^{l+1}_ji = u^l_ji + δu^l_ji
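A minimal sketch of the B-preconditioned (control-variable) solve behind (44)-(45), assuming B = V V^T and the linearized operator G are available as dense matrices; the regularization weights λ and µ_ji are omitted, and all names are illustrative.

```python
import numpy as np

def b_preconditioned_increment(V, G, R_inv, d):
    """Solve the normal equations in the transformed variable w = V^+ du:
    with du = V w and B = V V^T, the background term becomes the identity,
    giving  (I + V^T G^T R^-1 G V) w = V^T G^T R^-1 d."""
    m = V.shape[1]
    A = np.eye(m) + V.T @ G.T @ R_inv @ G @ V   # preconditioned Hessian
    b = V.T @ G.T @ R_inv @ d                   # preconditioned right-hand side
    w = np.linalg.solve(A, b)                   # solve in w-space
    return V @ w                                # du = V w, back to model space
```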
In the next section we will show that this formulation leads to local numerical solutions that converge to the numerical solution of the global problem.

Convergence analysis
In the following we assume ||·|| = ||·||_∞.
Proposition 1 Let u^{ASM,r}_ji be the approximation of the increment δu_ji of the solution u_ji obtained at step r of the ASM-based inner loop on ∆_j × Ω_i, and let u^l_ji be the approximation of u_ji obtained at step l of the outer loop, that is, of the space-time decomposition approach on ∆_j × Ω_i. Assume that the numerical scheme discretizing the model M^{j,j+1}_i is convergent. Then, with i and j fixed, the approximations u^l_ji converge to u_ji.

Proof: Let u^l_ji be the numerical solution of M^{j,j+1}_i at step l. Taking into account that, according to the incremental update of the solution of the 4D-Var DA functional (see, for instance, line 10 of Algorithm 7), the approximation u^l_ji is computed as u^l_ji = u^{l−1}_ji + δu^{l−1}_ji, the bound in (49) can be rewritten in terms of the ASM iterates. Convergence of ASM is proved in [5]; similarly, applying ASM to the 4D-Var DA problem, the inner iterates converge. The convergence behavior of the local solutions thus essentially depends on the rate of convergence of the truncation error of the discrete forecasting model (see [7] for the convergence analysis).

Performance Analysis
We use time complexity and scalability as performance metrics. Our aim is to highlight the benefits arising from using the decomposition approach instead of solving the problem on the whole domain. As we discuss later, the performance gain that we get from using the space and time decomposition approach is twofold.
1. Instead of solving one larger problem, we can solve several smaller problems that are better conditioned than the former problem. This approach leads to a reduction in each local algorithm's time complexity. 2. Subproblems reproduce the whole problem at smaller dimensions, and they are solved in parallel. This approach leads to a reduction in software execution time.
We give the following definition.
Definition 16 A uniform bidirectional decomposition of the space and time domain is one in which, if M × K is the size of the whole domain, each subdomain ∆_j × Ω_i has size D_t × D_s, where D_t = M/q ≥ 1 and D_s = K/p ≥ 1.

♠
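As a worked instance of Definition 16 (with illustrative numbers):

```python
# Uniform bidirectional decomposition (Definition 16):
M, K = 960, 4096          # time steps and spatial points of the whole domain
q, p = 4, 8               # temporal and spatial subdomains (QP = 32)
Dt, Ds = M // q, K // p   # each subdomain handles Dt x Ds = 240 x 512 points
```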
In the following we provide an estimate of the time complexity of each local algorithm, denoted by T(A^loc_4DVar(∆_j × Ω_i)). This algorithm consists of two loops: an outer loop, over the l-index, computing local approximations of J_ji, and an inner loop, over the m-index, performing the Newton or Lanczos steps. The major computational task to be performed at each step of the outer loop is the computation of J_ji; the major computational tasks at each step m of the inner loop, in the case of the G-N method (see Algorithm 5), are the evaluations of the tangent linear and adjoint models and the solution of the normal equations. Since the most time-consuming operation involving the predictive model is the computation of the tangent linear model, the estimate of T(A^loc_4DVar(∆_j × Ω_i)) follows by counting the tangent linear model evaluations over the two loops. Observe that l_max and m_max are the numbers of steps of the outer and inner loops of A^DD(∆_M × Ω_K), respectively. Let A^G(∆_M × Ω_K) denote the algorithm used to solve problem (5) on the undecomposed domain, and let l_G and m_G denote the numbers of iterations of the outer and inner loops of A^G(∆_M × Ω_K), respectively. Then we have the following.

Definition 17 Let µ(J) denote the condition number of the DA operator J. ♠

Since it holds that [3]

µ(J^loc_4DVar) < µ(J_4DVar)  ∀ i, j,

the number of iterations of the A^DD_4DVar(∆_M × Ω_K) algorithm is always smaller than the number of iterations of the A^G_4DVar(∆_M × Ω_K) algorithm. This is one of the benefits of using the space and time decomposition.

Algorithm scalability is measured in terms of strong scaling (the measure of the algorithm's capability to exploit the performance of high-performance computing architectures in order to minimize the time to solution for a problem of fixed dimension) and of weak scaling (the measure of the algorithm's capability to use additional computational resources effectively to solve increasingly larger problems). Various metrics have been developed to assist in evaluating the scalability of a parallel algorithm; speedup, throughput, scale-up, and efficiency are the most used. Each one highlights specific requirements and limits of the parallel algorithm. In our case, since we focus mainly on the benefits arising from the use of hybrid computing architectures, we consider the so-called scale-up factor, first introduced in [8]. The first result derives straightforwardly from the definition of the scale-up factor.

Proposition 3 (DD-4D-Var scale-up factor) The (relative) scale-up factor of A^DD_4DVar(∆_M × Ω_K), where QP := q × p is the number of subdomains, is given by (55). ♣

From (55) it follows that, considering one iteration of the whole parallel algorithm, the growth of the scale-up factor is essentially one order less than the time complexity of the reduced model. In other words, the time complexity of the reduced model mostly impacts the scalability of the parallel algorithm. In particular, since the parameter d is equal to 2, it follows that the asymptotic scale-up factor of the parallel algorithm, with respect to QP, is bounded above by two.
Besides the time complexity, scalability is also affected by the communication overhead of the parallel algorithm. The surface-to-volume ratio is a measure of the amount of data exchange (proportional to the surface area of the domain) per unit operation (proportional to the volume of the domain). Let S(A^loc_4DVar) denote the surface of a local subdomain and V(A^loc_4DVar) its volume. The surface grows with the subdomain boundary, of order D_t + D_s, while the volume grows with the number of local grid points, of order D_t · D_s, and (56) follows.
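A worked instance of the surface-to-volume tradeoff for one local subdomain (illustrative numbers; boundary and interior counts are taken proportional to D_t + D_s and D_t · D_s, respectively):

```python
# Surface-to-volume ratio of a local subdomain:
# communication ~ boundary points, computation ~ interior points.
Dt, Ds = 240, 512
surface = 2 * (Dt + Ds)      # points exchanged along the overlap boundaries
volume  = Dt * Ds            # grid points computed locally
ratio   = surface / volume   # decreases as the local size grows
print(f"{ratio:.4f}")        # 0.0122 for this local size
```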

Definition 18 (Measured software scale-up) Let Sc^meas_{1,QP} be the measured software scale-up in going from 1 to QP subdomains. If s^loc_nproc(A^loc_4DVar) = QP, then the measured scale-up behaves as in (57). ♠

We may conclude the following:
1. Strong scaling: if QP increases and M × K is fixed, the scale-up factor increases, but the surface-to-volume ratio also increases.
2. Weak scaling: if QP is fixed and M × K increases, the scale-up factor stagnates, and the surface-to-volume ratio decreases.
Thus, one needs to find the appropriate value of the number of subdomains, QP , giving the right tradeoff between the scale-up and the overhead of the algorithm.

Scalability results
The results presented here are just a starting point toward the assessment of the software scalability. More precisely, we introduce simplifications and assumptions appropriate for a proof-of-concept study in order to obtain values of the measured scale-up of one iteration of the parallel algorithm.
Since the main outcome of the decomposition is that the parallel algorithm is oriented to better exploit the high performance of new architectures, where concurrency is implemented both at the coarsest and at the finest levels of granularity, such as a distributed-memory multiprocessor (MIMD) and a graphics processing unit (GPU), we consider a distributed-computing environment located on the University of Naples Federico II campus and connected by a local-area network, made of the following:

- PE_1 (for the coarsest level of granularity): a MIMD architecture made of 8 nodes consisting of distributed-memory DELL M600 blades connected by 10 Gigabit Ethernet. Each blade consists of 2 Intel Xeon @ 2.33 GHz quad-core processors sharing the same local 16 GB of RAM, for a total of 8 cores per blade and 64 cores overall.
- PE_2 (for the finest level of granularity): a Kepler-architecture GK110 GPU [47], which consists of a set of 13 programmable single-instruction, multiple-data (SIMD) streaming multiprocessors (SMXs), connected to a quad-core Intel i7 CPU running at 3.07 GHz with 12 GB of RAM. For host (CPU)-to-device (GPU) memory transfers, CUDA-enabled graphics cards are connected to the PC motherboard via a PCI-Express (PCIe) bus [49]. For this architecture the maximum number of active threads per multiprocessor is 2,048, which means that the maximum number of active warps per SMX is 64.
Our implementation uses the matrix and vector functions in the Basic Linear Algebra Subroutines (BLAS) for PE_1 and the CUDA Basic Linear Algebra Subroutines (CUBLAS) library for PE_2. The routines used for computing the minimum of J on PE_1 and PE_2 are described in [29] and [11], respectively.

The case study is based on the shallow water equations (SWEs) on the sphere, given in (60)-(62). The SWEs have been used extensively as a simple model of the atmosphere or ocean circulation because they contain the essential wave propagation mechanisms found in general circulation models [53]. In (60)-(62), f is the Coriolis parameter given by f = 2Ω sin θ, where Ω is the angular speed of the rotation of the Earth; h is the height of the homogeneous atmosphere (or of the free ocean surface); u and v are the zonal and meridional wind (or ocean velocity) components, respectively; θ and λ are the latitudinal and longitudinal directions, respectively; a is the radius of the Earth; and g is the gravitational constant. We express the system of equations (60)-(62) in the compact form

∂Z/∂t = F(Z),    (63)

where Z is the state vector collecting u, v, and h, and F collects the right-hand sides of (60)-(62). We discretize (63) in space only, using an unstaggered Turkel-Zwas scheme [38,39], and we obtain the semi-discrete system

∂Z_disc/∂t = F_disc(Z_disc),    (64)

which is then advanced in time (a minimal time-stepping sketch is given at the end of this section). The numerical model depends on a combination of physical parameters: the number of state variables in the model, the number of observations in an assimilation cycle, and numerical parameters such as the discretization steps in time and in space are defined on the basis of a discretization grid using data available in the Ocean Synthesis/Reanalysis Directory of Hamburg University [16]. Our data assimilation experiments are initialized by choosing a snapshot from a run prior to the start of the assimilation experiment and treating it as the realization valid at the nominal time. Then the model state is advanced to the next time using the forecast model, and the observations are combined with the forecasts (i.e., the background) to produce the analysis. This process is iterated. As it proceeds, it fills gaps in sparsely observed regions, converts observations to improved estimates of model variables, and filters observation noise, all in a manner that is physically consistent with the dynamics of the ocean as represented by the model. In our experiments, the simulated observations are created by sampling the model states and adding random errors to those values. A detailed description of the simulation, together with the results and the software implemented, is presented in [12]. In the following, we focus mainly on performance results.
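Since (64) can be advanced in time by a standard explicit integrator, a minimal RK4 sketch is given below; the right-hand side F_disc is passed as a callable, and the function names are illustrative assumptions (the specific integrator used in the experiments is described in [12]).

```python
import numpy as np

def rk4_step(F, Z, dt):
    """One explicit RK4 step for the semi-discrete SWE system dZ/dt = F(Z)
    in (64); the spatial discretization (e.g., the Turkel-Zwas scheme) is
    hidden inside the right-hand side F."""
    k1 = F(Z)
    k2 = F(Z + 0.5 * dt * k1)
    k3 = F(Z + 0.5 * dt * k2)
    k4 = F(Z + dt * k3)
    return Z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy usage with a linear decay right-hand side:
Z1 = rk4_step(lambda Z: -Z, np.ones(3), 0.1)
```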
The reference domain decomposition strategy uses a one-to-one correspondence between the number of subdomains QP and the number of processing elements nproc (QP ↔ nproc).

Table 2 reports the values of the speedup s^loc_nproc, that is, the gain obtained by using the GPU versus the CPU; the measured values range from 3.3 · 10^0 up to 3.2 · 10^2, and the CUBLAS routines reduce, on average by a factor of 18, the execution time needed by a single CPU for the minimization part. Table 3 reports the weak scalability of one iteration of the parallel algorithm A^DD_4DVar with n_loc = 32, computed by using the measured software scale-up Sc^meas_{1,QP} defined in (57).

The outcome of these experiments is that the algorithm scales up according to the performance analysis (see Figure 2). Indeed, as expected, as QP increases the scale-up factor increases, but the surface-to-volume ratio increases too, so that the performance gain tends to become stationary. This is the inherent tradeoff between speedup and efficiency of any parallel software.

Conclusions
We provide a complete computational framework of a space-time decomposition approach for 4D-Var, including the mathematical framework, the numerical algorithm, and its performance validation. We measure the performance of the algorithm using a simulation case study based on the SWEs on the sphere. The results presented here are just a starting point toward the assessment of the software scalability. More precisely, we introduce simplifications and assumptions appropriate for a proof-of-concept study in order to measure the scale-up of one iteration of the parallel algorithm. The overall insight we get from these experiments is that the algorithm scales up according to the performance analysis.
We are currently working on the development of a flexible framework ensuring efficiency and code readability, exploiting future technologies, and including a quantitative assessment of scalability. In this regard, we could combine the proposed approach with the PFASST algorithm. Indeed, PFASST could be concurrently employed as a local solver of each reduced-space PDE-constrained optimization subproblem, exposing even more temporal parallelism. This framework will allow designing, planning, and running simulations to identify and overcome the limits of this approach.