Fast Parallel Solver for the Space-time IgA-DG Discretization of the Diffusion Equation

We consider the space-time discretization of the diffusion equation, using an isogeometric analysis (IgA) approximation in space and a discontinuous Galerkin (DG) approximation in time. Drawing inspiration from a previous spectral analysis, we propose for the resulting space-time linear system a multigrid-preconditioned GMRES method, which combines a preconditioned GMRES with a standard multigrid acting only in space. The performance of the proposed solver is illustrated through numerical experiments, which show its competitiveness in terms of iteration count, run-time, and parallel scaling.


Introduction
In recent years, with ever increasing computational capacities, space-time methods have received fast-growing attention from the scientific community. Space-time approximations of dynamic problems, in contrast to standard time-stepping techniques, enable full space-time parallelism on modern massively parallel architectures [27]. Moreover, they can naturally deal with moving domains [38,57-59,63] and allow for space-time adaptivity [1,24,28,39,47,49,61]. The main idea of space-time formulations is to treat the temporal dimension as an additional spatial one and assemble a large space-time system to be solved in parallel, as in [25]. Space-time methods have been used in combination with various numerical techniques, including finite differences [2,11,35], finite elements [4,26,37,40], isogeometric analysis [34,41], and discontinuous Galerkin methods [1,16,32,37,38,48,57,63]. Moreover, they have been considered for a variety of applications, such as mechanics [15], fluid dynamics [11,38,54], fluid-structure interaction [60], and many others. When dealing with space-time finite elements, the time direction needs special care. To ensure that information flows in the positive time direction, a particular choice of the basis in time is often used. The discontinuous Galerkin formulation with an "upwind" flux is a common choice in this context; see, for example, [38,51,57,62].
Specialized parallel solvers have recently been developed for the large linear systems arising from space-time discretizations. We mention in particular the space-time parallel multigrid proposed by Gander and Neumüller [29], the parallel preconditioners for space-time isogeometric analysis proposed by Hofer et al. [34], the fast diagonalization techniques proposed by Langer and Zank [42] and by Loli et al. [44], and the parallel proposal by McDonald and Wathen [46]. We also refer the reader to [56] for a recent review of space-time methods for parabolic evolution equations, and to [55] for algebraic multigrid methods.
The overall discretization process leads to solving a large space-time linear system. We propose a fast solver for this system in the case of maximal smoothness k = p − 1, i.e., the case corresponding to the classical IgA paradigm [3,9,17,36]. The solver is a preconditioned GMRES (PGMRES) method whose preconditioner P̃ is obtained as an approximation of another preconditioner P inspired by the spectral analysis carried out in [10]. Informally speaking, the preconditioner P̃ is a standard multigrid, applied only in space and not in time, which involves, at all levels, a single symmetric Gauss-Seidel post-smoothing step and standard bisection for the interpolation and restriction operators (following the Galerkin assembly). The proposed solver is thus a multigrid-preconditioned GMRES (MG-GMRES). Its performance is illustrated through numerical experiments and turns out to be satisfactory in terms of iteration count and run-time. In addition, the solver is suited for parallel computation, as it shows remarkable scaling properties with respect to the number of cores. Comparisons with other benchmark solvers are also presented and reveal the competitiveness of our proposal.
The paper is organized as follows. In Sect. 2, we briefly recall the space-time IgA-DG discretization of (1.1) and report the main result of [10] concerning the spectral distribution of the associated discretization matrix C. In Sect. 3, we present a PGMRES method for the matrix C, which is the root from which the proposed solver originated. In Sect. 4, we describe the proposed solver. In Sect. 5, we describe its parallel version. In Sect. 6, we illustrate its performance in terms of iteration count, run-time and scaling. In Sect. 7, we test it on a generalization of problem (1.1) where (0, 1)^d is replaced by a non-rectangular domain and the considered IgA discretization involves a non-trivial geometry. In Sect. 8, we draw conclusions. In order to keep this paper as concise as possible, we borrow notation and terminology from [10]. It is therefore recommended that the reader take a look at Sects. 1 and 2 of [10].

Space-time IgA-DG Discretization of the Diffusion Equation
Let N ∈ ℕ and n = (n_1, …, n_d) ∈ ℕ^d, and define the uniform partitions in time and space given by t_i = iΔt for i = 0, …, N, and x_i = iΔx for i = 0, …, n, where Δt = T/N and Δx = (Δx_1, …, Δx_d) = (1/n_1, …, 1/n_d). We consider for the differential problem (1.1) the same space-time discretization as in [10], i.e., we use a p-degree C^k IgA approximation in space based on the uniform mesh {x_i, i = 0, …, n} and a q-degree DG approximation in time based on the uniform mesh {t_i, i = 0, …, N}.
Here, p = (p_1, …, p_d) and k = (k_1, …, k_d) are multi-indices, with p_i and 0 ≤ k_i ≤ p_i − 1 representing, respectively, the polynomial degree and the smoothness of the IgA basis functions in direction x_i. As explained in [10, Sect. 3], the overall discretization process leads to a linear system (2.1) whose coefficient matrix C^{[q,p,k]}_{N,n}(K) has the block structure (2.2), where:
- the blocks appearing in (2.2) are (q+1)n̄ × (q+1)n̄ matrices, n̄ being the number of degrees of freedom (DoFs) in space (the total number of DoFs equals the size N(q+1)n̄ of the matrix C^{[q,p,k]}_{N,n}(K)); each block row in the block partition of C^{[q,p,k]}_{N,n}(K) given by (2.2) is referred to as a time slab;
- M_{n,[p,k]} and K_{n,[p,k]}(K) are the n̄ × n̄ mass and stiffness matrices in space, given by (2.5) and (2.6), respectively.
In the context of (nodal) DG methods [33], the time basis functions φ_{1,[q]}, …, φ_{q+1,[q]} are often chosen as the Lagrange polynomials associated with q + 1 fixed points {τ_1, …, τ_{q+1}} ⊆ [−1, 1], such as, for example, the Gauss-Lobatto or the right Gauss-Radau nodes in [−1, 1]. The solution of system (2.1) yields the approximate solution of problem (1.1); see [10] for details. The main result of [10], concerning the spectral distribution of the space-time matrices, is reported in Theorem 2.1 below.
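As an illustration, the Lagrange time basis can be constructed from any node set; the sketch below uses the Gauss-Lobatto nodes for q = 1, a node choice made here purely for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Construction of the Lagrange time basis from q+1 nodes in [-1, 1].
# Node choice is illustrative: for q = 1 the Gauss-Lobatto nodes are {-1, 1}.
def lagrange_basis(tau):
    """Return coefficient arrays of the Lagrange polynomials for nodes tau."""
    basis = []
    for i, ti in enumerate(tau):
        c = np.array([1.0])
        for j, tj in enumerate(tau):
            if j != i:
                # multiply by the factor (x - tau_j) / (tau_i - tau_j)
                c = P.polymul(c, np.array([-tj, 1.0])) / (ti - tj)
        basis.append(c)
    return basis

tau = np.array([-1.0, 1.0])                 # q = 1, Gauss-Lobatto
basis = lagrange_basis(tau)

# Cardinal property: ell_i(tau_j) = delta_ij.
vals = np.array([[P.polyval(tj, c) for tj in tau] for c in basis])
assert np.allclose(vals, np.eye(len(tau)))
```

The same routine applies verbatim to the right Gauss-Radau nodes once they are computed.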
Theorem 2.1 Let q ≥ 0 be an integer, let p ∈ ℕ^d and 0 ≤ k ≤ p − 1. Assume that K(x) is symmetric positive definite at every point x ∈ (0, 1)^d and that each component of K(x) is a continuous bounded function on (0, 1)^d. Suppose the following two conditions are met: n = αn, where α = (α_1, …, α_d) is a vector with positive components in ℚ^d; and n varies in some infinite subset of ℕ. Then, the sequence of normalized space-time matrices has the spectral distribution described by f^{[α,K]}_{[q,p,k]}, where: f^{[α,K]}_{[q,p,k]} is defined as in [10, eq. (5.12)]; T is the final time in (1.1); and M_{[q]} is given in (2.7).
With the same argument used to prove Theorem 2.1, it is not difficult to prove the following result.
Theorem 2.2 Suppose the hypotheses of Theorem 2.1 are satisfied, and let Then,

PGMRES for the Space-time IgA-DG System
Suppose the hypotheses of Theorem 2.1 are satisfied. Then, on the basis of Theorem 2.2 and the theory of (block) generalized locally Toeplitz (GLT) sequences [7,8,30,31,52,53], we expect that the sequence of preconditioned matrices (3.1), as well as the sequence of preconditioned matrices (3.2), has an asymptotic spectral distribution described by the preconditioned symbol. This means that the eigenvalues of the two sequences of matrices (3.1) and (3.2) are (weakly) clustered at 1; see [7, Sect. 2.4.2]. Therefore, in view of the convergence properties of the GMRES method [13] (see in particular [13, Theorem 2.13] and the original research paper by Bertaccini and Ng [14]), we may expect that the PGMRES with the first preconditioner has an optimal convergence rate, i.e., that the number of iterations needed to reach a preassigned accuracy ε is independent of (or only weakly dependent on) the matrix size. We may also expect that the same is true for the PGMRES with the second preconditioner P^{[q,p,k]}_{N,n}(K), because the two preconditioners coincide up to a negligible normalization factor Δt/2 and a spectrally equivalent factor. Indeed, the spectrum of the corresponding preconditioned matrix is contained in [c_q, C_q] for some positive constants c_q, C_q > 0 depending only on q. For instance, one can take c_q = λ_min(M_{[q]}) and C_q = λ_max(M_{[q]}), which are both positive as M_{[q]} is symmetric positive definite (see (2.7)).
To show that this expectation is realized, we solve system (2.1) in two space dimensions (d = 2), up to a precision ε = 10^{−8}, by means of the GMRES and of the PGMRES with preconditioner P^{[q,p,k]}_{N,n}(K), for p = (p, p), k = (k, k), and varying K(x), N, n, q, p, k; the total size of the space-time system (number of DoFs) is N(q+1)n̄. The resulting numbers of iterations are collected in Tables 1, 2, 3. We see from the tables that the GMRES solver rapidly deteriorates as n increases, and it is not robust with respect to p, k. On the other hand, the convergence rate of the proposed PGMRES is robust with respect to all spatial parameters n, p, k, though its performance is clearly better in the case where N is fixed (Tables 2, 3) than in the case where N increases (Table 1). An explanation of this phenomenon based on Theorem 2.1 is the following. In the case where N is fixed, the ratio N/n² converges to 0 much more quickly than in the case where N = n. Consequently, when N is fixed, the spectrum of the normalized space-time matrices is better described by the symbol f^{[α,K]}_{[q,p,k]} than when N = n. Similarly, the spectrum of the preconditioned matrix is better described by the preconditioned symbol. In conclusion, the eigenvalues of the preconditioned matrix are expected to be more clustered when N is fixed than when N = n.
In order to investigate the influence of q on the number of PGMRES iterations, we performed a further numerical experiment, reported in Table 4. We observe that the considered PGMRES is not robust with respect to q: the number of PGMRES iterations grows linearly with q. By comparing Tables 1 and 4, we note that the growth of the PGMRES iteration count is linear with respect to both N and q. In practice, increasing q is the most convenient way to improve the temporal accuracy of the discrete solution u; see, e.g., [12]. This is due to the superconvergence property, according to which the order of convergence in time of a q-degree DG method is 2q + 1 [19,43]. Tables 1 and 4 show that the strategy of keeping N fixed and increasing q is preferable also in terms of performance of the proposed PGMRES.
As is known, each PGMRES iteration requires solving a linear system with coefficient matrix given by the preconditioner P^{[q,p,k]}_{N,n}(K), which is not required in a GMRES iteration. Thus, to prove that the proposed PGMRES is fast, we have to show that we are able to solve efficiently a linear system with matrix P^{[q,p,k]}_{N,n}(K). However, for the reasons explained in Sect. 4, this is not exactly the path we will follow.
Before moving on to Sect. 4, we remark that, thanks to the tensor structure (3.3), the solution of a linear system with coefficient matrix P^{[q,p,k]}_{N,n}(K) reduces to the solution of N(q+1) linear systems with coefficient matrix K_{n,[p,k]}(K). Indeed, consider the system P^{[q,p,k]}_{N,n}(K)x = y, where y^T = [y_1^T, …, y_{N(q+1)}^T] and each y_i has length n̄. It is then clear that computing the solution x is equivalent to solving the N(q+1) linear systems K_{n,[p,k]}(K)x_i = y_i, i = 1, …, N(q+1). Note that the various x_i can be computed in parallel, as the computation of x_i is independent of the computation of x_j whenever i ≠ j.
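The block-wise solve can be sketched as follows; the matrix K below is an arbitrary symmetric positive definite stand-in for K_{n,[p,k]}(K), and all sizes are illustrative.

```python
import numpy as np

# Minimal sketch of the block-wise application of the tensor-product
# preconditioner P = I_{N(q+1)} (x) K, cf. eqs. (3.3)-(3.4); names illustrative.
rng = np.random.default_rng(0)

N, q, nbar = 4, 1, 6                   # time steps, DG degree, spatial DoFs
A = rng.standard_normal((nbar, nbar))
K = A @ A.T + nbar * np.eye(nbar)      # stand-in SPD "stiffness" matrix

P = np.kron(np.eye(N * (q + 1)), K)    # full preconditioner (never formed in practice)
y = rng.standard_normal(N * (q + 1) * nbar)

# Solve P x = y by solving N(q+1) independent systems K x_i = y_i.
blocks = y.reshape(N * (q + 1), nbar)
x_blockwise = np.linalg.solve(K, blocks.T).T.reshape(-1)  # all blocks at once

assert np.allclose(P @ x_blockwise, y)
```

Each column solve is independent, which is exactly what makes the parallel distribution of Sect. 5 possible.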

Fast Solver for the Space-time IgA-DG System
From here on, we focus on the maximal smoothness case k = p − 1, that is, the case corresponding to the classical IgA approach. For notational simplicity, we drop the subscript/superscript k = p − 1, so that, for instance, the matrices C^{[q,p,p−1]}_{N,n}(K) and P^{[q,p,p−1]}_{N,n}(K) are denoted by C^{[q,p]}_{N,n}(K) and P^{[q,p]}_{N,n}(K). The solver suggested in Sect. 3 for a linear system with matrix C^{[q,p]}_{N,n}(K) is a PGMRES with preconditioner P^{[q,p]}_{N,n}(K). According to (3.4), the solution of a linear system with matrix P^{[q,p]}_{N,n}(K), which is required at each PGMRES iteration, is equivalent to solving N(q+1) linear systems with matrix K_{n,[p]}(K). Fast solvers for K_{n,[p]}(K) proposed in recent papers (see [20-22] and references therein) might be employed here. However, using an exact solver for K_{n,[p]}(K) is not what we have in mind. Indeed, it was discovered experimentally that the PGMRES converges faster if the linear system with matrix P^{[q,p]}_{N,n}(K) occurring at each PGMRES iteration is solved inexactly. More precisely, when solving the N(q+1) linear systems with matrix K_{n,[p]}(K) occurring at each PGMRES iteration, it is enough to approximate their solutions by performing only a few standard multigrid iterations in order to achieve an excellent PGMRES run-time; in fact, a single multigrid iteration suffices. In view of these experimental findings, we propose to solve a linear system with matrix C^{[q,p]}_{N,n}(K) as follows.

Algorithm 1

1. Apply to the given system the PGMRES algorithm with preconditioner P^{[q,p]}_{N,n}(K).

2. The exact solution of the linear system with matrix P^{[q,p]}_{N,n}(K) occurring at each PGMRES iteration would require solving N(q+1) linear systems with matrix K_{n,[p]}(K), as per eq. (3.4).

3. Instead of solving these N(q+1) systems exactly, apply to each of them, starting from the zero vector as initial guess, μ multigrid (V-cycle) iterations involving, at all levels, a single symmetric Gauss-Seidel post-smoothing step and standard bisection for the interpolation and restriction operators (following the Galerkin assembly, in which the interpolation operator is the transpose of the restriction operator).
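Step 3 can be sketched as follows. This is a minimal two-level illustration, not the actual implementation: a 1D Laplacian stands in for K_{n,[p]}(K), and linear interpolation with prolongation equal to the restriction transpose stands in for the paper's restriction/interpolation pair; the Galerkin coarse operator and the single symmetric Gauss-Seidel post-smoothing step per level follow the algorithm.

```python
import numpy as np

def sgs_smooth(A, x, b):
    """One symmetric Gauss-Seidel sweep (forward then backward) on A x = b."""
    D = np.diag(np.diag(A))
    x = np.linalg.solve(np.tril(A), b - (np.triu(A) - D) @ x)  # forward
    x = np.linalg.solve(np.triu(A), b - (np.tril(A) - D) @ x)  # backward
    return x

def vcycle(levels, b, x):
    """levels[l] = (A_l, R_l); R_l restricts level l to l+1 (None on coarsest)."""
    A, R = levels[0]
    if R is None:
        return np.linalg.solve(A, b)          # exact solve on the coarsest level
    r = b - A @ x
    e = vcycle(levels[1:], R @ r, np.zeros(R.shape[0]))
    x = x + R.T @ e                           # prolongation = R^T (Galerkin setup)
    return sgs_smooth(A, x, b)                # single post-smoothing step

# Two-level toy hierarchy: 1D Dirichlet Laplacian and linear interpolation.
n, nc = 15, 7
A0 = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Pr = np.zeros((n, nc))
for k in range(nc):
    Pr[2 * k, k], Pr[2 * k + 1, k], Pr[2 * k + 2, k] = 0.5, 1.0, 0.5
R0 = Pr.T
levels = [(A0, R0), (R0 @ A0 @ R0.T, None)]   # Galerkin coarse operator

b = np.ones(n)
x = np.zeros(n)
for _ in range(40):                           # repeated V-cycle iterations
    x = vcycle(levels, b, x)
assert np.linalg.norm(A0 @ x - b) < 1e-6 * np.linalg.norm(b)
```

In Algorithm 1 only μ such V-cycles are applied per inner system; the loop above merely confirms that each V-cycle is a convergent iteration.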
As we shall see in the numerics of Sect. 6, the choice μ = 1 yields the best performance of Algorithm 1. The proposed solver is not the PGMRES with preconditioner P^{[q,p]}_{N,n}(K), because at each iteration the linear system associated with P^{[q,p]}_{N,n}(K) is not solved exactly. However, the solver is still a PGMRES, with a different preconditioner P̃^{[q,p]}_{N,n}(K). To see this, let MG be the iteration matrix of the multigrid method used in step 3 of Algorithm 1 for solving a linear system with matrix K_{n,[p]}(K). Recall that MG depends only on K_{n,[p]}(K) and not on the specific right-hand side of the system to solve. If the system to solve is K_{n,[p]}(K)x_i = y_i, the approximate solution x̂_i obtained after μ multigrid iterations starting from the zero initial guess is given by x̂_i = (I − MG^μ)K_{n,[p]}(K)^{−1}y_i. Hence, the approximation x̂ computed by our solver in place of the exact solution (3.4) is a fixed linear function of y, and the associated matrix defines the preconditioner P̃^{[q,p]}_{N,n}(K). In conclusion, the proposed solver is the PGMRES with preconditioner P̃^{[q,p]}_{N,n}(K). From the expression of P̃^{[q,p]}_{N,n}(K), we can also say that the proposed solver is an MG-GMRES, that is, a PGMRES whose preconditioner is a standard multigrid applied only in space. A more precise notation for this solver could be MG_space-GMRES, but for simplicity we just write MG-GMRES.

Fig. 1 Row-wise partition of P^{[q,p]}_{N,n}(K) = I_{N(q+1)} ⊗ K_{n,[p]}(K) using ρ = N(q+1) − 1 processors (left) and ρ = N(q+1) + 1 processors (right), with N(q+1) = 4. For simplicity, we write "K" instead of "K_{n,[p]}(K)"
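The fact that the inexact inner solve still defines a genuine preconditioner rests on a simple identity: μ steps of any stationary iteration with iteration matrix E, started from the zero vector, compute x̂ = (I − E^μ)A^{−1}y, a fixed linear function of the right-hand side y. The check below uses weighted Jacobi as a stand-in for the multigrid iteration; all sizes and names are illustrative.

```python
import numpy as np

# mu steps of a stationary iteration from the zero vector compute
# xhat = (I - E^mu) A^{-1} y, i.e. a fixed linear operator applied to y.
rng = np.random.default_rng(1)
n, mu, omega = 8, 3, 0.5

B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                  # SPD test matrix
y = rng.standard_normal(n)

D_inv = np.diag(1.0 / np.diag(A))
E = np.eye(n) - omega * D_inv @ A            # weighted-Jacobi iteration matrix

x = np.zeros(n)                              # mu damped-Jacobi steps from zero
for _ in range(mu):
    x = x + omega * D_inv @ (y - A @ x)

closed_form = (np.eye(n) - np.linalg.matrix_power(E, mu)) @ np.linalg.solve(A, y)
assert np.allclose(x, closed_form)
```

With E replaced by the V-cycle iteration matrix MG, this is exactly the identity behind P̃^{[q,p]}_{N,n}(K).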

Fast Parallel Solver for the Space-time IgA-DG System
In Sect. 4, we described the sequential version of the proposed solver. The same version is also used when ρ ≤ N(q+1) processors are available, with the only difference that step 3 of Algorithm 1 is performed in parallel. In practice, s_i linear systems are assigned to the i-th processor for i = 1, …, ρ, with s_1 + … + s_ρ = N(q+1) and s_1, …, s_ρ approximately equal to each other according to a load-balancing principle. This is illustrated in Fig. 1 (left), which shows the row-wise partition of P^{[q,p]}_{N,n}(K) corresponding to the distribution of the N(q+1) systems among ρ = N(q+1) − 1 processors.
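The load-balancing rule can be sketched as follows: distribute the N(q+1) spatial systems among ρ ≤ N(q+1) processors so that the counts s_1, …, s_ρ differ by at most one. The helper name below is purely illustrative.

```python
# Balanced distribution of n_systems work items among rho processors:
# the first (n_systems mod rho) processors receive one extra system.
def balanced_counts(n_systems, rho):
    base, extra = divmod(n_systems, rho)
    return [base + 1 if i < extra else base for i in range(rho)]

# Example of Fig. 1 (left): N(q+1) = 4 systems on rho = 3 processors.
print(balanced_counts(4, 3))   # -> [2, 1, 1]
```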
If ρ > N(q+1) processors are available, we use a slight modification of the solver, suited for parallel computation. As before, the modification only concerns step 3 of Algorithm 1. Since we now have more processors than systems to be solved, after assigning one processor to each system, we still have ρ − N(q+1) unused processors. Following again a load-balancing principle, we distribute the unused processors among the N(q+1) systems, so that one system can now be shared between two or more processors; see Fig. 1 (right). Suppose that the system K_{n,[p]}(K)x = y is shared between σ processors. The symmetric Gauss-Seidel post-smoothing iteration in step 3 of Algorithm 1 cannot be performed in parallel. Therefore, we replace it with its block-wise version. To be precise, we recall that the symmetric Gauss-Seidel iteration for a system with matrix E = L + U − D is just the preconditioned Richardson iteration with preconditioner M = LD^{−1}U.¹ Its block-wise version, in the case where we consider σ diagonal blocks E_1, …, E_σ of E, is simply the preconditioned Richardson iteration whose preconditioner is the block diagonal matrix with diagonal blocks M_1, …, M_σ. This block-wise version is suited for parallel computation when σ processors are available.
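Both the footnote's characterization of symmetric Gauss-Seidel and its block-wise variant can be verified directly. The sketch below uses a small, strongly diagonally dominant SPD matrix as a stand-in for the actual system; sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 8, 2
B = rng.standard_normal((n, n))
E = B @ B.T + 10 * n * np.eye(n)     # SPD, diagonally dominant test matrix
b = rng.standard_normal(n)

Ls, Us = np.tril(E, -1), np.triu(E, 1)           # strictly triangular parts
D = np.diag(np.diag(E))
L, U = Ls + D, Us + D                            # E = L + U - D as in the text

# 1) One symmetric Gauss-Seidel sweep equals Richardson with M = L D^{-1} U.
x0 = rng.standard_normal(n)
x1 = np.linalg.solve(L, b - Us @ x0)             # forward sweep
x2 = np.linalg.solve(U, b - Ls @ x1)             # backward sweep
M = L @ np.linalg.solve(D, U)
assert np.allclose(x2, x0 + np.linalg.solve(M, b - E @ x0))

# 2) Block-wise version: the same recipe on sigma diagonal blocks of E;
#    the sigma block solves are independent, hence parallel.
m = n // sigma
M_blk = np.zeros_like(E)
for i in range(sigma):
    s = slice(i * m, (i + 1) * m)
    Ei = E[s, s]
    M_blk[s, s] = np.tril(Ei) @ np.linalg.solve(np.diag(np.diag(Ei)), np.triu(Ei))

x = np.zeros(n)
for _ in range(200):                             # block-preconditioned Richardson
    x = x + np.linalg.solve(M_blk, b - E @ x)
assert np.linalg.norm(E @ x - b) < 1e-8 * np.linalg.norm(b)
```

Convergence of the block-wise Richardson iteration here relies on the strong diagonal dominance of the test matrix; it is a sanity check of the construction, not a convergence proof for the actual stiffness matrices.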
Fig. 2 The PETSc default row-wise partition does not account for the structure of the space-time problem; compare with Fig. 1

Numerical Experiments: Iteration Count, Timing and Scaling

In this section, we illustrate through numerical experiments the performance of the proposed solver and compare it to that of other benchmark parallel solvers, such as the PGMRES with block-wise ILU(0) preconditioner.

Implementation Details
For the numerics of this section, as well as throughout this paper, we used the C++ framework PETSc [5,6] and the domain-specific language Utopia [64] for the parallel linear algebra and solvers, together with the Cray MPICH implementation of MPI. For the assembly of high-order finite elements, we used the PetIGA package [18]. A parallel tensor-product routine was implemented to assemble the space-time matrices. The numerical experiments were performed on the Cray XC40 nodes of the Piz Daint supercomputer of the Swiss National Supercomputing Centre (CSCS).² The partition used features 1813 compute nodes, each holding two 18-core Intel Xeon E5-2695v4 (2.10 GHz) processors. We stress that the PETSc default row-wise partition follows a load-balancing principle and, except in the trivial case ρ = N, does not correspond to the row-wise partition described in Sect. 5; see Fig. 2. Therefore, the partition must be adjusted by the user. Alternatively, one can use a PETSc built-in class for sparse block matrices and specify the block size (q+1)n̄.

Experimental Setting
In the numerics of this section, we solve the linear system (2.1). The time basis functions φ_{1,[q]}, …, φ_{q+1,[q]} are chosen as the Lagrange polynomials associated with the right Gauss-Radau nodes in [−1, 1]. The values of K(x), N, n, q, p are specified in each example. For each solver considered herein, we use the tolerance ε = 10^{−8} and the PETSc default stopping criterion based on the preconditioned relative residual. Moreover, the PGMRES method is always applied with restart after 30 iterations, as per the PETSc default. Whenever we report the run-time of a solver, the time spent in I/O operations and matrix assembly is ignored. Run-times are always expressed in seconds. In all the tables below, the number of iterations needed by a given solver to converge within the tolerance ε = 10^{−8} is reported in square brackets next to the corresponding run-time. Throughout this section, we use the following abbreviations for the solvers.
- MG^L_{μ,ν}-GMRES: the proposed solver, as described in Sect. 4, with μ multigrid (V-cycle) iterations applied to K_{n,[p]}(K). Each multigrid iteration involves ν symmetric Gauss-Seidel post-smoothing steps at the finest level and one symmetric Gauss-Seidel post-smoothing step at the coarser levels. The choice ν = 1 corresponds to our solver proposal; different values of ν are considered for comparison purposes. The superscript L denotes the number of multigrid levels.
- TMG^L_{μ,ν}-GMRES: the same as MG^L_{μ,ν}-GMRES, with the only difference that the multigrid iterations are performed with the telescopic option, thus giving rise to the telescopic multigrid (TMG) [23,45]. This technique consists in reducing the number of processors used on the coarse levels, which can be beneficial for parallel multigrid performance. In the numerics of this section, we only reduced the number of processors used on the coarsest level, to one fourth of the number used on all other levels.

Iteration Count and Timing
Tables 5, 6, 7 illustrate the performance of the proposed solver in terms of number of iterations and run-time. It is clear from the tables that the best performance is obtained when applying to K_{n,[p]}(K) a single multigrid iteration (μ = 1) with only one smoothing step at the finest level (ν = 1). Moreover, the solver is competitive with respect to the ILU(0)-GMRES. Its worst performance relative to the ILU(0)-GMRES is attained in Table 6, where the diffusion matrix K(x_1, x_2) is singular at (x_1, x_2) = (0, 0). (In these experiments we used K(x) = I_2 and q = 0; in particular, N = 32 and n = 259 − p in Table 5; p = 3, N = 64 and n = 384 in Table 8, for a total of 64 · 385² DoFs; and p = 2 with (N, n) = (8, 65), (16, 129), (32, 256), (64, 512) in Table 9, where the ratio DoFs/cores is kept constant.)

Scaling
In the scaling experiments, besides the multigrid already considered above, we also employ a TMG for performance reasons. To stay within memory limits, we use at most 16 cores per node.
From Table 8 and Fig. 3 we see that the proposed solver, especially when using the TMG option, shows a nearly optimal strong scaling with respect to the number of cores.³ Table 9 and Fig. 4 illustrate the weak scaling properties of the proposed solver, which possesses a superior parallel efficiency with respect to the standard ILU(0) approach in terms of iteration count and run-time. For both solvers, however, the weak scaling is not ideal (constant run-time): this is because N grows from 8 to 64 and neither solver is robust with respect to N.

Fig. 4 Graphical representation of the run-times reported in Table 9

Non-rectangular Domain and Non-trivial Geometry

So far, the performance of the proposed solver has been illustrated for the diffusion problem (1.1) over the hypersquare (0, 1)^d. However, no special difficulty arises if (0, 1)^d is replaced by a non-rectangular domain Ω described (exactly) by a geometry map G : [0, 1]^d → Ω, as per the IgA paradigm. Indeed, as long as a tensor-product structure between space and time is maintained, the geometry map G acts as a reparameterization of Ω through (0, 1)^d, and the resulting discretization matrix is still given by (2.2)-(2.9), with the only differences that:
- a factor |det(J_G(x))| should be included in the integrand of (2.5), where J_G(x) is the Jacobian matrix of G(x);
- the matrix K(x) in (2.6) should be replaced by J_G(x)^{−1} K(G(x)) J_G(x)^{−T} |det(J_G(x))|.
In short, a change of domain from (0, 1)^d to Ω essentially amounts to a mere change of diffusion matrix from K to J_G^{−1} K(G) J_G^{−T} |det(J_G)|, which does not affect the performance of the proposed solver.
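The pullback can be sketched as follows. The quarter-annulus parameterization below is a standard one and may differ from the actual map (7.2) in the paper, so the radii and formulas are illustrative assumptions.

```python
import numpy as np

# Change of diffusion matrix induced by a geometry map G:
# K -> J_G^{-1} K(G) J_G^{-T} |det J_G|. Quarter annulus, radii r0 < r1.
r0, r1 = 1.0, 2.0

def G(xi, eta):
    r, th = r0 + (r1 - r0) * xi, 0.5 * np.pi * eta
    return np.array([r * np.cos(th), r * np.sin(th)])

def J_G(xi, eta):
    r, th = r0 + (r1 - r0) * xi, 0.5 * np.pi * eta
    return np.array([
        [(r1 - r0) * np.cos(th), -0.5 * np.pi * r * np.sin(th)],
        [(r1 - r0) * np.sin(th),  0.5 * np.pi * r * np.cos(th)],
    ])

def pulled_back_K(K_phys, xi, eta):
    """Evaluate J_G^{-1} K(G(x)) J_G^{-T} |det J_G| at a parametric point."""
    J = J_G(xi, eta)
    Jinv = np.linalg.inv(J)
    return Jinv @ K_phys(*G(xi, eta)) @ Jinv.T * abs(np.linalg.det(J))

K_phys = lambda x1, x2: np.eye(2)            # K(x) = I_2 as in Table 10
Keff = pulled_back_K(K_phys, 0.3, 0.7)

# The pulled-back diffusion matrix stays symmetric positive definite.
assert np.allclose(Keff, Keff.T)
assert np.all(np.linalg.eigvalsh(Keff) > 0)
```

Since the solver only sees the resulting variable coefficient, no other part of the method changes.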
In Table 10, we validate this claim by testing the solver on the linear system arising from the space-time IgA-DG discretization of (1.1) in the case where (0, 1)^d is replaced by a non-rectangular domain described by a non-trivial geometry map G : [0, 1]^d → Ω. The experimental setting is the same as in Sect. 6.2, with the only difference that (0, 1)² is now replaced by a quarter of an annulus, described by the geometry map (7.2). We used K(x) = I_2, q = 1, N = 20, n = 131 − p. We remark that the geometry map G is a common benchmark example in IgA; see, e.g., [20,21].

Conclusions
We have proposed an MG-GMRES solver for the space-time IgA-DG discretization of the diffusion problem (1.1). Through numerical experiments, we have illustrated the competitiveness of our proposal in terms of iteration count, run-time and parallel scaling. We have also shown its applicability to problems more general than (1.1), involving a non-rectangular domain and a non-trivial geometry map G. To conclude, we remark that the proposed solver is highly flexible. It could therefore be applied to other space-time discretizations, as long as a tensor-product structure between space and time is maintained.

Fig. 3 Graphical representation of the run-times reported in Table 8

B_{1,[p,k]}, …, B_{n(p−k)+k+1,[p,k]} are the tensor-product B-splines obtained from the univariate B-splines B_{1,[p_r,k_r]}, …, B_{n_r(p_r−k_r)+k_r+1,[p_r,k_r]} of degree p_r and smoothness C^{k_r} in direction x_r.

Let {X_m}_m be a sequence of matrices, with X_m of size d_m tending to infinity, and let f : D → C^{s×s} be a measurable matrix-valued function defined on a set D ⊂ R^ℓ with 0 < measure(D) < ∞. We say that {X_m}_m has an (asymptotic) spectral distribution described by f, and we write {X_m}_m ∼_λ f, if, for every continuous function F : C → C with compact support,

lim_{m→∞} (1/d_m) Σ_{j=1}^{d_m} F(λ_j(X_m)) = (1/measure(D)) ∫_D (1/s) Σ_{i=1}^{s} F(λ_i(f(x))) dx.

Informally, assuming there are no outliers, this means that the eigenvalues of X_m are approximately equal to the samples of the eigenvalue functions λ_1(f), …, λ_s(f) over a uniform grid in D, for m large enough. A function f : D → C^{s×s}, defined on a measurable set D ⊆ R^ℓ, is said to be measurable if its components f_{ij} : D → C, i, j = 1, …, s, are (Lebesgue) measurable.

Table 1
Number of iterations GM[p] and PGM[p] needed by, respectively, the GMRES and the PGMRES with preconditioner P for solving the linear system (2.1), up to a precision ε = 10^{−8}, in the case where p = (p, p)

Table 2
Number of iterations GM[p, k] and PGM[p, k] needed by, respectively, the GMRES and the PGMRES with preconditioner P

Table 3
Number of iterations GM[p, k]

Table 5
PGMRES iterations and run-time (using 64 cores) to solve the linear system (2.1) up to a precision of 10^{−8}, according to the experimental setting described in Sect. 6.2

Table 6
PGMRES iterations and run-time (using 64 cores) to solve the linear system (2.1) up to a precision of 10^{−8}, according to the experimental setting described in Sect. 6.2

Table 7
PGMRES iterations and run-time (using 64 cores) to solve the linear system (2.1) up to a precision of 10^{−8}, according to the experimental setting described in Sect. 6.2

Table 8
Strong scaling: PGMRES iterations and run-time to solve the linear system (2.1) up to a precision of 10^{−8}, according to the experimental setting described in Sect. 6.2

Table 9
Space-time weak scaling: PGMRES iterations and run-time to solve the linear system (2.1) up to a precision of 10^{−8}, according to the experimental setting described in Sect. 6.2

Table 10
PGMRES iterations and run-time (using 64 cores) to solve, up to a precision of 10^{−8}, the linear system arising from the space-time IgA-DG discretization of (1.1) in the case where (0, 1)^d is replaced by the domain (7.1) described by the geometry map (7.2). The experimental setting is the same as in Sect. 6.2