Abstract
To extend prevailing scaling limits when solving time-dependent partial differential equations, the parallel full approximation scheme in space and time (PFASST) has been shown to be a promising parallel-in-time integrator. Similar to space–time multigrid, PFASST is able to compute multiple timesteps simultaneously and is therefore particularly suitable for large-scale applications on high performance computing systems. In this work we couple PFASST with a parallel spectral deferred correction (SDC) method, forming an unprecedented doubly time-parallel integrator. While PFASST provides global, large-scale “parallelization across the step”, the inner parallel SDC method allows integrating each individual timestep “parallel across the method” using a diagonalized local Quasi-Newton solver. This new method, which we call “PFASST with Enhanced concuRrency” (PFASSTER), therefore exposes even more temporal concurrency. For two challenging nonlinear reaction-diffusion problems, we show that PFASSTER works more efficiently than the classical variants of PFASST and can use more processors than timesteps.
1 Introduction
The efficient use of modern high performance computing systems for solving space–time-dependent differential equations has become one of the key challenges in computational science. Exploiting the exponentially growing number of processors using traditional techniques for spatial parallelism becomes problematic when, for example, for a fixed problem size communication costs start to dominate. Parallel-in-time integration methods have recently been shown to provide a promising way to extend these scaling limits, see e.g. [7, 18, 22] to name but a few examples.
As one example, the “Parallel Full Approximation Scheme in Space and Time” (PFASST) by Emmett and Minion [6] allows one to integrate multiple timesteps simultaneously by using inner iterations of spectral deferred corrections (SDC) on a space–time hierarchy. It works on the so-called composite collocation problem, where each timestep includes a further discretization through quadrature nodes. This “parallelization across the steps” approach [3] targets large-scale parallelization on top of saturated spatial parallelization of partial differential equations (PDEs), where parallelization in the temporal domain acts as a multiplier for standard parallelization techniques in space. In contrast, “parallelization across the method” approaches [3] try to parallelize the integration within an individual timestep. While this typically results in smaller-scale parallelization in the time domain, parallel efficiency and applicability of these methods are often more favorable. Most notably, the “revisionist integral deferred correction method” (RIDC) by Christlieb et al. [4] makes use of integral deferred corrections (which are indeed closely related to SDC) in order to compute multiple iterations in a pipelined way. In [19], different approaches for parallelizing SDC across the method have been discussed, allowing the simultaneous computation of updates on multiple quadrature nodes. A much more structured and complete overview of parallel-in-time integration approaches can be found in [8]. The Parallel-in-Time community website (https://parallelintime.org) offers a comprehensive list of references.
The key goal of parallel-in-time integrators is to expose additional parallelism in the temporal domain in the cases where classical strategies like parallelism in space are either already saturated or not even possible. In [5] the classical Parareal method [15] is used to overcome the scaling limit of a space-parallel simulation of a kinematic dynamo on up to 1600 cores. The multigrid extension of Parareal, the “multigrid reduction in time” method (MGRIT), has been shown to provide significant speedup beyond spatial parallelization [7] for a multitude of problems. Using PFASST, a space-parallel N-body solver has been extended in [22] to run on up to 262,144 cores, while in [18] it has been coupled to a space-parallel multigrid solver on up to 458,752 cores.
So far, parallel-in-time methods have been implemented and tested either without any additional parallelization techniques or in combination with spatial parallelism. The goal for this work is to couple two different parallel-in-time strategies in order to extend the overall temporal parallelism exposed by the resulting integrator. To this end, we take the diagonalization idea for SDC presented in [19] (parallel across the method) and use it within PFASST (parallel across the steps). In this way we create an algorithm that computes approximations for different timesteps simultaneously but also works in parallel on each timestep itself. In doing so, we combine the advantages of both parallelization techniques and create the “Parallel Full Approximation Scheme in Space and Time with Enhanced concuRrency” (PFASSTER), an unprecedented doubly time-parallel integrator for PDEs. In the next section we will first introduce SDC and PFASST from an algebraic point of view, following [1, 2]. We particularly focus on nonlinear problems and briefly explain the application of a Newton solver within PFASST. Then, this Newton solver is modified in Sect. 3 so that by using a diagonalization approach the resulting Quasi-Newton method can be computed in parallel across the quadrature nodes of each timestep. In Sect. 4, we compare different variants of this idea to the classical PFASST implementation using two nonlinear reaction-diffusion test cases. We show parallel runtimes for different setups and evaluate the impact of the various Newton and diagonalization strategies. Section 5 concludes this work with a short summary and an outlook.
2 Parallelization across the steps with PFASST
We focus on an initial value problem
$$\begin{aligned} u_t = f(u), \quad u(0) = u_0, \end{aligned} \tag{1}$$
with \(u(t), u_0, f(u) \in \mathbb {R}\). In order to keep the notation simple, we do not consider systems of initial value problems for now, where \(u(t) \in \mathbb {R}^N\). Necessary modifications will be mentioned where needed. In a first step, we now discretize this problem in time and review the idea of single-step, time-serial spectral deferred corrections (SDC).
2.1 Spectral deferred corrections
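For one timestep on the interval \([t_l,t_{l+1}]\) the Picard formulation of Eq. (1) is given by
$$\begin{aligned} u(t) = u_{l,0} + \int _{t_l}^{t} f(u(s))\,ds, \quad t\in [t_l,t_{l+1}], \end{aligned}$$
where \(u_{l,0}\) denotes the initial value at \(t_l\).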
To approximate the integral we use a spectral quadrature rule. We define M quadrature nodes \(\tau _{l,1},\ldots ,\tau _{l,M}\), which are given by \(t_l \le \tau _{l,1}< \cdots < \tau _{l,M} = t_{l+1}\). We will in the following explicitly exploit the condition that the last node is equal to the right integral boundary. Quadrature rules like Gauß-Radau or Gauß-Lobatto quadrature satisfy this property. We can then approximate the integrals from \(t_l\) to the nodes \(\tau _{l,m}\), such that
$$\begin{aligned} u_{l,m} = u_{l,0} + \varDelta t\sum _{j=1}^{M} q_{m,j} f(u_{l,j}), \quad m = 1, \ldots , M, \end{aligned}$$
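where \(u_{l,m} \approx u(\tau _{l,m})\), \(\varDelta t= t_{l+1}-t_l\) and \(q_{m,j}\) represent the quadrature weights for the interval \([t_l,\tau _{l,m}]\) such that
$$\begin{aligned} \varDelta t\sum _{j=1}^{M} q_{m,j} f(u_{l,j}) \approx \int _{t_l}^{\tau _{l,m}} f(u(s))\,ds. \end{aligned}$$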
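We combine these M equations into one system
$$\begin{aligned} (\mathbf {I} - \varDelta t\mathbf {Q}\varvec{f})(\varvec{u}_l) = \varvec{u}_{l,0}, \end{aligned} \tag{3}$$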
which we call the “collocation problem”. Here, \(\varvec{u}_l = (u_{l,1}, \ldots , u_{l,M})^T \approx (u(\tau _{l,1}), \ldots , u(\tau _{l,M}))^T\in \mathbb {R}^M\), \(\varvec{u}_{l,0} = (u_{l,0}, \ldots , u_{l,0})^T\in \mathbb {R}^M\), \(\mathbf {Q}= (q_{ij})_{i,j}\in \mathbb {R}^{M\times M}\) is the matrix gathering the quadrature weights and the vector function \(\varvec{f}:\mathbb {R}^M \rightarrow \mathbb {R}^M\) is given by
$$\begin{aligned} \varvec{f}(\varvec{u}_l) = (f(u_{l,1}), \ldots , f(u_{l,M}))^T. \end{aligned}$$
To simplify the notation we define
We note that for \(u(t) \in \mathbb {R}^N\), we need to replace \(\mathbf {Q}\) by \(\mathbf {Q}\otimes \mathbf {I}_N\), where \(\otimes \) denotes the Kronecker product.
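As a minimal sketch of how such a quadrature matrix can be assembled, the following NumPy code integrates the Lagrange polynomials of the nodes from the left boundary to each node. It is not taken from pySDC; the function names `radau_right_nodes` and `quadrature_matrix` are illustrative assumptions, the nodes are scaled to the unit interval, and the Gauß-Radau nodes are obtained as the roots of \(P_{M-1} - P_M\) (Legendre polynomials), which include the right endpoint.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.legendre import Legendre


def radau_right_nodes(M):
    """Gauss-Radau nodes on [0, 1] that include the right endpoint."""
    # The roots of P_{M-1} - P_M on [-1, 1] contain x = 1.
    poly = Legendre.basis(M - 1) - Legendre.basis(M)
    x = np.sort(poly.roots().real)
    return 0.5 * (x + 1.0)  # map [-1, 1] to [0, 1]


def quadrature_matrix(nodes):
    """Q[m, j] = integral of the j-th Lagrange polynomial from 0 to nodes[m]."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        others = np.delete(nodes, j)
        scale = float(np.prod(nodes[j] - others))
        lagrange = Polynomial.fromroots(others) / scale
        antiderivative = lagrange.integ()  # vanishes at 0 by construction
        Q[:, j] = antiderivative(nodes)
    return Q


if __name__ == "__main__":
    print(quadrature_matrix(radau_right_nodes(2)))
```

For \(M = 2\) this reproduces the coefficient matrix of the two-stage Radau IIA method, and each row of \(\mathbf {Q}\) sums to the corresponding node because the Lagrange polynomials sum to one.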
System (3) is dense and a direct solution is not advisable, in particular if \(\varvec{f}\) is a nonlinear operator. The spectral deferred correction method solves the collocation problem in an iterative way. While it has been derived originally from classical deferred or defect correction strategies, we here follow [10, 17, 27] to present SDC as preconditioned Picard iteration. A standard Picard iteration is given by
for \(k = 0, \ldots , K\), and some initial guess \(\varvec{u}^{0}_l\).
In order to increase range and speed of convergence, we now precondition this iteration. The standard approach to preconditioning is to define an operator \(\mathbf {P}^{{\text {sdc}}}_{\varvec{f}}\), which is easy to invert but also close to the operator of the system. We define this “SDC preconditioner” as
so that the preconditioned Picard iteration reads
The key for defining \(\mathbf {P}^{{\text {sdc}}}_{\varvec{f}}\) is the choice of the matrix \(\mathbf {Q}_\varDelta \). The idea is to choose a “simpler” quadrature rule to generate a triangular matrix \(\mathbf {Q}_\varDelta \) such that solving System (4) can be done by forward substitution. Common choices include the implicit Euler method or the so-called “LU-trick”, where the LU decomposition of \(\mathbf {Q}^T\) with
$$\begin{aligned} \mathbf {Q}_\varDelta ^{LU} = \mathbf {U}^T \quad \text {for}\quad \mathbf {Q}^T = \mathbf {L}\mathbf {U} \end{aligned} \tag{5}$$
is used [27].
System (4) establishes the method of spectral deferred corrections, which can be used to approximate the solution of the collocation problem on a single timestep. In the next step, we will couple multiple collocation problems and use SDC to explain the idea of the parallel full approximation scheme in space and time.
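The preconditioned iteration can be illustrated for the linear test equation \(u' = \lambda u\). The sketch below assumes that one sweep solves \((\mathbf {I} - \varDelta t\lambda \mathbf {Q}_\varDelta )\varvec{u}^{k+1} = \varvec{u}_{l,0} + \varDelta t\lambda (\mathbf {Q}-\mathbf {Q}_\varDelta )\varvec{u}^{k}\), and it builds \(\mathbf {Q}_\varDelta \) from a pivot-free LU factorization of \(\mathbf {Q}^T\). The hard-coded matrix is the two-node Gauß-Radau quadrature matrix on the unit interval; the names `lu_trick` and `sdc_linear` are hypothetical.

```python
import numpy as np

# Two-node Gauss-Radau quadrature matrix (nodes 1/3 and 1 on the unit interval)
Q = np.array([[5 / 12, -1 / 12], [3 / 4, 1 / 4]])


def lu_trick(Q):
    """Return U^T, where Q^T = L U is the LU factorization without pivoting."""
    U = Q.T.astype(float).copy()
    n = U.shape[0]
    for k in range(n - 1):
        for i in range(k + 1, n):
            U[i, :] -= (U[i, k] / U[k, k]) * U[k, :]
    return U.T  # lower triangular


def sdc_linear(lam, dt, u0, Q, QD, sweeps):
    """Preconditioned Picard iteration for u' = lam * u on one timestep."""
    M = Q.shape[0]
    u_init = u0 * np.ones(M)
    u = u_init.copy()  # initial guess: copy the initial value to all nodes
    P = np.eye(M) - dt * lam * QD  # lower triangular preconditioner
    for _ in range(sweeps):
        rhs = u_init + dt * lam * (Q - QD) @ u
        # A dense solve keeps the sketch short; since P is triangular,
        # forward substitution would be sufficient.
        u = np.linalg.solve(P, rhs)
    return u


if __name__ == "__main__":
    QD = lu_trick(Q)
    print(sdc_linear(-10.0, 0.1, 1.0, Q, QD, sweeps=30))
```

The fixed point of this iteration is the solution of the collocation problem, independent of the choice of \(\mathbf {Q}_\varDelta \); only the speed of convergence changes.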
2.2 Parallel full approximation scheme in space and time
The idea of PFASST is to solve a “composite collocation problem” for multiple timesteps at once using multigrid techniques and SDC for each step in parallel. This composite collocation problem for L timesteps can be written as
where the matrix \(\mathbf {H}\in \mathbb {R}^{M\times M}\) on the lower subdiagonal transfers the information from one timestep to the next one. It takes the value of the last node \(\tau _{l,M}\) of an interval \([t_l, t_{l+1}]\), which is by requirement equal to the left boundary \(t_{l+1}\) of the following interval \([t_{l+1}, t_{l+2}]\), and provides it as a new starting value for this interval. Therefore, the matrix \(\mathbf {H}\) contains the value 1 on every position in the last column and zeros elsewhere. To write the composite collocation problem in a more compact form we define the vector \(\varvec{u} = (\varvec{u}_{1}, \ldots , \varvec{u}_{L})^T\in \mathbb {R}^{LM}\), which contains the solution at all quadrature nodes at all timesteps, and the vector \(\varvec{b} = (\varvec{u}_{0,0}, \varvec{0}, \ldots , \varvec{0})^T\in \mathbb {R}^{LM}\), which contains the initial condition for all nodes at the first interval and zeros elsewhere. We define \({\varvec{F}: \mathbb {R}^{LM} \rightarrow \mathbb {R}^{LM}}\) as an extension of \(\varvec{f}\) so that \({\varvec{F}} ({\varvec{u}}) = \left( {\varvec{f}} ({\varvec{u}}_{1}), \ldots , {\varvec{f}} ({\varvec{u}}_{L}) \right) ^T\). Then, the composite collocation problem can be written as
with
where the matrix \(\mathbf {E}\in \mathbb {R}^{L\times L}\) just has ones on the first subdiagonal and zeros elsewhere. If \(u \in \mathbb {R}^N\), we need to replace \(\mathbf {H}\) by \(\mathbf {H}\otimes \mathbf {I}_N\).
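For a linear right-hand side \(f(u) = \lambda u\), the structure described above can be written down with Kronecker products. The following sketch assumes the operator takes the form \(\mathbf {I} - \varDelta t\lambda (\mathbf {I}_L\otimes \mathbf {Q}) - \mathbf {E}\otimes \mathbf {H}\), which is how we read the description of \(\mathbf {E}\) and \(\mathbf {H}\); the function names are illustrative. It checks that solving the composite system at once gives the same values as stepping through the timesteps one after another.

```python
import numpy as np

# Two-node Gauss-Radau quadrature matrix on the unit interval
Q = np.array([[5 / 12, -1 / 12], [3 / 4, 1 / 4]])


def composite_matrix(lam, dt, Q, L):
    """Assumed linear composite collocation operator for L timesteps."""
    M = Q.shape[0]
    H = np.zeros((M, M))
    H[:, -1] = 1.0  # ones in the last column, zeros elsewhere
    E = np.diag(np.ones(L - 1), k=-1)  # ones on the first subdiagonal
    return np.eye(L * M) - dt * lam * np.kron(np.eye(L), Q) - np.kron(E, H)


def solve_composite(lam, dt, u0, Q, L):
    """Solve for all nodes on all timesteps at once."""
    M = Q.shape[0]
    b = np.zeros(L * M)
    b[:M] = u0  # initial condition on the first interval only
    return np.linalg.solve(composite_matrix(lam, dt, Q, L), b)


def solve_sequential(lam, dt, u0, Q, L):
    """Solve one collocation problem after the other."""
    M = Q.shape[0]
    blocks = []
    for _ in range(L):
        u = np.linalg.solve(np.eye(M) - dt * lam * Q, u0 * np.ones(M))
        blocks.append(u)
        u0 = u[-1]  # last node is the start of the next interval
    return np.concatenate(blocks)


if __name__ == "__main__":
    print(solve_composite(-1.0, 0.1, 1.0, Q, L=3))
```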
SDC can be used to solve the composite collocation problem by forward substitution in a sequential way, i.e., by solving one timestep after the other, using the previous solution as the initial value of the current timestep. The parallel-in-time integrator PFASST, on the other hand, solves the composite collocation problem by calculating on all timesteps simultaneously and is therefore an attractive alternative. The first step from SDC towards PFASST is the introduction of multiple levels, which are representations of the problem with different accuracies in space and time. In order to simplify the notation we focus on a two-level scheme consisting of a fine and a coarse level. Coarsening can be achieved for example by reducing the resolution in space, by decreasing the number of quadrature nodes on each interval or by solving implicit systems less accurately. Coarsening by reducing the number of quadrature nodes in particular does not seem worthwhile when the computations on these nodes are to be parallelized: the processors would no longer be fully utilized on the coarse level, and individual processors would have to communicate larger amounts of data. For this work, we only consider coarsening in space, i.e., by using a restriction operator R on a vector \(u\in \mathbb {R}^{N}\) we obtain a new vector \(\tilde{u}\in \mathbb {R}^{\tilde{N}}\). Vice versa, the interpolation operator T is used to interpolate values from \(\tilde{u}\) to u. Operators, vectors and numbers on the coarse level will be denoted by a tilde to avoid further index cluttering. Thus, the composite collocation operator on the coarse level is given by \( \tilde{\mathbf {C}}_{\varvec{F}}\). While \(\mathbf {C}_{\varvec{F}}\) is defined on \(\mathbb {R}^{LMN}\), \(\tilde{\mathbf {C}}_{\varvec{F}}\) acts on \(\mathbb {R}^{L M \tilde{N}}\) with \(\tilde{N} \le N\), but as before we will neglect the space dimension in the following notation.
The extension of the spatial transfer operators to the full space–time domain is given by \(\mathbf {R} = \mathbf {I}_{LM} \otimes R\) and \(\mathbf {T} = \mathbf {I}_{LM} \otimes T\).
The main goal of the introduction of a coarse level is to move the serial part of the computation to this hopefully cheaper level, while being able to run the expensive part in parallel. For that, we define two preconditioners: a serial one with a lower subdiagonal for the coarse level and a parallel, block-diagonal one for the fine level. The serial preconditioner for the coarse level is defined by
or, in a more compact way, by
Inverting this corresponds to a single inner iteration of SDC (a “sweep”) on step 1, then sending forward the result to step 2, an SDC sweep there and so on. The parallel preconditioner on the fine level then simply reads
Applying \(\mathbf {P}_{\varvec{F}}\) on the fine level leads to L decoupled SDC sweeps, which can be run in parallel.
For PFASST, these two preconditioners and the levels they work on are coupled using a full approximation scheme (FAS) known from nonlinear multigrid theory [25]. Following [1] one iteration of PFASST can then be formulated in four steps:

1. the computation of the FAS correction \({\tau }^k\), including the restriction of the fine value to the coarse level
$$\begin{aligned} {\tau }^k =\tilde{\mathbf {C}}_{\varvec{F}} (\mathbf {R} {\varvec{u}}^k) - \mathbf {R} \mathbf {C}_{\varvec{F}} ( {\varvec{u}}^k) , \end{aligned}$$
2. the coarse sweep on the modified composite collocation problem on the coarse level
$$\begin{aligned} \tilde{\mathbf {P}}_{\varvec{F}} (\tilde{\varvec{u}}^{k+1} )&= (\tilde{\mathbf {P}}_{\varvec{F}} - \tilde{\mathbf {C}}_{\varvec{F}})({\tilde{\varvec{u}}}^{k}) + \tilde{\varvec{b}} + \tau ^k, \end{aligned} \tag{7}$$
3. the coarse grid correction applied to the fine level value
$$\begin{aligned} \varvec{u}^{k+\frac{1}{2}}&= \varvec{u}^{k} + \mathbf {T} ( \tilde{\varvec{u}}^{k+1} - \mathbf {R} \varvec{u}^k ), \end{aligned} \tag{8}$$
4. the fine sweep on the composite collocation problem on the fine level
$$\begin{aligned} \mathbf {P}_{\varvec{F}} ( \varvec{u}^{k+1} )&= (\mathbf {P}_{\varvec{F}} - \mathbf {C}_{\varvec{F}})( \varvec{u}^{k+\frac{1}{2}} ) + \varvec{b} . \end{aligned} \tag{9}$$
In Fig. 1, we see a schematic representation of the described steps. The timestep-parallel procedure described here is the same for all PFASST versions that we will introduce later. It is common to use as many processors as timesteps: In the given illustration four processors work on four timesteps. Therefore the temporal domain is divided into four intervals, which are assigned to four processors \(P_0, \ldots , P_3\). Every processor performs SDC sweeps on its assigned interval on alternating levels. The big red blocks represent fine sweeps, given by Eq. (9), and the small blue blocks coarse sweeps, given by Eq. (7).
The coarse sweep over all intervals is a serial process: after a processor has finished its coarse sweeps, it sends forward its results to the next processor, which takes this result as an initial value for its own coarse sweeps. We see the communication in the picture represented by small arrows, which connect the coarse sweeps of each interval. In (7), the need for communication with a neighboring process is obvious, because \(\tilde{\mathbf {P}}_{\varvec{F}}\) is not a (block) diagonal matrix, but has entries on its lower block subdiagonal. \(\mathbf {P}_{\varvec{F}}\) on the other hand is block-diagonal, which means that the processors can calculate on the fine level in parallel. We see in (9) that there is only a connection to previous timesteps through the right-hand side, where we gather values from the previous timestep and iteration but not from the current iteration. The picture shows this connection by a fine communication, which forwards data from each fine sweep to the following fine sweep of the right neighbor. The fine and coarse calculations on every processor are connected through the FAS corrections, which in our formula are part of the coarse sweep.
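The four steps can be traced in matrix form for the linear test equation \(u' = \lambda u\). The sketch below is a strong simplification and rests on several assumptions: the composite operator has the form \(\mathbf {I} - \varDelta t\lambda (\mathbf {I}_L\otimes \mathbf {Q}) - \mathbf {E}\otimes \mathbf {H}\), the fine preconditioner is its block-diagonal analogue with \(\mathbf {Q}_\varDelta \) (implicit Euler here), the coarse preconditioner additionally carries the \(\mathbf {E}\otimes \mathbf {H}\) coupling, and no actual coarsening takes place (\(\mathbf {R} = \mathbf {T} = \mathbf {I}\)), so the FAS correction vanishes. Everything runs serially; the function name `pfasst_linear` is hypothetical.

```python
import numpy as np

Q = np.array([[5 / 12, -1 / 12], [3 / 4, 1 / 4]])  # two-node Gauss-Radau matrix
QD = np.array([[1 / 3, 0.0], [1 / 3, 2 / 3]])      # implicit Euler on nodes 1/3, 1


def pfasst_linear(lam, dt, u0, L, iters):
    """Two-level iteration in matrix form for u' = lam * u on L timesteps."""
    M = Q.shape[0]
    I_L, I_LM = np.eye(L), np.eye(L * M)
    H = np.zeros((M, M))
    H[:, -1] = 1.0
    E = np.diag(np.ones(L - 1), k=-1)
    C = I_LM - dt * lam * np.kron(I_L, Q) - np.kron(E, H)
    P = I_LM - dt * lam * np.kron(I_L, QD)  # fine level: block diagonal
    P_c = P - np.kron(E, H)                 # coarse level: serial coupling
    C_c, R, T = C, I_LM, I_LM               # assumption: no actual coarsening
    b = np.zeros(L * M)
    b[:M] = u0
    u = np.full(L * M, float(u0))           # initial guess
    for _ in range(iters):
        u_c = R @ u
        tau = C_c @ u_c - R @ (C @ u)                                    # step 1
        u_c_new = np.linalg.solve(P_c, (P_c - C_c) @ u_c + R @ b + tau)  # step 2
        u = u + T @ (u_c_new - u_c)                                      # step 3
        u = np.linalg.solve(P, (P - C) @ u + b)                          # step 4
    return u, np.linalg.norm(C @ u - b, np.inf)


if __name__ == "__main__":
    u, residual = pfasst_linear(-1.0, 0.1, 1.0, L=4, iters=10)
    print(u[-1], residual)
```

In an actual parallel run, the solve with the block-diagonal fine preconditioner decouples into one sweep per processor, whereas the solve with the coarse preconditioner corresponds to the serial forward sends described above.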
2.3 PFASST-Newton
For each coarse and each fine sweep within each PFASST iteration, System (7) and System (9), respectively, need to be solved. If f is a nonlinear function these systems are nonlinear as well. The obvious and traditional way to proceed in this case is to linearize the problem locally (i.e. for each timestep, at each quadrature node) using Newton’s method. This way, PFASST is the outer solver with an inner Newton iteration. For triangular \(\mathbf {Q}_\varDelta \), the mth equation on the lth timestep on the coarse level reads
where \(\tilde{u}^{k+1}_{0,0} = \tilde{u}_{0,0}\) and \(\tilde{\varvec{c}}(\tilde{\varvec{u}}^k)_{l,m}\) is the mth entry of the lth block of \(\tilde{\varvec{c}}(\tilde{\varvec{u}}^k) := (\tilde{\mathbf {P}}_{\varvec{F}} - \tilde{\mathbf {C}}_{\varvec{F}})({\tilde{\varvec{u}}}^{k}) + \tau ^k.\) This term gathers all values of the previous iteration. The first summand of the right-hand side of the coarse level equation corresponds to \(\tilde{\varvec{b}}\) and \(\tilde{\mathbf {H}}\), while the following sum comes from the lower triangular structure of \(\tilde{\mathbf {Q}}_\varDelta \).
For timestep l these equations can be solved one by one using Newton iterations and forward substitution. This is inherently serial, because the solution on the mth quadrature node depends on the solution at all previous nodes through the sum. Thus, while running parallel across the steps, each solution of the local collocation problem is found in serial. In the next section, we will present a novel way of applying Newton’s method, which allows one to parallelize this part across the collocation nodes, joining parallelization across the step with parallelization across the method.
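The serial, node-by-node structure can be sketched for a single timestep and a scalar nonlinear problem. The example assumes the right-hand side \(f(u) = -u^2\) (chosen only for illustration), implicit Euler as \(\mathbf {Q}_\varDelta \), and a sweep of the form \(\varvec{u}^{k+1} - \varDelta t\mathbf {Q}_\varDelta \varvec{f}(\varvec{u}^{k+1}) = \varvec{u}_{l,0} + \varDelta t(\mathbf {Q}-\mathbf {Q}_\varDelta )\varvec{f}(\varvec{u}^{k})\); the function name `newton_sweep` is hypothetical.

```python
import numpy as np

Q = np.array([[5 / 12, -1 / 12], [3 / 4, 1 / 4]])  # two-node Gauss-Radau matrix
QD = np.array([[1 / 3, 0.0], [1 / 3, 2 / 3]])      # implicit Euler on nodes 1/3, 1


def f(u):
    return -u**2  # illustrative nonlinear right-hand side


def df(u):
    return -2.0 * u


def newton_sweep(u_old, u0, dt, newton_iters=1):
    """One sweep: forward substitution with a scalar Newton solve per node."""
    M = len(u_old)
    rhs = u0 + dt * (Q - QD) @ f(u_old)  # gathers values of the previous iterate
    u_new = u_old.copy()
    for m in range(M):  # inherently serial: node m needs nodes 0..m-1
        known = rhs[m] + dt * QD[m, :m] @ f(u_new[:m])
        v = u_new[m]  # initial guess: previous iterate at this node
        for _ in range(newton_iters):
            g = v - dt * QD[m, m] * f(v) - known
            dg = 1.0 - dt * QD[m, m] * df(v)
            v -= g / dg
        u_new[m] = v
    return u_new


if __name__ == "__main__":
    u = np.full(2, 1.0)
    for _ in range(30):
        u = newton_sweep(u, u0=1.0, dt=0.1)
    print(u, np.linalg.norm(u - 1.0 - 0.1 * Q @ f(u)))
```

The loop over `m` cannot be distributed, because `known` depends on the freshly computed values at all earlier nodes.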
3 PFASSTER
From the perspective of a single timestep \([t_l, t_{l+1}]\) or processor \(P_l\), Eq. (7) on the coarse level for this step reads
where \(\tau ^k_{l}\) is the lth component of \(\tau ^k\), belonging to the interval \([t_l, t_{l+1}]\). Note that the serial dependency is given by the term \(\tilde{\varvec{u}}_{l,0}^{k+1}\), so that it does not depend on the solution \(\tilde{\varvec{u}}_{l}^{k+1}\) of this equation and can thus be considered as part of a given right-hand side. On the fine level, this is even simpler, because there we have to solve
where the \(\varvec{u}_{l,0}^{k+\frac{1}{2}}\)-term is independent of the current iteration (which, of course, leads to the parallelism on the fine level).
As we have seen above, the typical strategy would be to solve these systems line by line, node by node, using forward substitution and previous PFASST iterates as initial guesses. An alternative approach has been presented in [19], where each SDC iteration can be parallelized across the nodes. While this is trivial for linear problems, nonlinear ones require the linearization of the full equations, not node-wise as before. For the fine sweep, let
then a Newton step for \(\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}) = 0\) is given by
for Jacobian matrix \(\nabla \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j)\) of \(\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}\) evaluated at \(\varvec{v}^j\). We have
for Jacobian matrix \(\nabla \varvec{f}(\varvec{v}^j)\) of \(\varvec{f}\) evaluated at \(\varvec{v}^j\) which in turn is given by
There is still no parallelism to exploit, but when we replace the full Jacobian matrix \(\nabla \varvec{f}(\varvec{v}^j)\) by the approximation \(f'(v_{l,0})\mathbf {I}_M\), which is the derivative of f at the initial value for the current timestep, we can use
to establish a Quasi-Newton iteration as
This decouples the evaluation of the Jacobian matrix from the current quadrature nodes and now \(\mathbf {Q}_\varDelta \) can be diagonalized, so that the inversion of \(\nabla \mathbf {G}_{\varvec{f}}^{\varDelta \text {-}\mathrm {QN}}(v_{l,0})\) can be parallelized across the nodes. Note that there are other options for approximating the full Jacobian matrix. Most notably, in [9] the mean over all Jacobian matrices is used (there across the timesteps). We did not see any impact on the convergence when following this strategy, most likely because the number of quadrature nodes is typically rather low. The advantage of using the initial value is that it reduces the number of evaluations of the Jacobian matrix, which also includes communication time.
Provided that \(\mathbf {Q}_\varDelta \) is diagonalizable, we can decompose it by \(\mathbf {Q}_\varDelta =\mathbf {V}_\varDelta \varvec{\Lambda }_\varDelta \mathbf {V}^{-1}_\varDelta \), where \(\varvec{\Lambda }_\varDelta =\mathrm {diag}((\mathbf {Q}_\varDelta )_{ii})\) contains the eigenvalues \((\mathbf {Q}_\varDelta )_{ii} \in \mathbb {R}\) of \(\mathbf {Q}_\varDelta \) and \(\mathbf {V}_\varDelta \) contains its eigenvectors.
Using the given diagonalization the algorithm reads:

1. replace \(\varvec{r}^j = \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^{j})\) by \(\bar{\varvec{r}}^j = \mathbf {V}_\varDelta ^{-1}\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^{j})\) (serial),
2. solve \(\left( \mathbf {I} - f'(v_{l,0})\varDelta t\varvec{\Lambda }_\varDelta \right) \bar{\varvec{e}}^{j} = \bar{\varvec{r}}^j\) (parallel in M),
3. replace \(\bar{\varvec{e}}^{j}\) by \(\varvec{e}^{j} = \mathbf {V}_\varDelta \bar{\varvec{e}}^{j}\) (serial),
4. set \(\varvec{v}^{j+1} = \varvec{v}^{j} + \varvec{e}^{j}\) (parallel in M).
This can be iterated until a certain threshold is reached, after which we set \(\varvec{u}^{k+1}_{l} = \varvec{v}^J\) to obtain the solution of the equation for the fine sweep. On the coarse level, the procedure is very similar, with a slightly different definition of \(\tilde{\mathbf {G}}_{\tilde{\varvec{f}}}^{{\text {sdc}}}(\tilde{\varvec{v}})\). In practice, choosing only a single Newton iteration (i.e. \(J=1\)) is sufficient, because this is only the inner solver for an outer PFASST iteration. In all cases we have studied so far, using more inner iterations does not lead to a faster overall method.
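The four steps can be sketched for a scalar problem on a single timestep. The code below makes several assumptions: the right-hand side is \(f(u) = -u^2\), \(\mathbf {Q}_\varDelta \) is the implicit Euler matrix, and the residual is taken with the sign convention \(\varvec{r}^j = \text {rhs} - (\varvec{v}^j - \varDelta t\mathbf {Q}_\varDelta \varvec{f}(\varvec{v}^j))\), so that adding the correction reduces it. The name `quasi_newton_sweep` is hypothetical, and the "parallel in M" steps are written as vectorized operations whose entries are independent of each other.

```python
import numpy as np

Q = np.array([[5 / 12, -1 / 12], [3 / 4, 1 / 4]])  # two-node Gauss-Radau matrix
QD = np.array([[1 / 3, 0.0], [1 / 3, 2 / 3]])      # implicit Euler on nodes 1/3, 1

# Eigendecomposition of Q_Delta, computed once; the eigenvalues are its diagonal.
lam_D, V = np.linalg.eig(QD)
V_inv = np.linalg.inv(V)


def f(u):
    return -u**2  # illustrative nonlinear right-hand side


def df(u):
    return -2.0 * u


def quasi_newton_sweep(u_old, u0, dt, J=1):
    """One sweep with J diagonalized Quasi-Newton iterations."""
    rhs = u0 + dt * (Q - QD) @ f(u_old)  # values of the previous iterate
    jac0 = df(u0)                        # frozen derivative at the initial value
    v = u_old.copy()
    for _ in range(J):
        r = rhs - (v - dt * QD @ f(v))
        r_bar = V_inv @ r                           # step 1 (serial)
        e_bar = r_bar / (1.0 - jac0 * dt * lam_D)   # step 2 (independent per node)
        e = V @ e_bar                               # step 3 (serial)
        v = v + e                                   # step 4 (independent per node)
    return v


if __name__ == "__main__":
    u = np.full(2, 1.0)
    for _ in range(30):
        u = quasi_newton_sweep(u, u0=1.0, dt=0.1)
    print(u, np.linalg.norm(u - 1.0 - 0.1 * Q @ f(u)))
```

For a PDE, the division in step 2 becomes one linear system in space per quadrature node, which is where the node-parallel work is done.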
This linearization and diagonalization strategy immediately suggests a second approach: instead of using \(\mathbf {Q}_\varDelta \) for the preconditioner, we can use the original quadrature matrix \(\mathbf {Q}\) directly. The intention of using \(\mathbf {Q}_\varDelta \) in the first place was to obtain a preconditioner which allowed inversion using forward substitutions. Now, with diagonalization in place, this is no longer necessary. Instead, we can use
and thus
Note that this is just the lth block of the original composite collocation problem. Following the same ideas as before, we end up with
which can be diagonalized using \(\mathbf {Q}= \mathbf {V}\varvec{\Lambda }\mathbf {V}^{-1}\), where \(\varvec{\Lambda }\) is a diagonal matrix with eigenvalues \(\lambda _i(\mathbf {Q}) \in \mathbb {C}\). The same idea can be applied to the coarse level sweep, of course. As a result, the original nonlinear SDC sweeps within PFASST are now replaced by Quasi-Newton iterations which can be done parallel across the nodes. We note that using simplified or Quasi-Newton methods for solving implicit Runge-Kutta schemes is a standard approach, as e.g. [26] shows. We further refer to [19] for more details on the idea of parallel SDC sweeps with \(\mathbf {Q}\) and \(\mathbf {Q}_\varDelta \).
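When \(\mathbf {Q}\) itself is diagonalized, the eigenvalues are in general complex, so the decoupled solves have to be carried out in complex arithmetic. A minimal sketch for a scalar problem on one timestep, assuming the frozen Jacobian \(\mathbf {I} - \varDelta t f'(u_{l,0})\mathbf {Q}\) and an illustrative right-hand side passed in as `f` and `df` (the function name `collocation_quasi_newton` is hypothetical):

```python
import numpy as np

Q = np.array([[5 / 12, -1 / 12], [3 / 4, 1 / 4]])  # two-node Gauss-Radau matrix

# Q is real but not symmetric: its eigenvalues form a complex conjugate pair.
lam_Q, V = np.linalg.eig(Q)
V_inv = np.linalg.inv(V)


def collocation_quasi_newton(u0, dt, f, df, iters):
    """Quasi-Newton iteration on the collocation problem, diagonalizing Q."""
    M = Q.shape[0]
    v = np.full(M, float(u0))
    jac0 = df(u0)  # frozen derivative at the initial value
    for _ in range(iters):
        r = u0 + dt * Q @ f(v) - v                     # collocation residual
        e_bar = (V_inv @ r) / (1.0 - jac0 * dt * lam_Q)  # independent per node
        # The back-transformed update is real up to round-off.
        v = v + (V @ e_bar).real
    return v


if __name__ == "__main__":
    f = lambda u: -u**2
    df = lambda u: -2.0 * u
    v = collocation_quasi_newton(1.0, 0.1, f, df, iters=30)
    print(lam_Q, v)
```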
The question now is how much the approximation of the Jacobians affects the convergence and runtime of the method and how all this compares to standard PFASST iterations. It is well known that for suitable right-hand sides and initial guesses the standard, unmodified Newton method converges quadratically while the Quasi-Newton method as well as SDC show linear convergence, see e.g. [11, 13, 24]. We will examine the impact of these approaches in the following section along the lines of two numerical examples. A more rigorous mathematical analysis is currently ongoing work, as it can be embedded into a larger convergence theory for PFASST with inner Newton-type solvers.
4 Numerical results
We apply PFASST and PFASSTER to two different, rather challenging reaction-diffusion problems, starting with a detailed analysis of the parallelization strategies for the Allen–Cahn equation and highlighting differences to these findings for the Gray-Scott equations.
4.1 Allen–Cahn equation
We study the twodimensional Allen–Cahn equation, which is given by
on the spatial domain \([-0.5,0.5]^2\) and with initial condition
and periodic boundary conditions. We use simple second-order finite differences for discretization in space and take 256 elements in each dimension on the fine level and 128 on the coarse one. We furthermore use \(M = 4\) Gauß-Radau nodes, set \(\varepsilon = 0.04\), \(\varDelta t = 0.001 < \varepsilon ^2\) and stop the simulation after 24 timesteps at \(T = 0.024\). The initial condition describes a circle with a radius \(R_0 = 0.25\), see e.g. [28].
Note that since our focus is on the temporal parallelization, the temporal resolution was chosen to be quite high in contrast to the spatial resolution. Errors in space and time are not balanced here (and in the following example), with the spatial error being much higher. This has been done deliberately to avoid higher computational costs. By increasing the accuracy in space, we would increase the amount of large parallelizable computations in relation to communication. This would in turn improve the overall parallel efficiency, which would in the end lead to even better scaling results. However, when using parallelization in space, all processors have very few degrees-of-freedom anyway, so our results may even reflect a “real” situation better.
The results we present in the following were computed with pySDC [20, 21] on the supercomputer JURECA [12]. We run a serial single-level simulation using SDC (“SL” in the plots), a serial multilevel simulation using multilevel SDC (“ML”, which is PFASST on one processor, see [23]) and parallel simulations with 2, 4, 8, 12 and 24 processors (“P2” to “P24”), all until a given residual tolerance of \(10^{-10}\) is reached.
If fewer processors than timesteps are used, the time domain is split into blocks of parallel PFASST runs. These are handled sequentially, using the solution of the previous block as the initial data for the next one. For example, 6 processors work on the first 6 timesteps until convergence and the solution is used as new initial condition for the next block of 6 timesteps. This is repeated until all 24 timesteps have been completed.
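The block splitting can be written down in a few lines. This is only an illustrative helper (the name `time_blocks` is hypothetical and not part of pySDC); each inner list holds the timestep indices that are treated in parallel, and the blocks are processed one after another.

```python
def time_blocks(num_steps, num_procs):
    """Split timestep indices into consecutive blocks of at most num_procs steps."""
    return [
        list(range(start, min(start + num_procs, num_steps)))
        for start in range(0, num_steps, num_procs)
    ]


if __name__ == "__main__":
    for block in time_blocks(24, 6):
        print(block)
```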
In Fig. 2 we show the maximum number of linear solves which were performed by the slowest processor (i.e. last processor in time) for different versions of the solvers, aggregated over all its timesteps and quadrature nodes, over all outer and inner iterations.
Here, two versions of the original PFASST algorithm are run: The first one performs exactly one inner Newton iteration in every PFASST iteration; this version is labeled as “PFASST: 1 iter”. In contrast, “PFASST: N iter” performs as many inner Newton iterations as are required for the residual of the nonlinear inner problem to drop below \(10^{-11}\). Both PFASST versions use the quadrature matrix \(\mathbf {Q}_\varDelta ^{LU}\) from Eq. (5) inside the preconditioner. For PFASSTER we also show two variants: The PFASSTER algorithm which uses the original \(\mathbf {Q}\) inside the preconditioner is labeled as “PFASSTER: Q” and the one which uses \(\mathbf {Q}^{LU}_\varDelta \) is labeled as “PFASSTER: \(Q_\varDelta \)”. Solving the innermost linear systems is done using GMRES with a tolerance of \(10^{-12}\) in all cases.
We can see that performing more than one inner Newton iteration (“PFASST: N iter” vs. “PFASST: 1 iter”) does not improve the convergence of the overall algorithm. Although it is possible that by increasing the number of inner Newton iterations the number of outer PFASST iterations decreases, the total effort, which can be measured by the total number of linear solves, increases due to a higher number of inner Newton iterations.
Using the Quasi-Newton approach with the same preconditioner instead of the classical Newton solver (“PFASSTER: \(Q_\varDelta \)” vs. “PFASST: 1 iter”) shows little effect on the total iteration numbers, but using the original quadrature matrix \(\mathbf {Q}\) instead of \(\mathbf {Q}^{LU}_\varDelta \) inside the preconditioner (“PFASSTER: Q” vs. “PFASSTER: \(Q_\varDelta \)”) greatly reduces the number of iterations.
However, without parallelization one iteration of PFASSTER with \(\mathbf {Q}\) is in general more expensive than one iteration of the other algorithms, because it requires the solution of a full system via diagonalization instead of stepping through a triangular system via forward substitution.
In Fig. 3, we thus examine whether the lower number of more expensive iterations actually pays off. The plot shows results for the same setup as Fig. 2, but now we focus on the runtime instead of the iteration numbers. We only consider parallelization across the timesteps to compare the impact of the algorithmic change first. We see that despite the fact that the iterations are more expensive, PFASSTER with \(\mathbf {Q}\) already in this example shows a lower runtime than the original PFASST method. This is also true when using \(\mathbf {Q}_\varDelta \) instead of \(\mathbf {Q}\).
At this point, we have not yet considered the additional direction of concurrency exposed by PFASSTER. For that, we next compare different distributions of up to 24 cores on the 4 quadrature nodes and the 24 timesteps. All divisions of 24 were tested with all possible distributions.
The two plots in Fig. 4 show different combinations of cores used for step-parallelization (x-axis) and for node-parallelization (y-axis) with PFASSTER and \(\mathbf {Q}\). Multiplying the numbers on both axes gives the total number of cores used for this simulation. This is also the reason why there are two plots, because not all combinations are actually possible or meaningful. Within each colored block the total runtime (in seconds, first number) and parallel efficiencies (second number) for this setup are given. We can see that using all available cores for parallelization across the step is by far not the most efficient choice. In turn, more than 4 cores cannot be used for parallelization across the nodes, although 4 gives the best speedup. Indeed, the best combination for this problem is to maximize node-parallelization first and then add step-parallelization (31.3 seconds with 4 cores on the nodes and 6 on the steps, lower picture). This is about 1.8 times faster than using 24 cores for the steps alone and more than 5 times faster than the serial PFASSTER run.
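The space of core distributions can be enumerated with a small helper. This sketch assumes that the cores per direction are restricted to divisors of the number of timesteps and of the number of quadrature nodes, and that parallel efficiency is measured as speedup over a serial run divided by the total core count; both are our reading rather than a statement about the actual experiment setup, and the function names are hypothetical.

```python
def core_distributions(total_cores=24, num_steps=24, num_nodes=4):
    """List (step_cores, node_cores) pairs that fit into total_cores."""
    combos = []
    for node_cores in range(1, num_nodes + 1):
        if num_nodes % node_cores:
            continue  # keep the nodes evenly distributed
        for step_cores in range(1, num_steps + 1):
            if num_steps % step_cores:
                continue  # keep the timesteps evenly distributed
            if step_cores * node_cores <= total_cores:
                combos.append((step_cores, node_cores))
    return combos


def parallel_efficiency(t_serial, t_parallel, cores):
    """Speedup relative to the serial runtime, divided by the core count."""
    return t_serial / (cores * t_parallel)


if __name__ == "__main__":
    for step_cores, node_cores in core_distributions():
        print(step_cores, node_cores, step_cores * node_cores)
```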
Although using \(\mathbf {Q}\) instead of \(\mathbf {Q}_\varDelta \) in PFASSTER is faster for this example, it is quite revealing to repeat the simulations using \(\mathbf {Q}_\varDelta \). These results are shown in Fig. 5 and it is obvious that using as many cores as possible for the parallelization across the nodes now is not the optimal strategy. Here, using 2 cores on the nodes and 12 on the steps is the most efficient combination, albeit still significantly slower than using PFASSTER with \(\mathbf {Q}\), even with the same combination. The reason for this potentially surprising result is that solving the innermost linear systems heavily depends on the structure of these systems, in particular when using an iterative solver like GMRES. Moreover, initial guesses are a crucial factor, too. For PFASSTER, we use the current solution at node zero of the respective timestep as the initial guess. This is particularly suitable for the closest first nodes, but potentially less so for later ones. While both effects did not lead to significant variations in the time spent for solving the linear systems when using \(\mathbf {Q}\), it does produce a severe load imbalance when using \(\mathbf {Q}_\varDelta \). More specifically, using 4 cores for the nodes and only 1 for the timesteps, i.e. exploiting only parallelization across the nodes, the first core takes about 118.2 seconds for all linear system solves together at the first node, while the last core takes about 194.6 seconds on the last node. Therefore, using 2 cores on the nodes, which enables a better load distribution is the ideal choice. One possibility would be that core 1 deals with nodes 1 and 4 and core 2 with 2 and 3, but because node 3 and 4 are very close to each other and the corresponding calculations are almost equally expensive also an alternating distribution is an ideal choice. This is precisely what has been done for Fig. 5, leading to the best speedup with 2 cores on the nodes. 
For other examples, an optimal distribution might be more difficult to find.
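The effect of grouping nodes can be illustrated with a small load-balancing sketch. Only the costs of the first and last node (118.2 and 194.6 seconds) are taken from the measurements above; the costs of nodes 2 and 3 are hypothetical placeholders chosen so that nodes 3 and 4 are almost equally expensive, and the helper names are assumptions of this sketch.

```python
def core_loads(costs, assignment):
    """Sum the per-node costs on each core; assignment[i] is the core of node i."""
    loads = {}
    for cost, core in zip(costs, assignment):
        loads[core] = loads.get(core, 0.0) + cost
    return loads


def makespan(costs, assignment):
    """Time until the slowest core has finished its nodes."""
    return max(core_loads(costs, assignment).values())


def imbalance(costs, assignment):
    """Ratio of the slowest core to the average core (1.0 is perfectly balanced)."""
    loads = core_loads(costs, assignment)
    return max(loads.values()) / (sum(loads.values()) / len(loads))


# Nodes 1 and 4: values reported in the text. Nodes 2 and 3: HYPOTHETICAL.
costs = [118.2, 150.0, 190.0, 194.6]

layouts = {
    "one core per node": [0, 1, 2, 3],
    "cores hold nodes (1,4) and (2,3)": [0, 1, 1, 0],
    "alternating, nodes (1,3) and (2,4)": [0, 1, 0, 1],
}

if __name__ == "__main__":
    for name, assignment in layouts.items():
        print(name, makespan(costs, assignment), imbalance(costs, assignment))
```

With these placeholder costs, both two-core groupings are noticeably better balanced than the one-core-per-node layout, and they differ only slightly from each other.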
In Fig. 6 we now summarize the best results: PFASST with one inner Newton iteration in comparison to PFASSTER using \(\mathbf {Q}_\varDelta \) with 2 cores on the nodes and PFASSTER using \(\mathbf {Q}\) with 4 cores on the nodes. The plot shows the simulation time for each variant as a function of the total number of processors used. We see that PFASSTER always needs much less time for the calculations than PFASST, with another significant gain when using \(\mathbf {Q}\) instead of \(\mathbf {Q}_\varDelta \). Now, since PFASSTER adds another direction of parallelization compared to PFASST, we can not only increase parallel efficiency as shown, but also extend the number of usable cores to obtain a better time-to-solution. This has been done in Fig. 7: taking 48 or 96 cores in total further reduces the computing time for 24 timesteps. With PFASSTER, the number of resources that can be used for parallel-in-time integration is no longer limited by the number of timesteps, but can be increased by a factor given by the number of quadrature nodes.
4.2 Gray–Scott equations
The second example we present here is the Gray–Scott system [16], which is given by
\[ u_t = D_u \Delta u - u v^2 + F (1 - u), \qquad v_t = D_v \Delta v + u v^2 - (F + K) v \]
on the spatial domain \([0,1]\times [0,1]\), with periodic boundary conditions. As initial condition we choose a circle with radius 0.05 centred in the spatial domain, where \(u=0.5\) and \(v=0.25\) on the inside, and \(u=1.0\) and \(v=0\) outside of this circle. We use \(D_u=10^{-4}\), \(D_v=10^{-5}\) and set a feed rate of \(F=0.0367\) and a kill rate of \(K=0.0649\). After some time, this leads to a process similar to cellular division, which is known as “mitosis”. We discretize the spatial domain with 128 points in each dimension on the fine level and with 64 on the coarse one, using standard finite differences. We discretize every timestep of size \(\varDelta t =1\) with 4 quadrature nodes and run the simulation again for 24 timesteps.
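As a rough illustration of this setup, the following NumPy sketch builds the initial condition on the fine grid and evaluates a right-hand side for the Gray–Scott system. Several details are assumptions of the sketch rather than statements about the actual implementation: the reaction terms follow the standard form of the model from [16], the Laplacian uses a second-order five-point stencil, grid points sit at \(x_i = i/n\) so that the periodic endpoint is not duplicated, and all function names are hypothetical.

```python
import numpy as np


def gray_scott_initial(n=128, radius=0.05):
    """Circle of radius 0.05 in the centre: (u, v) = (0.5, 0.25) inside, (1, 0) outside."""
    x = np.arange(n) / n  # periodic grid on [0, 1), assumed point placement
    X, Y = np.meshgrid(x, x, indexing="ij")
    inside = (X - 0.5) ** 2 + (Y - 0.5) ** 2 < radius ** 2
    u = np.where(inside, 0.5, 1.0)
    v = np.where(inside, 0.25, 0.0)
    return u, v


def laplacian_periodic(w, dx):
    """Five-point finite-difference Laplacian with periodic wrap-around."""
    return (
        np.roll(w, 1, axis=0) + np.roll(w, -1, axis=0)
        + np.roll(w, 1, axis=1) + np.roll(w, -1, axis=1)
        - 4.0 * w
    ) / dx ** 2


def gray_scott_rhs(u, v, Du=1e-4, Dv=1e-5, F=0.0367, K=0.0649):
    """Right-hand side of the Gray-Scott system (standard form, assumed here)."""
    dx = 1.0 / u.shape[0]
    uv2 = u * v ** 2
    fu = Du * laplacian_periodic(u, dx) - uv2 + F * (1.0 - u)
    fv = Dv * laplacian_periodic(v, dx) + uv2 - (F + K) * v
    return fu, fv


if __name__ == "__main__":
    u0, v0 = gray_scott_initial()
    fu, fv = gray_scott_rhs(u0, v0)
    print(u0.shape, float(np.abs(fu).max()), float(np.abs(fv).max()))
```

Far away from the circle the state \(u=1\), \(v=0\) is stationary, so the right-hand side vanishes there and all dynamics start at the interface.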
The results are similar to the ones for the Allen–Cahn equation in the previous section. We will omit the case of PFASST with more than one inner Newton iteration, though.
We start again by looking at the total number of linear solves the different algorithms need to perform. Figure 8 shows the number of linear solves for the methods, which run until a residual tolerance of \(10^{-12}\) is reached. The results look quite similar to the ones for the previous example, with one critical difference: the gap between the \(\mathbf {Q}\)-variant of PFASSTER and the other algorithms shrinks more rapidly the more parallel timesteps are used. There is no obvious explanation (at least, obvious to us) for this behavior, though. The more timesteps are approximated simultaneously, the less suitable \(u_0\) is as initial value for more distant timesteps. Although the full Newton and the quasi-Newton methods differ by an order of convergence in theory, in our scenario this seems relevant only for good initial values. One can therefore expect that the runtime of PFASSTER with \(\mathbf {Q}\) will increase with the number of parallel timesteps, while it stayed about the same in the case of the Allen–Cahn example.
This is precisely what we can see in Fig. 9. The more parallel timesteps are run, the less efficient PFASSTER with \(\mathbf {Q}\) in this variant becomes. Already at 3 parallel steps, it is as costly as the original PFASST version, at least when parallelization across the nodes is not considered.
Now, adding node-parallelization, the findings are again similar to the ones in the previous section: Figure 10 shows that PFASSTER with \(\mathbf {Q}\) is still more efficient than PFASST. In particular, using more cores on the nodes is better, and the best combination is again 4 cores on the nodes and 6 on the steps. Again, this changes when considering PFASSTER with \(\mathbf {Q}_\varDelta \) as in Fig. 11, where the ideal setup uses only 2 cores on the nodes, but 12 on the steps. This is again due to load imbalances of the innermost linear solves. However, note the key difference to the previous results: the fastest run of the \(\mathbf {Q}_\varDelta \)-variant is now faster than the one of the \(\mathbf {Q}\)-variant.
In Fig. 12 we now give an overview of the best results: if we use parallelism across the nodes in a suitable way, both PFASSTER versions are more efficient in terms of simulation time than the classical PFASST algorithm. Both can be used to extend the scaling capabilities beyond the number of timesteps, and both scale rather well in this regime. Note, however, that the \(\mathbf {Q}_\varDelta \)-variant can here only leverage \(2\times 24\) cores. It is then faster than the \(\mathbf {Q}\)-variant with twice as many cores.
5 Conclusion and outlook
Today’s supercomputers are designed with an ever increasing number of processors. Therefore, we need our software and the underlying numerical algorithms to handle this increasing degree of parallelism. Time-parallel integrators are one promising research direction, with quite a number of different approaches. Some approaches parallelize each individual timestep and others act on multiple timesteps simultaneously. In this paper we have introduced a solver that works in parallel across the method as well as across the steps. More precisely, we combine node-parallel spectral deferred corrections with the parallel full approximation scheme in space and time. While PFASST allows one to compute multiple timesteps simultaneously and targets large-scale parallelism in time, the new version presented here, called PFASSTER, extends this idea with an efficient small-scale parallelization for every single timestep itself. The scaling studies show that a combination of both concepts seems to be the most efficient way to solve time-dependent PDEs. Here we tested two different preconditioners: one using the traditional, triangular quadrature matrix \(\mathbf {Q}_\varDelta \), generated by an LU decomposition, and one using the original matrix \(\mathbf {Q}\). Both can be diagonalized and used as parallel-across-the-node preconditioners. For the \(\mathbf {Q}_\varDelta \)-preconditioner, we saw load imbalances when using an inner iterative linear solver, but by grouping nodes we can still speed up the simulation beyond the number of parallel timesteps. For the \(\mathbf {Q}\)-preconditioner, the overall number of iterations was lower and time-to-solution was faster. Adding node-parallelization, parallel efficiency can be increased and speedup extended when compared to PFASST. Both PFASSTER versions lead in the end to better scaling results than the classical PFASST algorithm.
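To make the two preconditioners more tangible, the following NumPy sketch constructs both matrices for 4 quadrature nodes and checks that each can be diagonalized. It rests on assumptions that are not stated in this section: Radau-right nodes are used, \(\mathbf {Q}_\varDelta \) is obtained as \(\mathbf {U}^T\) from an LU decomposition \(\mathbf {Q}^T = \mathbf {L}\mathbf {U}\) without pivoting (one common way to generate such a matrix), and all function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.legendre import Legendre


def radau_right_nodes(m):
    """Radau nodes including the right endpoint, mapped from [-1, 1] to [0, 1]."""
    roots = (Legendre.basis(m - 1) - Legendre.basis(m)).roots()
    return np.sort((np.real(roots) + 1.0) / 2.0)


def collocation_matrix(nodes):
    """Q[i, j] = integral from 0 to nodes[i] of the j-th Lagrange polynomial."""
    m = len(nodes)
    Q = np.zeros((m, m))
    for j in range(m):
        lagrange = Polynomial.fromroots(np.delete(nodes, j))
        lagrange = lagrange / float(lagrange(nodes[j]))
        Q[:, j] = lagrange.integ()(nodes)
    return Q


def lu_based_q_delta(Q):
    """Return U^T, where Q^T = L U is computed without pivoting (assumed variant)."""
    U = Q.T.copy()
    m = U.shape[0]
    for k in range(m - 1):
        for i in range(k + 1, m):
            U[i, k:] -= (U[i, k] / U[k, k]) * U[k, k:]
    return np.triu(U).T


def diagonalize(A):
    """Eigen-decomposition A = V diag(lam) V^{-1}; may be complex-valued."""
    lam, V = np.linalg.eig(A)
    return lam, V


if __name__ == "__main__":
    Q = collocation_matrix(radau_right_nodes(4))
    for name, A in (("Q", Q), ("Q_Delta", lu_based_q_delta(Q))):
        lam, V = diagonalize(A)
        error = np.abs(V @ np.diag(lam) @ np.linalg.inv(V) - A).max()
        print(name, lam, error)
```

Once the Jacobian is frozen within a timestep, such a decomposition decouples the preconditioner into one independent linear system per eigenvalue, which is what allows the nodes to be distributed across cores.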
PFASSTER with \(\mathbf {Q}\) in particular offers an almost equal distribution of work for iterative linear solvers across the individual quadrature nodes of a timestep. This makes the algorithm particularly flexible, as it can be used with any number of quadrature points.
PFASSTER is particularly favorable if an increase in parallelism across the steps would lead to a severe increase in the number of iterations. This could be due to e.g. the type of the equation or the coarsening strategy. During our experiments we saw that it is not clear a priori which combination of node- and step-parallelization is the most efficient one. This could lead to many, potentially irrelevant runs to find the sweet spot. Here, a performance model and a suitable convergence theory are needed to at least narrow down the relevant options. This has to be accompanied by more numerical tests, relating e.g. model parameters with load imbalances, to identify the limits of this approach.
References
Bolten, M., Moser, D., Speck, R.: A multigrid perspective on the parallel full approximation scheme in space and time. Numer. Linear Algebra Appl. 24(6), e2110 (2017). https://doi.org/10.1002/nla.2110
Bolten, M., Moser, D., Speck, R.: Asymptotic convergence of the parallel full approximation scheme in space and time for linear problems. Numer. Linear Algebra Appl. 25(6), e2208 (2018). https://doi.org/10.1002/nla.2208
Burrage, K.: Parallel methods for ODEs. Adv. Comput. Math. 7, 1–3 (1997)
Christlieb, A.J., Macdonald, C.B., Ong, B.W.: Parallel high-order integrators. SIAM J. Sci. Comput. 32(2), 818–835 (2010)
Clarke, A.T., Davies, C.J., Ruprecht, D., Tobias, S.M.: Parallel-in-time integration of kinematic dynamos (2019). arXiv:1902.00387 [physics.comp-ph]
Emmett, M., Minion, M.L.: Toward an efficient parallel in time method for partial differential equations. Commun. Appl. Math. Comput. Sci. 7, 105–132 (2012)
Falgout, R.D., Friedhoff, S., Kolev, T.V., MacLachlan, S.P., Schroder, J.B., Vandewalle, S.: Multigrid methods with space–time concurrency. Comput. Vis. Sci. 18(4–5), 123–143 (2017)
Gander, M.J.: 50 years of time parallel time integration. In: Multiple Shooting and Time Domain Decomposition. Springer (2015). https://doi.org/10.1007/978-3-319-23321-5_3
Gander, M.J., Halpern, L., Ryan, J., Tran, T.T.B.: A direct solver for time parallelization. In: Dickopf, T., Gander, M.J., Halpern, L., Krause, R., Pavarino, L.F. (eds.) Domain Decomposition Methods in Science and Engineering XXII, pp. 491–499. Springer (2016). https://doi.org/10.1007/978-3-319-18827-0_50
Huang, J., Jia, J., Minion, M.: Accelerating the convergence of spectral deferred correction methods. J. Comput. Phys. 214(2), 633–656 (2006)
Jackson, K.R., Kværnø, A., Nørsett, S.P.: The use of Butcher series in the analysis of Newton-like iterations in Runge–Kutta formulas. Appl. Numer. Math. 15(3), 341–356 (1994)
Jülich Supercomputing Centre: JURECA: General-purpose supercomputer at Jülich Supercomputing Centre. J. Large-Scale Res. Facil. 2(A62) (2016). https://doi.org/10.17815/jlsrf-2-121
Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Number 16 in Frontiers in Applied Mathematics. SIAM (1995)
Koehler, F.: PFASST TikZ. https://github.com/Parallel-in-Time/pfasst-tikz (2015)
Lions, J.L., Maday, Y., Turinici, G.: A “parareal” in time discretization of PDE’s. Comptes Rendus de l’Académie des Sciences, Series I: Mathematics. https://doi.org/10.1016/S0764-4442(00)01793-6
Pearson, J.E.: Complex patterns in a simple system. Science 261(5118), 189–192 (1993)
Ruprecht, D., Speck, R.: Spectral deferred corrections with fast-wave slow-wave splitting. SIAM J. Sci. Comput. 38(4), A2535–A2557 (2016)
Ruprecht, D., Speck, R., Emmett, M., Bolten, M., Krause, R.: Poster: extreme-scale space-time parallelism. In: Proceedings of the 2013 Conference on High Performance Computing Networking, Storage and Analysis Companion, SC ’13 Companion (2013). http://sc13.supercomputing.org/sites/default/files/PostersArchive/tech_posters/post148s2file3.pdf
Speck, R.: Parallelizing spectral deferred corrections across the method. Comput. Vis. Sci. 19(3–4), 75–83 (2018). https://doi.org/10.1007/s00791-018-0298-x
Speck, R.: Algorithm 997: pySDC - prototyping spectral deferred corrections. ACM Trans. Math. Softw. (2019). https://doi.org/10.1145/3310410
Speck, R.: Website for pySDC (2019). https://parallel-in-time.org/pySDC/. Accessed 27 November 2019
Speck, R., Ruprecht, D., Krause, R., Emmett, M., Minion, M., Winkel, M., Gibbon, P.: A massively space–time parallel N-body solver. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp. 92:1–92:11. IEEE Computer Society Press, Los Alamitos, CA, USA (2012). ISBN 978-1-4673-0804-5. http://dl.acm.org/citation.cfm?id=2388996.2389121
Speck, R., Ruprecht, D., Emmett, M., Minion, M.L., Bolten, M., Krause, R.: A multi-level spectral deferred correction method. BIT Numer. Math. 55, 843–867 (2015). https://doi.org/10.1007/s10543-014-0517-x
Tang, T., Xie, H., Yin, X.: High-order convergence of spectral deferred correction methods on general quadrature nodes. J. Sci. Comput. 56(1), 1–13 (2013)
Trottenberg, U., Oosterlee, C., Schuller, A.: Multigrid. Academic Press, London (2000)
Wanner, G., Hairer, E.: Solving Ordinary Differential Equations II. Springer, Berlin (1996)
Weiser, M.: Faster SDC convergence on non-equidistant grids by DIRK sweeps. BIT Numer. Math. 55(4), 1219–1241 (2014)
Zhang, J., Du, Q.: Numerical studies of discrete approximations to the Allen–Cahn equation in the sharp interface limit. SIAM J. Sci. Comput. 31(4), 3042–3063 (2009). https://doi.org/10.1137/080738398
Acknowledgements
The authors thankfully acknowledge the financial support by the German Federal Ministry of Education and Research through the ParaPhase project within the framework “IKT 2020 - Forschung für Innovationen” (Project Number 01IH15005A).
Funding
Open Access funding enabled and organized by Projekt DEAL.
Additional information
Communicated by Sebastian Schöps.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Schöbel, R., Speck, R. PFASSTER: combining the parallel full approximation scheme in space and time with parallelization across the method. Comput. Visual Sci. 23, 12 (2020). https://doi.org/10.1007/s00791-020-00330-5