1 Introduction

The efficient use of modern high-performance computing systems for solving space–time-dependent differential equations has become one of the key challenges in computational science. Exploiting the exponentially growing number of processors with traditional techniques for spatial parallelism becomes problematic when, for example, communication costs start to dominate for a fixed problem size. Parallel-in-time integration methods have recently been shown to provide a promising way to extend these scaling limits, see e.g. [7, 18, 22] to name but a few examples.

As one example, the “Parallel Full Approximation Scheme in Space and Time” (PFASST) by Emmett and Minion [6] allows one to integrate multiple time-steps simultaneously by using inner iterations of spectral deferred corrections (SDC) on a space–time hierarchy. It works on the so-called composite collocation problem, where each time-step includes a further discretization through quadrature nodes. This “parallelization across the steps” approach [3] targets large-scale parallelization on top of saturated spatial parallelization of partial differential equations (PDEs), where parallelization in the temporal domain acts as a multiplier for standard parallelization techniques in space. In contrast, “parallelization across the method” approaches [3] try to parallelize the integration within an individual time-step. While this typically results in smaller-scale parallelization in the time-domain, parallel efficiency and applicability of these methods are often more favorable. Most notably, the “revisionist integral deferred correction method” (RIDC) by Christlieb et al. [4] makes use of integral deferred corrections (which are indeed closely related to SDC) in order to compute multiple iterations in a pipelined way. In [19], different approaches for parallelizing SDC across the method have been discussed, allowing the simultaneous computation of updates on multiple quadrature nodes. A much more structured and complete overview of parallel-in-time integration approaches can be found in [8]. The Parallel-in-Time community website (https://parallel-in-time.org) offers a comprehensive list of references.

The key goal of parallel-in-time integrators is to expose additional parallelism in the temporal domain in cases where classical strategies like parallelism in space are either already saturated or not even possible. In [5] the classical Parareal method [15] is used to overcome the scaling limit of a space-parallel simulation of a kinematic dynamo on up to 1600 cores. The multigrid extension of Parareal, the “multigrid reduction in time” method (MGRIT), has been shown to provide significant speedup beyond spatial parallelization [7] for a multitude of problems. Using PFASST, a space-parallel N-body solver has been extended in [22] to run on up to 262,144 cores, while in [18] it has been coupled to a space-parallel multigrid solver on up to 458,752 cores.

So far, parallel-in-time methods have been implemented and tested either without any additional parallelization techniques or in combination with spatial parallelism. The goal of this work is to couple two different parallel-in-time strategies in order to extend the overall temporal parallelism exposed by the resulting integrator. To this end, we take the diagonalization idea for SDC presented in [19] (parallel across the method) and use it within PFASST (parallel across the steps). In this way we create an algorithm that computes approximations for different time-steps simultaneously but also works in parallel on each time-step itself. In doing so, we combine the advantages of both parallelization techniques and create the “Parallel Full Approximation Scheme in Space and Time with Enhanced concuRrency” (PFASST-ER), an unprecedented doubly time-parallel integrator for PDEs. In the next section we will first introduce SDC and PFASST from an algebraic point of view, following [1, 2]. We particularly focus on nonlinear problems and briefly explain the application of a Newton solver within PFASST. This Newton solver is then modified in Sect. 3 so that, by using a diagonalization approach, the resulting Quasi-Newton method can be computed in parallel across the quadrature nodes of each time-step. In Sect. 4, we compare different variants of this idea to the classical PFASST implementation using two nonlinear reaction-diffusion test cases. We show parallel runtimes for different setups and evaluate the impact of the various Newton and diagonalization strategies. Section 5 concludes this work with a short summary and an outlook.

2 Parallelization across the steps with PFASST

We focus on an initial value problem

$$\begin{aligned} u_t = f(u),\quad u(0) = u_0 \end{aligned}$$
(1)

with \(u(t), u_0, f(u) \in \mathbb {R}\). In order to keep the notation simple, we do not consider systems of initial value problems for now, where \(u(t) \in \mathbb {R}^N\). Necessary modifications will be mentioned where needed. In a first step, we now discretize this problem in time and review the idea of single-step, time-serial spectral deferred corrections (SDC).

2.1 Spectral deferred corrections

For one time-step on the interval \([t_l,t_{l+1}]\) the Picard formulation of Eq. (1) is given by

$$\begin{aligned} u(t) = u_{l,0} + \int _{t_l}^t f(u(s))ds,\ t\in [t_l,t_{l+1}]. \end{aligned}$$
(2)

To approximate the integral we use a spectral quadrature rule. We define M quadrature nodes \(\tau _{l,1},\ldots ,\tau _{l,M}\), which are given by \(t_l \le \tau _{l,1}< \cdots < \tau _{l,M} = t_{l+1}\). We will in the following explicitly exploit the condition that the last node is equal to the right integral boundary. Quadrature rules like Gauß-Radau or Gauß-Lobatto quadrature satisfy this property. We can then approximate the integrals from \(t_l\) to the nodes \(\tau _{l,m}\), such that

$$\begin{aligned} u_{l,m} = u_{l,0} + \varDelta t \sum _{j=1}^Mq_{m,j}f(u_{l,j}), \end{aligned}$$

where \(u_{l,m} \approx u(\tau _{l,m})\), \(\varDelta t= t_{l+1}-t_l\) and \(q_{m,j}\) represent the quadrature weights for the interval \([t_l,\tau _{l,m}]\) such that

$$\begin{aligned} \sum _{j=1}^Mq_{m,j}f(u_{l,j})\approx \int _{t_l}^{\tau _{l,m}}f(u(s))ds. \end{aligned}$$

We combine these M equations into one system

$$\begin{aligned} \left( \mathbf {I} - \varDelta t\mathbf {Q}\varvec{f} \right) (\varvec{u}_l) = \varvec{u}_{l,0}, \end{aligned}$$
(3)

which we call the “collocation problem”. Here, \(\varvec{u}_l = (u_{l,1}, \ldots , u_{l,M})^T \approx (u(\tau _{l,1}), \ldots , u(\tau _{l,M}))^T\in \mathbb {R}^M\), \(\varvec{u}_{l,0} = (u_{l,0}, \ldots , u_{l,0})^T\in \mathbb {R}^M\), \(\mathbf {Q}= (q_{ij})_{i,j}\in \mathbb {R}^{M\times M}\) is the matrix gathering the quadrature weights and the vector function \(\varvec{f}:\mathbb {R}^M \rightarrow \mathbb {R}^M\) is given by

$$\begin{aligned} \varvec{f}(\varvec{u}_l) = (f(u_{l,1}), \ldots , f(u_{l,M}))^T. \end{aligned}$$

To simplify the notation we define

$$\begin{aligned} \mathbf {C}^{{\text {coll}}}_{\varvec{f}}(\varvec{u}_l) := \left( \mathbf {I} - \varDelta t\mathbf {Q}\varvec{f} \right) (\varvec{u}_l) . \end{aligned}$$

We note that for \(u(t) \in \mathbb {R}^N\), we need to replace \(\mathbf {Q}\) by \(\mathbf {Q}\otimes \mathbf {I}_N\), where \(\otimes \) denotes the Kronecker product.
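
To make the construction of \(\mathbf {Q}\) concrete, the following Python sketch assembles the quadrature matrix for a given set of nodes scaled to the unit interval by integrating the Lagrange basis polynomials. It is a minimal illustration only, not the pySDC implementation; the helper name collocation_matrix and the hard-coded Radau IIA nodes for \(M=3\) are chosen just for this example.

```python
import numpy as np

def collocation_matrix(nodes):
    """Quadrature matrix Q for nodes scaled to the unit interval [0, 1]:
    q[m, j] = integral of the j-th Lagrange basis polynomial from 0 to nodes[m],
    so that u_m ~ u_0 + dt * sum_j q[m, j] * f(u_j)."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        # Lagrange basis polynomial ell_j with ell_j(nodes[i]) = delta_ij
        ell = np.poly1d([1.0])
        for i in range(M):
            if i != j:
                ell *= np.poly1d([1.0, -nodes[i]]) / (nodes[j] - nodes[i])
        ell_int = ell.integ()  # antiderivative, zero at the left boundary
        for m in range(M):
            Q[m, j] = ell_int(nodes[m])
    return Q

# Example: the three Radau IIA (Gauss-Radau) nodes on [0, 1]; the last node
# coincides with the right boundary, as required above.
nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
Q = collocation_matrix(nodes)
```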

System (3) is dense and a direct solution is not advisable, in particular if \(\varvec{f}\) is a nonlinear operator. The spectral deferred correction method solves the collocation problem in an iterative way. While it has been derived originally from classical deferred or defect correction strategies, we here follow [10, 17, 27] and present SDC as a preconditioned Picard iteration. A standard Picard iteration is given by

$$\begin{aligned} \varvec{u}^{k+1}_l = \varvec{u}^{k}_l + (\varvec{u}_{l,0} - \mathbf {C}^{{\text {coll}}}_{\varvec{f}} (\varvec{u}^k_l)) \end{aligned}$$

for \(k = 0, \ldots , K\), and some initial guess \(\varvec{u}^{0}_l\).

In order to increase the range and speed of convergence, we now precondition this iteration. The standard approach to preconditioning is to define an operator \(\mathbf {P}^{{\text {sdc}}}_{\varvec{f}}\), which is easy to invert but also close to the operator of the system. We define this “SDC preconditioner” as

$$\begin{aligned} \mathbf {P}^{{\text {sdc}}}_{\varvec{f}}(\varvec{u}_l) := \left( \mathbf {I} - \varDelta t\mathbf {Q}_\varDelta \varvec{f} \right) (\varvec{u}_l) \end{aligned}$$

so that the preconditioned Picard iteration reads

$$\begin{aligned} \mathbf {P}^{{\text {sdc}}}_{\varvec{f}}(\varvec{u}_l^{k+1}) = (\mathbf {P}^{{\text {sdc}}}_{\varvec{f}} - \mathbf {C}^{{\text {coll}}}_{\varvec{f}})(\varvec{u}_l^k) + \varvec{u}_{l,0} . \end{aligned}$$
(4)

The key for defining \(\mathbf {P}^{{\text {sdc}}}_{\varvec{f}}\) is the choice of the matrix \(\mathbf {Q}_\varDelta \). The idea is to choose a “simpler” quadrature rule to generate a triangular matrix \(\mathbf {Q}_\varDelta \) such that solving System (4) can be done by forward substitution. Common choices include the implicit Euler method or the so-called “LU-trick”, where the LU decomposition of \(\mathbf {Q}^T\) with

$$\begin{aligned} \mathbf {Q}_\varDelta ^{\mathrm {LU}} = \mathbf {U}^{T}\quad \text {for}\quad \mathbf {Q}^T = \mathbf {L}\mathbf {U} \end{aligned}$$
(5)

is used [27].
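
As a brief sketch of the LU trick in Eq. (5), and assuming the matrix \(\mathbf {Q}\) from the snippet above, \(\mathbf {Q}_\varDelta ^{\mathrm {LU}}\) can be obtained with SciPy's LU decomposition; this is an illustration only and not the pySDC code.

```python
import numpy as np
import scipy.linalg

# Q: quadrature matrix from the previous sketch (collocation_matrix).
_, _, U = scipy.linalg.lu(Q.T)    # Q^T = P L U (SciPy uses partial pivoting)
QDelta_LU = U.T                   # lower triangular by construction, cf. Eq. (5)
assert np.allclose(QDelta_LU, np.tril(QDelta_LU))
```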

System (4) establishes the method of spectral deferred corrections, which can be used to approximate the solution of the collocation problem on a single time-step. In the next step, we will couple multiple collocation problems and use SDC to explain the idea of the parallel full approximation scheme in space and time.
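
Putting the pieces of this subsection together, the following sketch performs the preconditioned Picard iteration (4) for the scalar linear test problem \(u_t = \lambda u\), using forward substitution with a lower triangular \(\mathbf {Q}_\varDelta \). Names and structure are illustrative assumptions, not library code.

```python
import numpy as np

def sdc_iterations(u0, dt, lam, Q, QDelta, K):
    """K sweeps of Eq. (4) for u_t = lam*u: solve
    (I - dt*lam*QDelta) u^{k+1} = dt*lam*(Q - QDelta) u^k + u_0
    node by node via forward substitution."""
    M = Q.shape[0]
    u = u0 * np.ones(M)                     # initial guess: spread u0 to all nodes
    for _ in range(K):
        rhs = dt * lam * (Q - QDelta) @ u + u0 * np.ones(M)
        u_new = np.zeros(M)
        for m in range(M):                  # forward substitution
            s = rhs[m] + dt * lam * QDelta[m, :m] @ u_new[:m]
            u_new[m] = s / (1.0 - dt * lam * QDelta[m, m])
        u = u_new
    return u
```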

2.2 Parallel full approximation scheme in space and time

The idea of PFASST is to solve a “composite collocation problem” for multiple time-steps at once using multigrid techniques and SDC for each step in parallel. This composite collocation problem for L time-steps can be written as

$$\begin{aligned} \begin{pmatrix} \mathbf {C}^{{\text {coll}}}_{\varvec{f}} \\ -\mathbf {H} &{} \mathbf {C}^{{\text {coll}}}_{\varvec{f}} \\ &{} \ddots &{} \ddots \\ &{} &{} -\mathbf {H} &{} \mathbf {C}^{{\text {coll}}}_{\varvec{f}} \end{pmatrix} \begin{pmatrix} \varvec{u}_{1}\\ \varvec{u}_{2}\\ \vdots \\ \varvec{u}_{L} \end{pmatrix} =\begin{pmatrix} \varvec{u}_{0,0} \\ \varvec{0}\\ \vdots \\ \varvec{0} \end{pmatrix}, \end{aligned}$$

where the matrix \(\mathbf {H}\in \mathbb {R}^{M\times M}\) on the lower subdiagonal transfers the information from one time-step to the next one. It takes the value of the last node \(\tau _{l,M}\) of an interval \([t_l, t_{l+1}]\), which is by requirement equal to the left boundary \(t_{l+1}\) of the following interval \([t_{l+1}, t_{l+2}]\), and provides it as a new starting value for this interval. Therefore, the matrix \(\mathbf {H}\) contains the value 1 on every position in the last column and zeros elsewhere. To write the composite collocation problem in a more compact form we define the vector \(\varvec{u} = (\varvec{u}_{1}, \ldots , \varvec{u}_{L})^T\in \mathbb {R}^{LM}\), which contains the solution at all quadrature nodes at all time-steps, and the vector \(\varvec{b} = (\varvec{u}_{0,0}, \varvec{0}, \ldots , \varvec{0})^T\in \mathbb {R}^{LM}\), which contains the initial condition for all nodes at the first interval and zeros elsewhere. We define \({\varvec{F}: \mathbb {R}^{LM} \rightarrow \mathbb {R}^{LM}}\) as an extension of \(\varvec{f}\) so that \({\varvec{F}} ({\varvec{u}}) = \left( {\varvec{f}} ({\varvec{u}}_{1}), \ldots , {\varvec{f}} ({\varvec{u}}_{L}) \right) ^T\). Then, the composite collocation problem can be written as

$$\begin{aligned} \mathbf {C}_{\varvec{F}}(\varvec{u}) = \varvec{b}. \end{aligned}$$
(6)

with

$$\begin{aligned} \mathbf {C}_{\varvec{F}}(\varvec{u}) = \left( \mathbf {I} - \varDelta t(\mathbf {I}_L\otimes \mathbf {Q} )\varvec{F} - \mathbf {E}\otimes \mathbf {H}\right) (\varvec{u}), \end{aligned}$$

where the matrix \(\mathbf {E}\in \mathbb {R}^{L\times L}\) just has ones on the first subdiagonal and zeros elsewhere. If \(u \in \mathbb {R}^N\), we need to replace \(\mathbf {H}\) by \(\mathbf {H}\otimes \mathbf {I}_N\).
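
For the scalar linear test problem \(u_t = \lambda u\), the composite collocation operator can be assembled explicitly with Kronecker products, which makes the block structure above concrete. The following sketch (with assumed names, reusing \(\mathbf {Q}\) from the earlier snippet) is for illustration only; for nonlinear \(f\) the operator is never formed explicitly.

```python
import numpy as np

def composite_collocation_matrix(Q, L, dt, lam):
    """C = I - dt*lam*(I_L kron Q) - (E kron H) for u_t = lam*u, cf. Eq. (6)."""
    M = Q.shape[0]
    E = np.diag(np.ones(L - 1), k=-1)   # ones on the first subdiagonal
    H = np.zeros((M, M))
    H[:, -1] = 1.0                      # copy the last node value to all nodes
    return (np.eye(L * M)
            - dt * lam * np.kron(np.eye(L), Q)
            - np.kron(E, H))

# Usage: b carries the initial value only in the first block; solving C u = b
# gives the collocation solution at all nodes of all L time-steps.
L_steps, dt, lam, u0 = 4, 0.1, -1.0, 1.0
C = composite_collocation_matrix(Q, L_steps, dt, lam)
b = np.concatenate([u0 * np.ones(Q.shape[0]), np.zeros((L_steps - 1) * Q.shape[0])])
u = np.linalg.solve(C, b)
```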

SDC can be used to solve the composite collocation problem by forward substitution in a sequential way, i.e., by solving one time-step after the other, using the previous solution as the initial value of the current time-step. The parallel-in-time integrator PFASST, on the other hand, solves the composite collocation problem by working on all time-steps simultaneously and is therefore an attractive alternative. The first step from SDC towards PFASST is the introduction of multiple levels, which are representations of the problem with different accuracies in space and time. In order to simplify the notation we focus on a two-level scheme consisting of a fine and a coarse level. Coarsening can be achieved, for example, by reducing the resolution in space, by decreasing the number of quadrature nodes on each interval or by solving implicit systems less accurately. Coarsening by reducing the number of quadrature nodes, however, is not attractive for our goal of parallelizing across the nodes: the coarse level would no longer keep all node-processors busy, while individual processors would have to communicate larger amounts of data. For this work, we only consider coarsening in space, i.e., by using a restriction operator R on a vector \(u\in \mathbb {R}^{N}\) we obtain a new vector \(\tilde{u}\in \mathbb {R}^{\tilde{N}}\). Vice versa, the interpolation operator T is used to interpolate values from \(\tilde{u}\) to u. Operators, vectors and numbers on the coarse level will be denoted by a tilde to avoid further index cluttering. Thus, the composite collocation operator on the coarse level is given by \( \tilde{\mathbf {C}}_{\varvec{F}}\). While \(\mathbf {C}_{\varvec{F}}\) is defined on \(\mathbb {R}^{LMN}\), \(\tilde{\mathbf {C}}_{\varvec{F}}\) acts on \(\mathbb {R}^{L M \tilde{N}}\) with \(\tilde{N} \le N\), but as before we will neglect the space dimension in the following notation. The extension of the spatial transfer operators to the full space–time domain is given by \(\mathbf {R} = \mathbf {I}_{LM} \otimes R\) and \(\mathbf {T} = \mathbf {I}_{LM} \otimes T\).

The main goal of the introduction of a coarse level is to move the serial part of the computation to this hopefully cheaper level, while being able to run the expensive part in parallel. For that, we define two preconditioners: a serial one with a lower subdiagonal for the coarse level and a parallel, block-diagonal one for the fine level. The serial preconditioner for the coarse level is defined by

$$\begin{aligned} \tilde{\mathbf {P}}_{\varvec{F}} = \begin{pmatrix} \tilde{\mathbf {P}}_{\varvec{f}}^{{\text {sdc}}} \\ -\tilde{\mathbf {H}} &{} \tilde{\mathbf {P}}_{\varvec{f}}^{{\text {sdc}}} \\ &{} \ddots &{} \ddots \\ &{} &{} -\tilde{\mathbf {H}} &{} \tilde{\mathbf {P}}_{\varvec{f}}^{{\text {sdc}}} \\ \end{pmatrix}, \end{aligned}$$

or, in a more compact way, by

$$\begin{aligned}&\tilde{\mathbf {P}}_{\varvec{F}}(\tilde{\varvec{u}}) = \left( \tilde{\mathbf {I}} - \varDelta t(\mathbf {I}_L \otimes \tilde{\mathbf {Q}}_\varDelta )\tilde{\varvec{F}} - \mathbf {E}\otimes \tilde{\mathbf {H}} \right) (\tilde{\varvec{u}}). \end{aligned}$$

Inverting this corresponds to a single inner iteration of SDC (a “sweep”) on step 1, then sending forward the result to step 2, an SDC sweep there and so on. The parallel preconditioner on the fine level then simply reads

$$\begin{aligned}&\mathbf {P}_{\varvec{F}}(\varvec{u}) = (\mathbf {I} - \varDelta t(\mathbf {I}_L\otimes \mathbf {Q}_\varDelta ) \varvec{F}) (\varvec{u}). \end{aligned}$$

Applying \(\mathbf {P}_{\varvec{F}}\) on the fine level leads to L decoupled SDC sweeps, which can be run in parallel.

For PFASST, these two preconditioners and the levels they work on are coupled using a full approximation scheme (FAS) known from nonlinear multigrid theory [25]. Following [1] one iteration of PFASST can then be formulated in four steps:

  1.

    the computation of the FAS correction \({\tau }^k\), including the restriction of the fine value to the coarse level

    $$\begin{aligned} {\tau }^k =\tilde{\mathbf {C}}_{\varvec{F}} (\mathbf {R} {\varvec{u}}^k) - \mathbf {R} \mathbf {C}_{\varvec{F}} ( {\varvec{u}}^k) , \end{aligned}$$
  2.

    the coarse sweep on the modified composite collocation problem on the coarse level

    $$\begin{aligned} \tilde{\mathbf {P}}_{\varvec{F}} (\tilde{\varvec{u}}^{k+1} )&= (\tilde{\mathbf {P}}_{\varvec{F}} - \tilde{\mathbf {C}}_{\varvec{F}})({\tilde{\varvec{u}}}^{k}) + \tilde{\varvec{b}} + \tau ^k, \end{aligned}$$
    (7)
  3.

    the coarse grid correction applied to the fine level value

    $$\begin{aligned} \varvec{u}^{k+\frac{1}{2}}&= \varvec{u}^{k} + \mathbf {T} ( \tilde{\varvec{u}}^{k+1} -\mathbf {R} \varvec{u}^k ), \end{aligned}$$
    (8)
  4.

    the fine sweep on the composite collocation problem on the fine level

    $$\begin{aligned} \mathbf {P}_{\varvec{F}} ( \varvec{u}^{k+1} )&= (\mathbf {P}_{\varvec{F}} - \mathbf {C}_{\varvec{F}})( \varvec{u}^{k+\frac{1}{2}} ) + \varvec{b} . \end{aligned}$$
    (9)
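
These four steps can be summarized in the following schematic (and serial) sketch, in which all operators are passed as callables acting on full space–time vectors. All names are assumptions for illustration, not pySDC calls; in a real parallel run the step-wise communication is hidden inside the two preconditioner solves.

```python
def pfasst_iteration(u, b, b_coarse,
                     C_fine, C_coarse,              # composite collocation operators
                     P_fine, P_coarse,              # preconditioner operators
                     P_fine_solve, P_coarse_solve,  # their (block) inversions
                     restrict, interpolate):
    """One PFASST iteration following steps 1-4 above (schematic sketch only)."""
    # 1. FAS correction
    u_c = restrict(u)
    tau = C_coarse(u_c) - restrict(C_fine(u))
    # 2. coarse sweep, Eq. (7): serial over the time-steps in a parallel run
    u_c_new = P_coarse_solve(P_coarse(u_c) - C_coarse(u_c) + b_coarse + tau)
    # 3. coarse-grid correction, Eq. (8)
    u_half = u + interpolate(u_c_new - u_c)
    # 4. fine sweep, Eq. (9): block-diagonal, hence parallel across the steps
    return P_fine_solve(P_fine(u_half) - C_fine(u_half) + b)
```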
Fig. 1: Schematic view of PFASST on four processors. The figure was created with pfasst-tikz [14]

In Fig. 1, we see a schematic representation of the described steps. The time-step-parallel procedure described here is the same for all PFASST versions we introduce later. It is common to use as many processors as time-steps: in the given illustration, four processors work on four time-steps. The temporal domain is therefore divided into four intervals, which are assigned to the four processors \(P_0, \ldots , P_3\). Every processor performs SDC sweeps on its assigned interval on alternating levels. The big red blocks represent fine sweeps, given by Eq. (9), and the small blue blocks coarse sweeps, given by Eq. (7).

The coarse sweep over all intervals is a serial process: after a processor has finished its coarse sweep, it sends its result forward to the next processor, which uses it as the initial value for its own coarse sweep. This communication is represented in the figure by the small arrows connecting the coarse sweeps of neighboring intervals. In (7), the need for communication with a neighboring process is obvious, because \(\tilde{\mathbf {P}}_{\varvec{F}}\) is not a (block-)diagonal matrix, but has entries on its lower block-diagonal. \(\mathbf {P}_{\varvec{F}}\), on the other hand, is block-diagonal, which means that the processors can compute on the fine level in parallel. We see in (9) that there is only a connection to previous time-steps through the right-hand side, where we gather values from the previous time-step and iteration but not from the current iteration. The figure shows this connection as a fine communication, which forwards data from each fine sweep to the following fine sweep of the right neighbor. The fine and coarse calculations on every processor are connected through the FAS corrections, which in our formulation are part of the coarse sweep.

2.3 PFASST-Newton

For each coarse and each fine sweep within each PFASST iteration, System (7) and System (9), respectively, need to be solved. If f is a nonlinear function these systems are nonlinear as well. The obvious and traditional way to proceed in this case is to linearize the problem locally (i.e. for each time-step, at each quadrature node) using Newton’s method. This way, PFASST is the outer solver with an inner Newton iteration. For triangular \(\mathbf {Q}_\varDelta \), the mth equation on the lth time-step on the coarse level reads

$$\begin{aligned} (1 - \varDelta t\ \tilde{q}^\varDelta _{m,m} \tilde{f}) (\tilde{u}^{k+1}_{l,m}) =&\ \tilde{u}^{k+1}_{l, 0} + \varDelta t\sum _{n=1}^{m-1} \tilde{q}^\varDelta _{m,n} \tilde{f}(\tilde{u}^{k+1}_{l,n}) \\&+ \tilde{\varvec{c}}( \tilde{\varvec{u}}^k)_{l,m}, \end{aligned}$$

where \(\tilde{u}^{k+1}_{0,0} = \tilde{u}_{0,0}\) and \(\tilde{\varvec{c}}(\tilde{\varvec{u}}^k)_{l,m}\) is the mth entry of the lth block of \(\tilde{\varvec{c}}(\tilde{\varvec{u}}^k) := (\tilde{\mathbf {P}}_{\varvec{F}} - \tilde{\mathbf {C}}_{\varvec{F}})({\tilde{\varvec{u}}}^{k}) + \tau ^k.\) This term gathers all values of the previous iteration. The first summand of the right-hand side of the coarse level equation corresponds to \(\tilde{\varvec{b}}\) and \(\tilde{\mathbf {H}}\), while the following sum comes from the lower triangular structure of \(\tilde{\mathbf {Q}}_\varDelta \).

For time-step l these equations can be solved one by one using Newton iterations and forward substitution. This is inherently serial, because the solution on the mth quadrature node depends on the solution at all previous nodes through the sum. Thus, while running parallel across the steps, each solution of the local collocation problem is found in serial. In the next section, we will present a novel way of applying Newton’s method, which allows one to parallelize this part across the collocation nodes, joining parallelization across the step with parallelization across the method.
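
As an illustration of this node-serial procedure, the following sketch performs one such sweep for a scalar problem, with a small Newton iteration per quadrature node; the names and the simple initial guess are assumptions, not pySDC code.

```python
import numpy as np

def sweep_node_serial(u0, c, QDelta, dt, f, df, tol=1e-11, maxiter=20):
    """One node-serial sweep for a scalar problem: forward substitution over the
    M quadrature nodes with a local Newton iteration per node; c holds the
    collected terms of the previous iteration (one entry per node)."""
    M = QDelta.shape[0]
    u_new = np.zeros(M)
    for m in range(M):
        # right-hand side: initial value, already updated nodes, previous iterate
        rhs = u0 + dt * sum(QDelta[m, n] * f(u_new[n]) for n in range(m)) + c[m]
        u = u_new[m - 1] if m > 0 else u0            # simple initial guess
        for _ in range(maxiter):                     # Newton for u - dt*q_mm*f(u) = rhs
            g = u - dt * QDelta[m, m] * f(u) - rhs
            if abs(g) < tol:
                break
            u -= g / (1.0 - dt * QDelta[m, m] * df(u))
        u_new[m] = u
    return u_new
```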

3 PFASST-ER

From the perspective of a single time-step \([t_l, t_{l+1}]\) or processor \(P_l\), Eq. (7) on the coarse level for this step reads

$$\begin{aligned} \tilde{\mathbf {P}}_{\varvec{f}}^{{\text {sdc}}}(\tilde{\varvec{u}}_{l}^{k+1}) - \tilde{\varvec{u}}_{l,0}^{k+1} =&\ (\tilde{\mathbf {P}}_{\varvec{f}}^{{\text {sdc}}} - \tilde{\mathbf {C}}_{\varvec{f}}^{{\text {coll}}})(\tilde{\varvec{u}}_{l}^{k}) + \tau ^k_{l}, \end{aligned}$$

where \(\tau ^k_{l}\) is the lth component of \(\tau ^k\), belonging to the interval \([t_l, t_{l+1}]\). Note that the serial dependency enters only through the term \(\tilde{\varvec{u}}_{l,0}^{k+1}\), which does not depend on the solution \(\tilde{\varvec{u}}_{l}^{k+1}\) of this equation and can thus be treated as part of a given right-hand side. On the fine level, this is even simpler, because there we have to solve

$$\begin{aligned} \mathbf {P}_{\varvec{f}}^{{\text {sdc}}}({\varvec{u}}_{l}^{k+1}) = (\mathbf {P}_{\varvec{f}}^{{\text {sdc}}} - \mathbf {C}_{\varvec{f}}^{{\text {coll}}})(\varvec{u}_{l}^{k+\frac{1}{2}}) + \varvec{u}_{l,0}^{k+\frac{1}{2}} , \end{aligned}$$

where the \(\varvec{u}_{l,0}^{k+\frac{1}{2}}\)-term is independent of the current iteration (which, of course, leads to the parallelism on the fine level).

As we have seen above, the typical strategy would be to solve these systems line by line, node by node, using forward substitution and previous PFASST iterates as initial guesses. An alternative approach has been presented in [19], where each SDC iteration can be parallelized across the nodes. While this is trivial for linear problems, nonlinear ones require the linearization of the full equations, not node-wise as before. For the fine sweep, let

$$\begin{aligned} \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}) :=&\ \mathbf {P}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}) - (\mathbf {P}_{\varvec{f}}^{{\text {sdc}}} - \mathbf {C}_{\varvec{f}}^{{\text {coll}}})(\varvec{u}_{l}^{k+\frac{1}{2}}) -\varvec{u}_{l,0}^{k+\frac{1}{2}} \end{aligned}$$

then a Newton step for \(\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}) = 0\) is given by

$$\begin{aligned} \nabla \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j)\varvec{e}^{j}&= -\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j),\\ \varvec{v}^{j+1}&= \varvec{v}^j + \varvec{e}^{j}, \end{aligned}$$

for the Jacobian matrix \(\nabla \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j)\) of \(\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}\) evaluated at \(\varvec{v}^j\). We have

$$\begin{aligned} \nabla \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j)&= \nabla \mathbf {P}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j) \\&=\mathbf {I} - \varDelta t\mathbf {Q}_\varDelta \nabla \varvec{f}(\varvec{v}^j) \end{aligned}$$

for the Jacobian matrix \(\nabla \varvec{f}(\varvec{v}^j)\) of \(\varvec{f}\) evaluated at \(\varvec{v}^j\), which in turn is given by

$$\begin{aligned} \nabla \varvec{f}(\varvec{v}^j) = \mathrm {diag}(f'(v_1^j), \ldots , f'(v_M^j)). \end{aligned}$$

There is still no parallelism to exploit, but when we replace the full Jacobian matrix \(\nabla \varvec{f}(\varvec{v}^j)\) by the approximation \(f'(v_{l,0})\mathbf {I}_M\), which is the derivative of f at the initial value for the current time-step, we can use

$$\begin{aligned} \nabla \mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j) \approx \nabla \mathbf {G}_{\varvec{f}}^{\varDelta \text {-}\mathrm {QN}}(v_{l,0}) := \mathbf {I} - f'(v_{l,0})\varDelta t\mathbf {Q}_\varDelta \end{aligned}$$

to establish a Quasi-Newton iteration as

$$\begin{aligned} \nabla \mathbf {G}_{\varvec{f}}^{\varDelta \text {-}\mathrm {QN}}(v_{l,0})\varvec{e}^{j}&= -\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^j),\\ \varvec{v}^{j+1}&= \varvec{v}^j + \varvec{e}^{j}. \end{aligned}$$

This decouples the evaluation of the Jacobian matrix from the current quadrature nodes and now \(\mathbf {Q}_\varDelta \) can be diagonalized, so that the inversion of \(\nabla \mathbf {G}_{\varvec{f}}^{\varDelta \text {-}\mathrm {QN}}(v_{l,0})\) can be parallelized across the nodes. Note that there are other options for approximating the full Jacobian matrix. Most notably, in [9] the mean over all Jacobian matrices is used (there across the time-steps). We did not see any impact on the convergence when following this strategy, most likely because the number of quadrature nodes is typically rather low. The advantage of using the initial value is that it reduces the number of evaluations of the Jacobian matrix, which also includes communication time.

Provided that \(\mathbf {Q}_\varDelta \) is diagonalizable, we can decompose it as \(\mathbf {Q}_\varDelta =\mathbf {V}_\varDelta \varvec{\Lambda }_\varDelta \mathbf {V}^{-1}_\varDelta \), where \(\varvec{\Lambda }_\varDelta =\mathrm {diag}((\mathbf {Q}_\varDelta )_{ii})\) contains the eigenvalues \((\mathbf {Q}_\varDelta )_{ii} \in \mathbb {R}\) of the triangular matrix \(\mathbf {Q}_\varDelta \) and \(\mathbf {V}_\varDelta \) contains its eigenvectors.

Using the given diagonalization the algorithm reads:

  1.

    replace \(\varvec{r}^j = -\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^{j})\) by \(\bar{\varvec{r}}^j = -\mathbf {V}_\varDelta ^{-1}\mathbf {G}_{\varvec{f}}^{{\text {sdc}}}(\varvec{v}^{j})\) (serial),

  2.

    solve \(\left( \mathbf {I} - f'(v_{l,0})\varDelta t\varvec{\Lambda }_\varDelta \right) \bar{\varvec{e}}^{j} = \bar{\varvec{r}}^j\) (parallel in M),

  3.

    replace \(\bar{\varvec{e}}^{j}\) by \(\varvec{e}^{j} = \mathbf {V}_\varDelta \bar{\varvec{e}}^{j}\) (serial),

  4.

    set \(\varvec{v}^{j+1} = \varvec{v}^{j} + \varvec{e}^{j}\) (parallel in M).

This can be iterated until a certain threshold is reached; we then set \(\varvec{u}^{k+1}_{l} = \varvec{v}^J\) to obtain the solution of the fine-sweep equation. On the coarse level, the procedure is very similar, with a slightly different definition of \(\tilde{\mathbf {G}}_{\tilde{\varvec{f}}}^{{\text {sdc}}}(\tilde{\varvec{v}})\). In practice, a single Newton iteration (i.e. \(J=1\)) is sufficient, because this is only the inner solver for an outer PFASST iteration. In all cases we have studied so far, using more inner iterations does not lead to a faster overall method.
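
For a scalar problem, one such inner Quasi-Newton update along steps 1–4 can be sketched as follows; the names are illustrative, and in practice the eigendecomposition of \(\mathbf {Q}_\varDelta \) would of course be computed once and reused.

```python
import numpy as np

def quasi_newton_update(v, rhs, QDelta, dt, dfu0, f):
    """One Quasi-Newton update (steps 1-4 above) for the node values v of one step.
    rhs  : full right-hand side of the sweep equation (previous-iteration terms
           plus the initial-value term), one entry per quadrature node
    dfu0 : f'(v_{l,0}), the derivative at the step's initial value"""
    lam, V = np.linalg.eig(QDelta)           # QDelta = V diag(lam) V^{-1}
    Vinv = np.linalg.inv(V)
    G = v - dt * QDelta @ f(v) - rhs         # residual of the sweep equation
    r_bar = -Vinv @ G                        # step 1 (serial)
    e_bar = r_bar / (1.0 - dfu0 * dt * lam)  # step 2 (parallel in M)
    e = V @ e_bar                            # step 3 (serial)
    return v + e                             # step 4 (parallel in M)
```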

This linearization and diagonalization strategy immediately suggests a second approach: instead of using \(\mathbf {Q}_\varDelta \) for the preconditioner, we can use the original quadrature matrix \(\mathbf {Q}\) directly. The intention of using \(\mathbf {Q}_\varDelta \) in the first place was to obtain a preconditioner which allowed inversion using forward substitutions. Now, with diagonalization in place, this is no longer necessary. Instead, we can use

$$\begin{aligned} \mathbf {P}_{\varvec{f}}^{{\text {coll}}} := \mathbf {C}_{\varvec{f}}^{{\text {coll}}} \end{aligned}$$

and thus

$$\begin{aligned} \mathbf {G}_{\varvec{f}}^{{\text {coll}}}(\varvec{v}) := \mathbf {C}_{\varvec{f}}^{{\text {coll}}}(\varvec{v}) - \varvec{u}_{l,0}^{k+\frac{1}{2}}. \end{aligned}$$

Note that this is just the lth block of the original composite collocation problem. Following the same ideas as before, we end up with

$$\begin{aligned} \nabla \mathbf {G}_{\varvec{f}}^{{\text {coll}}}(\varvec{v}^j) \approx \nabla \mathbf {G}_{\varvec{f}}^{\mathrm {QN}}(v_{l,0}) := \mathbf {I} - f'(v_{l,0})\varDelta t\mathbf {Q}, \end{aligned}$$

which can be diagonalized using \(\mathbf {Q}= \mathbf {V}\varvec{\Lambda }\mathbf {V}^{-1}\), where \(\varvec{\Lambda }\) is a diagonal matrix with eigenvalues \(\lambda _i(\mathbf {Q}) \in \mathbb {C}\). The same idea can be applied to the coarse level sweep, of course. As a result, the original nonlinear SDC sweeps within PFASST are now replaced by Quasi-Newton iterations which can be done parallel across the nodes. We note that using simplified or Quasi-Newton methods for solving implicit Runge-Kutta schemes is a standard approach, as e.g. [26] shows. We further refer to [19] for more details on the idea of parallel SDC sweeps with \(\mathbf {Q}\) and \(\mathbf {Q}_\varDelta \).
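
As a small illustration of the difference between the two variants: the eigenvalues of the full quadrature matrix \(\mathbf {Q}\) are complex, so the node-parallel solves of the \(\mathbf {Q}\)-variant run in complex arithmetic. The snippet below (reusing the collocation_matrix helper sketched in Sect. 2.1) shows this for \(M=3\) Radau IIA nodes, which yield one real eigenvalue and a complex-conjugate pair.

```python
import numpy as np

nodes = np.array([(4 - np.sqrt(6)) / 10, (4 + np.sqrt(6)) / 10, 1.0])
Q = collocation_matrix(nodes)   # helper from the sketch in Sect. 2.1
lam, V = np.linalg.eig(Q)       # one real eigenvalue plus a complex pair
# The Quasi-Newton update sketched above works unchanged with complex (lam, V);
# since the underlying linear system is real, the computed correction is real
# up to round-off and its real part is used.
```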

The question now is how much the approximation of the Jacobians affects the convergence and runtime of the method, and how all this compares to standard PFASST iterations. It is well known that for suitable right-hand sides and initial guesses the standard, unmodified Newton method converges quadratically, while the Quasi-Newton method as well as SDC show linear convergence, see e.g. [11, 13, 24]. We will examine the impact of these approaches in the following section using two numerical examples. A more rigorous mathematical analysis is currently ongoing work, as it can be embedded into a larger convergence theory for PFASST with inner Newton-type solvers.

4 Numerical results

We apply PFASST and PFASST-ER to two different, rather challenging reaction-diffusion problems, starting with a detailed analysis of the parallelization strategies for the Allen–Cahn equation and then highlighting how the findings differ for the Gray-Scott equations.

4.1 Allen–Cahn equation

We study the two-dimensional Allen–Cahn equation, which is given by

$$\begin{aligned} u_t = \varDelta u + \frac{1}{\varepsilon ^2} u (1 - u) \end{aligned}$$
(10)

on the spatial domain \([-0.5,0.5]^2\) and with initial condition

$$\begin{aligned} {\begin{matrix} u_0 = \text {tanh}\left( \frac{R_0 -(x^2 + y^2 )}{\sqrt{2} \varepsilon } \right) , \end{matrix}} \end{aligned}$$

and periodic boundary conditions. We use simple second-order finite differences for discretization in space and take 256 elements in each dimension on the fine level and 128 on the coarse one. We furthermore use \(M = 4\) Gauß-Radau nodes, set \(\varepsilon = 0.04\), \(\varDelta t = 0.001 < \varepsilon ^2\) and stop the simulation after 24 time-steps at \(T = 0.024\). The initial condition describes a circle with a radius \(R_0 = 0.25\), see e.g. [28].

Note that since our focus is on the temporal parallelization, the temporal resolution was chosen to be quite high in contrast to the spatial resolution. Errors in space and time are not balanced here (and in the following example), with the spatial error being much higher. This has been done deliberately to avoid higher computational costs. By increasing the accuracy in space, we would increase the amount of large parallelizable computations in relation to communication. This would in turn improve the overall parallel efficiency, which would in the end lead to even better scaling results. However, when using parallelization in space, all processors have very few degrees-of-freedom anyway, so our results may even reflect a “real” situation better.

The results we present in the following were computed with pySDC [20, 21] on the supercomputer JURECA [12]. We run a serial single-level simulation using SDC (“SL” in the plots), a serial multi-level simulation using multi-level SDC (“ML”, which is PFASST on one processor, see [23]) and parallel simulations with 2, 4, 8, 12 and 24 processors (“P2” to “P24”), all until a given residual tolerance of \(10^{-10}\) is reached.

If fewer processors than time-steps are used, the time domain is split into blocks of parallel PFASST runs. These are handled sequentially, using the solution of the previous block as the initial data for the next one. For example, 6 processors work on the first 6 time-steps until convergence, and the solution is then used as the new initial condition for the next block of 6 time-steps. This is repeated until all 24 time-steps have been completed.
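
A schematic sketch of this block-wise processing is given below; run_pfasst_block is an assumed helper standing in for a full parallel PFASST solve of one block, not an actual pySDC function.

```python
def run_in_blocks(u_start, n_steps, n_procs, run_pfasst_block):
    """Process n_steps time-steps in sequential blocks of n_procs parallel steps;
    run_pfasst_block solves one block to tolerance and returns the solution at
    the end of that block, which serves as the initial value for the next one."""
    u = u_start
    for first in range(0, n_steps, n_procs):
        block = range(first, min(first + n_procs, n_steps))
        u = run_pfasst_block(u, block)
    return u
```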

Fig. 2: Number of linear solves for the Allen–Cahn example, all methods run serial on the nodes

In Fig. 2 we show the maximum number of linear solves performed by the slowest processor (i.e. the last processor in time) for different versions of the solvers, aggregated over all of its time-steps and quadrature nodes and over all outer and inner iterations.

Here, two versions of the original PFASST algorithm are run: the first one performs exactly one inner Newton iteration in every PFASST iteration; this version is labeled as “PFASST: 1 iter”. In contrast, “PFASST: N iter” performs as many inner Newton iterations as required so that the residual of the nonlinear inner problem is less than \(10^{-11}\). Both PFASST versions use the quadrature matrix \(\mathbf {Q}_\varDelta ^{LU}\) from Eq. (5) inside the preconditioner. For PFASST-ER we also show two variants: the PFASST-ER algorithm which uses the original \(\mathbf {Q}\) inside the preconditioner is labeled as “PFASST-ER: Q” and the one which uses \(\mathbf {Q}^{LU}_\varDelta \) is labeled as “PFASST-ER: \(Q_\varDelta \)”. Solving the innermost linear systems is done using GMRES with a tolerance of \(10^{-12}\) in all cases.

We can see that performing more than one inner Newton iteration (“PFASST: N iter” vs. “PFASST: 1 iter”) does not improve the convergence of the overall algorithm. Although increasing the number of inner Newton iterations can decrease the number of outer PFASST iterations, the total effort, measured by the total number of linear solves, increases.

Using the Quasi-Newton approach with the same preconditioner instead of the classical Newton solver (“PFASST-ER: \(Q_\varDelta \)” vs. “PFASST: 1 iter”) shows little effect on the total iteration numbers, but using the original quadrature matrix \(\mathbf {Q}\) instead of \(\mathbf {Q}^{LU}_\varDelta \) inside the preconditioner (“PFASST-ER: Q” vs. “PFASST-ER: \(Q_\varDelta \)”) greatly reduces the number of iterations.

However, without parallelization one iteration of PFASST-ER with \(\mathbf {Q}\) is in general more expensive than one iteration of the other algorithms, because it requires the solution of a full system via diagonalization instead of stepping through a triangular system via forward substitution.

Fig. 3: Time to solution for Allen–Cahn with parallelization only across time-steps

In Fig. 3, we thus examine whether the lower number of more expensive iterations actually pays off. The plot shows results for the same setup as Fig. 2, but now we focus on the runtime instead of the iteration numbers. We only consider parallelization across the time-steps to compare the impact of the algorithmic change first. We see that, despite the more expensive iterations, PFASST-ER with \(\mathbf {Q}\) already shows a lower runtime than the original PFASST method in this example. This is also true when using \(\mathbf {Q}_\varDelta \) instead of \(\mathbf {Q}\).

At this point, we have not yet considered the additional direction of concurrency exposed by PFASST-ER. For that, we next compare different distributions of up to 24 cores on the 4 quadrature nodes and the 24 time-steps. All divisors of 24 were tested as total core counts, with all possible distributions between nodes and steps.

Fig. 4: Runtimes in seconds (first number) and efficiencies (second number) with different distribution of cores using PFASST-ER with \(\mathbf {Q}\) for the Allen–Cahn equation

The two plots in Fig. 4 show different combinations of cores used for step-parallelization (x-axis) and for node-parallelization (y-axis) with PFASST-ER and \(\mathbf {Q}\). Multiplying the numbers on both axes gives the total number of cores used for this simulation. This is also the reason why there are two plots: not all combinations are possible or meaningful. Within each colored block, the total runtime (in seconds, first number) and the parallel efficiency (second number) for this setup are given. We can see that using all available cores for parallelization across the steps is by far not the most efficient choice. In turn, no more than 4 cores can be used for parallelization across the nodes, and using all 4 gives the best speedup. Indeed, the best combination for this problem is to maximize node-parallelization first and then add step-parallelization (31.3 seconds with 4 cores on the nodes and 6 on the steps, lower picture). This is about 1.8 times faster than using 24 cores for the steps alone and more than 5 times faster than the serial PFASST-ER run.

Although using \(\mathbf {Q}\) instead of \(\mathbf {Q}_\varDelta \) in PFASST-ER is faster for this example, it is quite revealing to repeat the simulations using \(\mathbf {Q}_\varDelta \). These results are shown in Fig. 5, and it is obvious that using as many cores as possible for the parallelization across the nodes is now not the optimal strategy. Here, using 2 cores on the nodes and 12 on the steps is the most efficient combination, albeit still significantly slower than using PFASST-ER with \(\mathbf {Q}\), even with the same combination. The reason for this potentially surprising result is that the cost of solving the innermost linear systems heavily depends on the structure of these systems, in particular when using an iterative solver like GMRES. Moreover, initial guesses are a crucial factor, too. For PFASST-ER, we use the current solution at node zero of the respective time-step as the initial guess. This is particularly suitable for the first nodes, which are closest to it, but potentially less so for later ones. While both effects did not lead to significant variations in the time spent for solving the linear systems when using \(\mathbf {Q}\), they do produce a severe load imbalance when using \(\mathbf {Q}_\varDelta \). More specifically, using 4 cores for the nodes and only 1 for the time-steps, i.e. exploiting only parallelization across the nodes, the first core takes about 118.2 seconds for all linear system solves at the first node, while the last core takes about 194.6 seconds on the last node. Therefore, using 2 cores on the nodes, which enables a better load distribution, is the ideal choice. One possibility would be for core 1 to handle nodes 1 and 4 and core 2 to handle nodes 2 and 3; but because nodes 3 and 4 are very close to each other and the corresponding calculations are almost equally expensive, an alternating distribution works equally well. This is precisely what has been done for Fig. 5, leading to the best speedup with 2 cores on the nodes. For other examples, an optimal distribution might be more difficult to find.

Fig. 5: Runtimes in seconds (first number) and efficiencies (second number) with different distribution of cores using PFASST-ER with \(\mathbf {Q}_\varDelta \) for the Allen–Cahn equation

In Fig. 6 we now summarize the best results: PFASST with one inner Newton iteration in comparison to PFASST-ER using \(\mathbf {Q}_\varDelta \) with 2 cores on the nodes and PFASST-ER using \(\mathbf {Q}\) with 4 cores on the nodes. The plot shows the simulation time for each variant based on the number of processors used in total. We see that PFASST-ER is always much more time-efficient than PFASST, with another significant gain when using \(\mathbf {Q}\) instead of \(\mathbf {Q}_\varDelta \). Now, since PFASST-ER adds another direction of parallelization compared to PFASST, we can not only increase parallel efficiency as shown, but also extend the number of usable cores to obtain a better time-to-solution. This has been done in Fig. 7: taking 48 or 96 cores in total further reduces the computing time for 24 time-steps. With PFASST-ER, the number of resources that can be used for parallel-in-time integration is no longer limited by the number of time-steps, but can be increased by the factor given by the number of quadrature nodes.

Fig. 6: Runtimes for the three best variants, Allen–Cahn example

Fig. 7: Runtimes for different number of processors, Allen–Cahn example

4.2 Gray-Scott equations

The second example we present here is the Gray-Scott system [16], which is given by

$$\begin{aligned} u_t&= D_u \varDelta u -2uv + F(1-u),\\ v_t&= D_v \varDelta v +2uv - (F+K)v, \end{aligned}$$

on the spatial domain \([0,1]\times [0,1]\), with periodic boundary conditions. As initial condition we choose a circle with radius 0.05 centred in the spatial domain, where \(u=0.5\) and \(v=0.25\) on the inside, and \(u=1.0\) and \(v=0\) outside of this circle. We use \(D_u=10^{-4}\), \(D_v=10^{-5}\) and set a feed rate of \(F=0.0367\) and a kill rate of \(K=0.0649\). This leads after some time to a process similar to cellular division and is known as “mitosis”. We discretize the spatial domain with 128 points in each dimension on the fine level and with 64 on the coarse one, using standard finite differences. We discretize every time-step of size \(\varDelta t =1\) with 4 quadrature nodes and run the simulation again for 24 time-steps.

The results are similar to the ones for the Allen–Cahn equation in the previous section. We will omit the case of PFASST with more than one inner Newton iteration, though.

Fig. 8: Number of linear solves for the Gray-Scott example, all methods run serial on the nodes

We start again by looking at the total number of linear solves the different algorithms need to perform. Figure 8 shows the number of linear solves for the methods, which run until a residual tolerance of \(10^{-12}\) is reached. The results look quite similar to the ones for the previous example, with one critical difference: the advantage of the \(\mathbf {Q}\)-variant of PFASST-ER over the other algorithms shrinks more rapidly the more parallel time-steps are used. There is no obvious explanation (at least, obvious to us) for this behavior, though. The more time-steps are approximated simultaneously, the less suitable \(u_0\) is as an initial value for the more distant time-steps. Although the full Newton and the Quasi-Newton methods differ by an order of convergence in theory, in our scenario this seems to matter only for good initial values. One can therefore expect the runtime of PFASST-ER with \(\mathbf {Q}\) to increase here, whereas it stayed about the same in the case of the Allen–Cahn example.

Fig. 9: Time to solution for Gray-Scott with parallelization only across time-steps

This is precisely what we can see in Fig. 9. The more parallel time-steps are run, the less efficient PFASST-ER with \(\mathbf {Q}\) becomes. Already at 3 parallel steps, it is as costly as the original PFASST version, at least when parallelization across the nodes is not considered.

Now, adding node-parallelization, the findings are again similar to the ones in the previous section: Figure 10 shows that PFASST-ER with \(\mathbf {Q}\) is still more efficient than using PFASST. In particular, using more cores on the nodes is better and the best combination is again 4 cores on the nodes and 6 on the steps. Again, this changes when considering PFASST-ER with \(\mathbf {Q}_\varDelta \) as in Fig. 11, where the ideal setup uses only 2 cores on the nodes, but 12 on the steps. This is again due to load imbalances of the innermost linear solves. However, note the key difference to the previous results: The fastest run of the \(\mathbf {Q}_\varDelta \)-variant is now faster than the one of the \(\mathbf {Q}\)-variant.

Fig. 10: Runtimes in seconds (first number) and efficiencies (second number) with different distribution of cores using PFASST-ER with \(\mathbf {Q}\) for the Gray-Scott equations

Fig. 11: Runtimes in seconds (first number) and efficiencies (second number) with different distribution of cores using PFASST-ER with \(\mathbf {Q}_\varDelta \) for the Gray-Scott equations

Fig. 12: Runtimes for the three best variants, Gray-Scott example

In Fig. 12 we now give an overview of the best results: if we use parallelism across the nodes in a suitable way, both PFASST-ER versions are more efficient in terms of simulation time than the classical PFASST algorithm. Both can be used to extend the scaling capabilities beyond the number of time-steps, and both scale rather well in this regime. Note, however, that the \(\mathbf {Q}_\varDelta \)-variant can here only leverage \(2\times 24\) cores. It is then faster than the \(\mathbf {Q}\)-variant with twice as many cores.

5 Conclusion and outlook

Today’s supercomputers are designed with an ever-increasing number of processors. Therefore we need our software and the underlying numerical algorithms to handle this increasing degree of parallelism. Time-parallel integrators are one promising research direction, with quite a number of different approaches. Some approaches parallelize each individual time-step and others act on multiple time-steps simultaneously. In this paper we have introduced a solver that works in parallel across the method as well as across the steps. More precisely, we combine node-parallel spectral deferred corrections with the parallel full approximation scheme in space and time. While PFASST allows one to compute multiple time-steps simultaneously and targets large-scale parallelism in time, the new version called PFASST-ER presented here extends this idea with an efficient small-scale parallelization within every single time-step. The scaling studies show that a combination of both concepts seems to be the most efficient way to solve time-dependent PDEs. Here we tested two different preconditioners: one using the traditional, triangular quadrature matrix \(\mathbf {Q}_\varDelta \), generated by an LU decomposition, and one using the original matrix \(\mathbf {Q}\). Both can be diagonalized and used as parallel-across-the-nodes preconditioners. For the \(\mathbf {Q}_\varDelta \)-preconditioner, we saw load imbalances when using an inner iterative linear solver, but by grouping nodes we can still speed up the simulation beyond the number of parallel time-steps. For the \(\mathbf {Q}\)-preconditioner, the overall number of iterations was lower and the time-to-solution was shorter. Adding node-parallelization, parallel efficiency can be increased and speedup extended when compared to PFASST. Both PFASST-ER versions lead in the end to better scaling results than the classical PFASST algorithm. The \(\mathbf {Q}\)-variant of PFASST-ER in particular offers an almost equal distribution of work for iterative linear solvers across the individual quadrature nodes of a time-step. This makes the algorithm particularly flexible, as it can be used with any number of quadrature nodes.

PFASST-ER is particularly favorable if an increase in parallelism across the steps would lead to a severe increase in the number of iterations. This could be due to e.g. the type of the equation or the coarsening strategy. During our experiments we saw that it is not clear a priori which combination of node- and step-parallelization is the most efficient one. This could lead to many, potentially irrelevant runs to find the sweet spot. Here, a performance model and a suitable convergence theory are needed to at least narrow down the relevant options. This has to be accompanied by more numerical tests, relating e.g. model parameters with load imbalances, to identify the limits of this approach.