PFASST-ER: Combining the Parallel Full Approximation Scheme in Space and Time with parallelization across the method

To extend prevailing scaling limits when solving time-dependent partial differential equations, the parallel full approximation scheme in space and time (PFASST) has been shown to be a promising parallel-in-time integrator. Similar to a space-time multigrid, PFASST is able to compute multiple time-steps simultaneously and is therefore in particular suitable for large-scale applications on high performance computing systems. In this work we couple PFASST with a parallel spectral deferred correction (SDC) method, forming an unprecedented doubly time-parallel integrator. While PFASST provides global, large-scale"parallelization across the step", the inner parallel SDC method allows to integrate each individual time-step"parallel across the method"using a diagonalized local Quasi-Newton solver. This new method, which we call"PFASST with Enhanced concuRrency"(PFASST-ER), therefore exposes even more temporal parallelism. For two challenging nonlinear reaction-diffusion problems, we show that PFASST-ER works more efficiently than the classical variants of PFASST and can be used to run parallel-in-time beyond the number of time-steps.


Introduction
The efficient use of modern high performance computing systems for solving space-time-dependent differential equations has become one of the key challenges in computational science. Exploiting the exponentially growing number of processors using traditional techniques for spatial parallelism becomes problematic when, for example, for a fixed problem size communication costs starts to dominate. Parallel-in-time integration methods have recently been shown to provide a promising way to extend these scaling limits.
As one example, the "Parallel Full Approximation Scheme in Space and Time" (PFASST) by  allows to integrate multiple time-steps simultaneously by using inner iteration of spectral deferred corrections (SDC) on a spacetime hierarchy. It works on the so called composite collocation problem, where each time-step includes a further discretization through quadrature nodes. This "parallelization across the steps" approach (Burrage, 1997) targets large-scale parallelization on top of saturated spatial parallelization of partial differential equations (PDEs), where parallelization in the temporal domain acts as a multiplier for standard parallelization techniques in space. In contrast, "parallelization across the method" approaches (Burrage, 1997) try to parallelize the integration within an individual time-step. While this typically results in smaller-scale parallelization in the time-domain, parallel efficiency and applicability of these methods are often more favorable. Most notably, the "revisionist integral deferred correction method" (RIDC) by Christlieb et al. (Christlieb et al., 2010) makes use of integral deferred corrections (which are indeed closely related to SDC) in order to compute multiple iterations in a pipelined way. In (Speck, 2018), different approaches for parallelizing SDC across the method have been discussed, allowing the simultaneous computation of updates on multiple quadrature nodes. A much more structured and complete overview of parallel-in-time integration approaches can be found in (Gander, 2015). In addition, the Parallel-in-Time website (https://parallel-in-time.org) offers a comprehensive list of references.
The key goal of parallel-in-time integrators is to expose additional parallelism in the temporal domain in the cases where classical strategies like parallelism in space are either already saturated or not even possible. In (Clarke et al., 2019) the classical Parareal method (Lions et al., 2001) is used to overcome the scaling limit of a space-parallel simulation of a kinematic dynamo on up to 1600 cores. The multigrid extension of Parareal, the "multigrid reduction in time" method (MGRIT), has been shown to provide significant speedup beyond spatial parallelization (Falgout et al., 2017) for a multitude of problems. Using PFASST, a space-parallel Nbody solver has been extended in (Speck et al., 2012) to run on up to 262 244 cores, while in (Ruprecht et al., 2013) it has been coupled to a space-parallel multigrid solver on up to 458 752 cores.
So far, parallel-in-time methods have been implemented and tested either without any additional parallelization techniques or in combination with spatial parallelism. The goal for this work is to couple two different parallel-in-time strategies in order to extend the overall temporal parallelism exposed by the resulting integrator. To this end, we take the diagonalization idea for SDC presented in (Speck, 2018) (parallel across the method) and use it within PFASST (parallel across the steps). This way we create an algorithm that computes approximations for different time-steps simultaneously but also works in parallel on each time-step itself. Doing so we combine the advantages of both parallelization techniques and create the "Parallel Full Approximation Scheme in Space and Time with Enhanced concuRrency" (PFASST-ER), an unprecedented doubly time-parallel integrator for PDEs. In the next section we will first introduce SDC and PFASST from an algebraic point of view, following (Bolten et al., 2017(Bolten et al., , 2018. We particularly focus on nonlinear problems and briefly explain the application of a Newton solver within PFASST. Then, this Newton solver is modified in Section 3 so that by using a diagonalization approach the resulting Quasi-Newton method can be computed in parallel across the quadrature nodes of each time-step. In Section 4, we compare different variants of this idea to the classical PFASST implementation along the lines of two nonlinear reaction-diffusion equations. We show parallel runtimes for different setups and evaluate the impact of the various Newton and diagonalization strategies. Section 5 concludes this work with a short summary and an outlook.

Parallelization across the steps with PFASST
We focus on an initial value problem In order to keep the notation simple, we do not consider systems of initial value problems for now, where u(t) ∈ R N . Necessary modifications will be mentioned where needed. In a first step, we now discretize this problem in time and review the idea of single-step, time-serial spectral deferred corrections (SDC).

Spectral deferred corrections
For one time-step on the interval [t l , t l+1 ] the Picard formulation of Equation (1) is given by To approximate the integral we use a spectral quadrature rule. We define M quadrature nodes τ l,1 , ..., τ l,M , which are given by t l ≤ τ l,1 < ... < τ l,M = t l+1 . We will in the following explicitly exploit the condition that the last node is equal to the right integral boundary. Quadrature rules like Gauß-Radau or Gauß-Lobatto quadrature satisfy this property. We can then approximate the integrals from t l to the nodes τ l,m , such that where u l,m ≈ u(τ l,m ), ∆t = t l+1 − t l and q m,j represent the quadrature weights for the interval [t l , τ l,m ] such that We combine these M equations into one system which we call the "collocation problem". Here, u l = (u l,1 , ..., u l,M ) T ≈ (u(τ l,1 ), ..., u(τ l,M )) T ∈ R M , u l,0 = (u l,0 , ..., u l,0 ) T ∈ R M , Q = (q ij ) i,j ∈ R M ×M is the matrix gathering the quadrature weights and the vector function f : R M → R M is given by f (u l ) = (f (u l,1 ), ..., f (u l,M )) T .
To simplify the notation we define We note that for u(t) ∈ R N , we need to replace Q by Q ⊗ I N .
System (3) is dense and a direct solution is not advisable, in particular if f is a nonlinear operator. The spectral deferred correction method solves the collocation problem in an iterative way. While it has been derived originally from classical deferred or defect correction strategies, we here follow (Huang et al., 2006;Weiser, 2014;Ruprecht and Speck, 2016) to present SDC as preconditioned Picard iteration. A standard Picard iteration is given by for k = 0, . . . , K, and some initial guess u 0 l . In order to increase range and speed of convergence, we now precondition this iteration. The standard approach to preconditioning is to define an operator P sdc f , which is easy to invert but also close to the operator of the system. We define this "SDC preconditioner" as so that the preconditioned Picard iteration reads The key for defining P sdc f is the choice of the matrix Q ∆ . The idea is to choose a "simpler" quadrature rule to generate a triangular matrix Q ∆ such that solving System (4) can be done by forward substitution. Common choices include the implicit Euler method or the so-called "LU-trick", where the LU decomposition of is used (Weiser, 2014). System (4) establishes the method of spectral deferred corrections, which can be used to approximate the solution of the collocation problem on a single timestep. In the next step, we will couple multiple collocation problems and use SDC to explain the idea of the parallel full approximation scheme in space and time.

Parallel full approximation scheme in space and time
The idea of PFASST is to solve a "composite collocation problem" for multiple time-steps at once using multigrid techniques and SDC for each step in parallel. This composite collocation problem for L time-steps can be written as where the matrix H ∈ R M ×M on the lower subdiagonal transfers the information from one time-step to the next one. It takes the value of the last node τ l,M of an interval [t l , t l+1 ], which is by requirement equal to the left boundary t l+1 of the following interval [t l+1 , t l+2 ], and provides it as a new starting value for this interval. Therefore, the matrix H contains the value 1 on every position in the last column and zeros elsewhere. To write the composite collocation problem in a more compact form we define the vector u = (u 1 , ..., u L ) T ∈ R LM , which contains the solution at all quadrature nodes at all time-steps, and the vector b = (u 0,0 , 0, ..., 0) T ∈ R LM , which contains the initial condition for all nodes at the first interval and zeros elsewhere. We define F : R LM → R LM as an extension of f so that F (u) = (f (u 1 ), . . . , f (u L )) T . Then, the composite collocation problem can be written as where the matrix E ∈ R L×L just has ones on the first subdiagonal and zeros elsewhere. If u ∈ R N , we need to replace H by H ⊗ I N .
SDC can be used to solve the composite collocation problem by forward substitution in a sequential way. As a parallel-in-time integrator PFASST is an attractive alternative. The first step from SDC towards PFASST is the introduction of multiple levels, which are representations of the problem with different accuracies in space and time. In order to simplify the notation we focus to a two-level scheme consisting of a fine and a coarse level. Coarsening can be achieved for example by reducing the resolution in space, by decreasing the number of quadrature nodes on each interval or by solving implicit systems less accurately. For this work, we only consider coarsening in space, i.e., by using a restriction operator R on a vector u ∈ R N we obtain a new vectorũ ∈ RÑ . Vice versa, the interpolation operator T is used to interpolate values fromũ to u. Operators, vectors and numbers on the coarse level will be denoted by a tilde to avoid further index cluttering. Thus, the composite collocation operator on the coarse-level is given byC F . While C F is defined on R LM N ,C F acts on R LMÑ withÑ ≤ N , but as before we will neglect the space dimension in the following notation. The extension of the spatial transfer operators to the full space-time domain is given by R = I ⊗ R and T = I ⊗ T .
The main goal of the introduction of a coarse level is to move the serial part of the computation to this hopefully cheap level, while being able to run the expensive part in parallel. For that, we define two preconditioners: a serial one with a lower subdiagonal for the coarse level and a parallel, block-diagonal one for the fine level. The serial preconditionier for the coarse level is defined byP or, in a more compact way, bỹ Inverting this corresponds to a single inner iteration of SDC (a "sweep") on step 1, then sending forward the result to step 2, an SDC sweep there and so on. The parallel preconditioner on the fine level then simply reads Applying P F on the fine level leads to L decoupled SDC sweeps, which can be run in parallel. For PFASST, these two preconditioners and the levels they work on are coupled using a full approximation scheme (FAS) known from nonlinear multigrid theory (Trottenberg et al., 2000). Following (Bolten et al., 2017) one iteration of PFASST can then be formulated in four steps: 1. the computation of the FAS correction τ k , including the restriction of the fine value to the coarse level computation time predictor Fig. 1: Schematic view of PFASST on four processors. The figure was created with pfasst-tikz (Koehler, 2015).
2. the coarse sweep on the modified composite collocation problem on the coarse level 3. the coarse grid correction applied to the fine level value 4. the fine sweep on the composite collocation problem on the fine level In Figure 1, we see a schematic representation of the described steps. The time-step parallel procedure, which we describe here is also the same for all PFASST versions, that we will introduce later. It is common to use as many processors as time-steps: In the given illustration four processors work on four time-steps. Therefore the temporal domain is divided into four intervals, which are assigned to four processors P 0 , ..., P 3 . Every processor performs SDC sweeps on its assigned interval on alternating levels. The big red blocks represent fine sweeps, given by Equation (9), and the small blue blocks coarse sweeps, given by Equation (7). The coarse sweep over all intervals is a serial process: after a processor finished its coarse sweeps, it sends forward its results to the next processor, that take this result as an initial value for its own coarse sweeps. We see the communication in the picture represented by small arrows, which connect the coarse sweeps of each interval. In formula (7), the need for communication with a neighboring process is obvious, becauseP F is not a (block-) diagonal matrix, but has entries on its lower block-diagonal. P F on the other hand is block-diagonal, which means that the processors can calculate on the fine level in parallel. We see in Formula (9) that there is only a connection to previous time-steps through the right-hand side, where we gather values from the previous time-step and iteration but not from the current iteration. The picture shows this connection by a fine communication, which forwards data from each fine sweep to the following fine sweep of the right neighbor. The fine and coarse calculations on every processor are connected through the FAS corrections, which in our formula are part of the coarse sweep.

PFASST-Newton
For each coarse and each fine sweep within each PFASST iteration, System (7) and System (9), respectively, need to be solved. If f is a nonlinear function these systems are nonlinear as well. The obvious and traditional way to proceed in this case is to linearize the problem using Newton's method. This way, PFASST is the outer solver with an inner Newton iteration. For triangular Q ∆ , the mth equation on the lth time-step on the coarse level reads whereũ k+1 0,0 =ũ 0,0 andc(ũ k ) l,m is the mth entry the lth block ofc(ũ k ) := (P F −C F )(ũ k ) + τ k . This term gathers all values of the previous iteration. The first summand of the right-hand side of the coarse level equation corresponds to theb and theH, while the following sum comes from the lower triangular structure ofQ ∆ .
For time-step l these equations can be solved one by one using Newton iterations and forward substitution. This is inherently serial, because the solution on the mth quadrature node depends on the solution at all previous nodes through the sum. Thus, while running parallel across the steps, each of the solution of the local collocation problems is found in serial. In the next section, we will present a novel way of applying Newton's method, which allows to parallelize this part across the collocation nodes, joining parallelization across the step with parallelization across the method. We call this method PFASST-ER: the "Parallel Full Approximation Scheme in Space and Time with Enhanced concuRrency".

PFASST-ER
From the perspective of a single time-step [t l , t l+1 ] or processor P l , equation (7) on the coarse level for this step reads where τ k l is the lth component of τ k , belonging to the interval [t l , t l+1 ]. Note that the serial dependency is given by the termũ k+1 l,0 , so that it does not depend on the solutionũ k+1 l of this equation and can thus be considered as part of a given right-hand side. On the fine level, this is even simpler, because there we have to solve making the u k+ 1 2 l,0 -term not even dependent on the current iteration (which, of course, leads to the parallelism on the fine level).
As we have seen above, the typical strategy would be to solve these systems line by line, node by node, using forward substitution and previous PFASST iterates as initial guesses. An alternative approach has been presented in (Speck, 2018), where each SDC iteration can be parallelized across the node. While this is trivial for linear problems, nonlinear ones require the linearization of the full equations, not node-wise as before. For the fine sweep, let then a Newton step for G sdc f (v) = 0 is given by for Jacobian matrix ∇f (v j ) of f evaluated at v j which in turn is given by There is still no parallelism to exploit, but when we replace the full Jacobian matrix ∇f (v j ) by the approximation f (v l,0 )I M , which is the derivative of f at the initial value for the current time-step, we can use to establish a Quasi-Newton iteration as This decouples the evaluation of the Jacobian matrix from the current quadrature nodes and now Q ∆ can be diagonalized, so that the inversion of ∇G ∆-QN f (v l,0 ) can be parallelized across the nodes. Note that there are other options for approximating the full Jacobian matrix. Most notably, in (Gander et al., 2016) the mean over all Jacobian matrices is used (there across the time-steps). We did not see any impact on the convergence when following this strategy, most likely because the number of quadrature nodes is typically rather low. The advantage of using the initial value is that it reduces the number of evaluations of the Jacobian matrix, which also includes communication time. With This can be iterated until a certain threshold is reached and then set u k+1 l = v J to obtain the solution of the equation for the fine sweep. On the coarse level, the procedure is very similar, with a slightly different definition ofG sdc f (ṽ). In practice, choosing J = 1 is sufficient, because this is already the inner solver for an outer PFASST iteration.
This linearization and diagonalization strategy immediately suggests a second approach: instead of using Q ∆ for the preconditioner, we can use the original quadrature matrix Q directly. The intention of using Q ∆ in the first place was to obtain a preconditioner which allowed inversion using forward substitutions. Now, with diagonalization in place, this is no longer necessary. Instead, we can use l,0 . Note that this is just the lth block of the original composite collocation problem. Following the same ideas as before, we end up with which can be diagonalized using Q = VΛV −1 . The same idea can be applied to the coarse level sweep, of course. As a result, the original nonlinear SDC sweeps within PFASST are now replaced by Quasi-Newton iterations which can be done parallel across the nodes. We refer to (Speck, 2018) for more details on the idea of parallel SDC sweeps with Q and Q ∆ .
The question now is, how much the approximation of the Jacobians affects the convergence and runtime of the method and how all this compares to standard PFASST iterations with nonlinear SDC. It is well known that for suitable right-hand sides and initial guesses the standard, unmodified Newton method converges quadratically while the Quasi-Newton method as well as SDC show linear convergence, see e.g. (Kelley, 1995;Jackson et al., 1994;Tang et al., 2013). We will examine the impact of these approaches in the following section along the lines of two numerical examples. A more rigorous mathematical analysis is currently ongoing work, as it can be embedded into a larger convergence theory for PFASST with inner Newton-type solvers.

Numerical Results
We apply PFASST and PFASST-ER to two different, rather challenging reaction-diffusion problems, starting with a detailed analysis of the parallelization strategies for the Allen-Cahn equation and highlighting differences to these findings for the Gray-Scott equations.

Allen-Cahn equation
We study the two-dimensional Allen-Cahn equation, which is given by on the spatial domain [−0.5, 0.5] 2 and with initial condition We use simple second-order finite differences for discretization in space and take 256 elements in each dimension on the fine level and 128 on the coarse one. We furthermore use M = 4 Gauß-Radau nodes, set ε = 0.04, ∆t = 0.001 < ε 2 and stop the simulation after 24 time-steps at T = 0.024. The initial condition describes a circle with a radius R 0 = 0.25, see e.g. (Zhang and Du, 2009).
The results we present in the following were computed with pySDC (Speck, 2019a,b) on the supercomputer JURECA (Jülich Supercomputing Centre, 2016). We run a serial single-level simulation using SDC ("SL" in the plots), a serial multi-level simulation using multilevel SDC ("ML", which is PFASST on one processor, see (Speck et al., 2015)) and parallel simulations with 2, 4, 8, 12 and 24 processors ("P2" to "P24"), all until a given residual tolerance of 10 −10 is reached. In Figure 2 we show the number of linear solves for different versions of the solvers, aggregated over all time-steps, quadrature nodes, outer and inner iterations. Here, two versions of the original PFASST algorithm are run: The first one performs exactly one inner Newton iteration in every PFASST iteration; this version is labeled as "PFASST: 1 iter". In contrast, "PFASST: N iter" performs so many inner Newton iterations that the residual of the nonlinear inner problem is smaller than 10 −11 . Both PFASST versions use the quadrature matrix Q LU ∆ from Equation (5) inside the preconditioner. For PFASST-ER we also differentiate between two variants: The PFASST-ER algorithm, which uses the original Q inside the preconditioner is labeled as "PFASST-ER: Q" and the one which uses Q LU ∆ is labeled as "PFASST-ER: Q ∆ ". Solving the innermost linear systems is done using GMRES.
We can see that performing more than one inner Newton iteration ("PFASST: N iter" vs. "PFASST: 1 iter") does not improve the convergence of the overall algorithm. On the contrary more iterations are needed to achieve the same result. Using the Quasi-Newton approach with the same preconditioner instead of the classical Newton solver ("PFASST-ER: Q ∆ " vs. "PFASST: 1 iter") only shows little effect on the total iteration numbers, but using the original quadrature matrix Q instead of Q LU ∆ inside the preconditioner ("PFASST-ER: Q" vs. "PFASST-ER: Q ∆ ") greatly reduces the number of iterations.
However, without parallelization one iteration of PFASST-ER with Q is in general more expensive as one iteration of one of the other algorithms, because it requires the solution of a full system via diagonalization instead of stepping through a triangular system via forward substitution. In Figure 3 lower number of higher costly iterations actually pays off. The plot shows results for the same setup as Figure  2, but now we focus on the runtime instead of the iteration numbers. We only consider parallelization across the time-steps to compare the impact of the algorithmic change first. We see that despite the fact that the iterations are much more expensive, PFASST-ER with Q already in this example shows a lower runtime than the original PFASST method. This is also true when using Q ∆ instead of Q.
Until now we did not yet consider the additional direction of concurrency exposed by PFASST-ER. For that, we next compare different distributions of up to 24 cores on the 4 quadrature nodes and the 24 time-steps. The two plots in Figure 4 show different combinations of cores used for step-parallelization (x-axis) and for node-parallelization (y-axis) with PFASST-ER and Q. Multiplying the numbers on both axes gives the total number of cores used for this simulation. This is also the reason why there are two plots, because not all combinations are actually possible or meaningful. Within each colored block the total runtime for this setup is given. We can nicely see that using all available cores for parallelization across the step is by far not the most efficient way. In turn, more than 4 cores cannot be used for parallelization across the nodes, although this gives the best speedup. Indeed, the best combination for this problem is to maximize node-parallelization first and then add step-parallelization (31.3 seconds with 4 cores on the nodes and 6 on the steps, lower picture). This is about 1.8 times faster than using 24 cores for the steps alone and more than 5 times faster than the serial PFASST-ER run.  Although using Q instead of Q ∆ in PFASST-ER is faster for this example, it is quite revealing to repeat the simulations using Q ∆ .
These results are shown in Figure 5 and it is obvious that using as many cores as possible for the parallelization across the nodes now is not the optimal strategy. Here, using 2 cores on the nodes and 12 on the steps is the most efficient combination, albeit still significantly slower than using PFASST-ER with Q, even with the same combination. The reason for this potentially surprising result is that solving the innermost linear systems heavily depends on the structure of these systems, in particular when using an iterative solver like GM-RES. Moreover, initial guesses are a crucial factor, too. For PFASST-ER, we use the current solution at node zero of the respective time-step as initial guess. This is particularly suitable for the closest first nodes, but potentially less for later ones. While both effects did not lead to significant variations in the time spent for solving the linear systems when using Q, it does produce a severe load imbalance when using Q ∆ . More specifically, using 4 cores for the nodes and only 1 for the time-steps, i.e. exploiting only parallelization across the nodes, the first core takes about 118.2 seconds for all linear system solves together at the first node, while  the last core takes about 194.6 seconds on the last node. Therefore, using 2 cores on the nodes, where core 1 deals with nodes 1 and 3 and core 2 with 2 and 4 is the ideal choice. This is precisely what has been done for Figure 5,leading to the best speedup with 2 cores on the nodes.
In Figure 6 we now summarize the best results: PFASST with one inner Newton iteration in comparison to PFASST-ER using Q ∆ and 2 cores on the nodes and PFASST-ER using Q with 4 cores on the node. The plot shows the simulation time for each variant based on the number of processors used in total. We see that PFASST-ER is always much more time efficient in doing the calculations than PFASST, with another significant gain when using Q instead of Q ∆ . Now, since PFASST-ER adds another direction of parallelization compared to PFASST, we can not only increase parallel efficiency as shown, but also extend the number of usable cores to obtain a better time-to-solution (but not necessarily a better parallel efficiency). This has been done in Figure  of time-steps, but can be increased by the factor given by the number of quadrature nodes.

Gray-Scott equations
The second example we present here is the Gray-Scott system (Pearson, 1993), which is given by and is known as "mitosis". We refine the spatial domain with 128 points in each dimension on the fine level and with 64 on the coarse one, using standard finite differences. We discretize every time-step of size ∆t = 1 with 4 quadrature nodes and run the simulation again for 24 time-steps. The results will be presented very similar to the ones for the Allen-Cahn equation in the previous section. We will omit the case of PFASST with more than one inner Newton iteration, though.
We start again looking at the total number of linear solves the different algorithms need to perform. Figure  8 shows the number of linear solves for the methods, which run until a residual tolerance of 10 −12 is reached. The results look quite similar to the ones we got for the previous example, with one critical difference: The difference between the Q-variant of PFASST-ER and the other algorithms becomes smaller more rapidly the more parallel time-steps are used. In particular, it needs about the same number of inner solves as the others for 24 cores. Thus, one can expect that the runtime will increase when using PFASST-ER with Q, while it stayed about the same in the case of the Allen-Cahn example. This is precisely what we can see in Figure 9. The more parallel time-steps are run, the less efficient PFASST-ER with Q in this variant becomes. Already at 3 parallel steps, it is as costly as the original PFASST version, at least when parallelization across the nodes is not considered. Now, adding node-parallelization, the findings are again similar to the ones in the previous section: Figure 10 shows that PFASST-ER with Q is still more efficient than using PFASST. In particular, using more cores on the nodes is better and the best combination is again 4 cores on the nodes and 6 on the steps. Again, this changes when considering PFASST-ER with Q ∆   as in Figure 11, where the ideal setup uses only 2 cores on the nodes, but 12 on the steps. This is again due to load imbalances of the innermost linear solves. However, note the key difference to the previous results: The fastest run of the Q ∆ -variant is now faster than the one of the Q-variant.
In Figure 12 we now give an overview about the best results: If we use parallelism across the nodes in  a suitable way, both PFASST-ER versions are more efficient based on the simulation time than the classical PFASST algorithm. Both can be used to extend the scaling capabilities beyond the number of time-steps, and both scale rather well in this regime. Note, however, that the Q ∆ -variant can here only leverage 2 × 24 cores. It is then faster than the Q-variant with twice as many cores.

Conclusion and outlook
Nowadays supercomputers are designed with an ever increasing number of processors. Therefore we need our software and the underlying numerical algorithms to handle this increasing degree of parallelism. Time-parallel integrators are one promising research direction, with quite a number of different approaches. Some approaches parallelize each individual time-step and others act on multiple time-steps simultaneously. In this paper we have introduced a solver that works parallel across the method as well as parallel across the steps. More precisely, we could combine node-parallel spectral deferred corrections with the parallel full approximation scheme in space and time. While PFASST allow to compute multiple time-steps simultaneously and target large-scale parallelism in time, the new version called PFASST-ER presented here extend this idea with a very efficient small-scale parallelization for every single time-step itself. The scaling studies showed that a combination of both concepts seems to be the most efficient way to solve time-dependent PDEs. Here we tested two different preconditioners: ones using the traditional, triangular quadrature matrix Q ∆ and one using the original matrix Q. Both could be diagonalized and used as parallel-across-the-node preconditioners. For the Q ∆preconditioner, we saw load imbalances when using an inner iterative linear solver, but by grouping nodes we still could speed up the simulation beyond the number of parallel time-steps. For the Q-preconditioner, the overall number of iterations was lower and time-tosolution was faster. Adding node-parallelization, parallel efficiency could be increased and speedup extended when compared to PFASST. Both PFASST-ER versions lead in the end to better scaling than the classical PFASST algorithm.
During our experiments we saw that it is not clear a priori, which combination of node-and step-parallelization is the most efficient one. This leads to a lot of, potentially irrelevant runs to find the sweet spot. Here, a performance model and a suitable convergence theory are needed to at least narrow down the relevant options. This has to be accompanied by more numerical tests, relating e.g. model parameters with load imbalances, to identify the limits of this approach.