GParareal: a time-parallel ODE solver using Gaussian process emulation

Sequential numerical methods for integrating initial value problems (IVPs) can be prohibitively expensive when high numerical accuracy is required over the entire interval of integration. One remedy is to integrate in a parallel fashion, “predicting” the solution serially using a cheap (coarse) solver and “correcting” these values using an expensive (fine) solver that runs in parallel on a number of temporal subintervals. In this work, we propose a time-parallel algorithm (GParareal) that solves IVPs by modelling the correction term, i.e. the difference between fine and coarse solutions, using a Gaussian process emulator. This approach compares favourably with the classic parareal algorithm and we demonstrate, on a number of IVPs, that GParareal can converge in fewer iterations than parareal, leading to an increase in parallel speed-up. GParareal also manages to locate solutions to certain IVPs where parareal fails and has the additional advantage of being able to use archives of legacy solutions, e.g. solutions from prior runs of the IVP for different initial conditions, to further accelerate convergence of the method — something that existing time-parallel methods do not do.


Introduction 1.Motivation and background
This paper is concerned with the numerical solution of a system of d ∈ N ordinary differential equations (ODEs) of the form where f : [t 0 , T ] × U → R d is a nonlinear function with sufficiently many continuous partial derivatives, u : [t 0 , T ] → U is the time-dependent solution, and u 0 is the initial value at time t 0 .We seek numerical solutions U j ≈ u(t j ) to the initial value problem (IVP) in (1.1) on a pre-defined mesh t = (t 0 , . . ., t J ), where t j+1 = t j + ∆T for fixed ∆T = (T − t 0 )/J.
More specifically, we are concerned with IVPs where: (i) the interval of integration, [t 0 , T ]; (ii) the number of mesh points, J + 1; or (iii) the wallclock time to evaluate the vector field, f , is so large that such numerical solutions take hours, days, or even weeks to obtain using classical sequential integration methods, e.g.implicit/explicit Runge-Kutta methods (Hairer et al., 1993).Expensive vector fields f can, for example, arise when (spatially) discretising partial differential equations (PDEs) into a system of ODEs.Runtime issues also arise when solving IVPs with spatial or other non-temporal dependencies in that, even though highly efficient domain decomposition methods exist (Dolean et al., 2015), the parallel speed-up of such methods on high performance computers (HPCs) is still constrained by the serial nature of the time-stepping scheme.Therefore, with the advent of exascale HPCs on the horizon (Mann, 2020), there has been renewed interest in developing more efficient and robust time-parallel algorithms to reduce wallclock runtimes for IVP simulations in applications spanning numerical weather prediction (Hamon et al., 2020), kinematic dynamo modelling (Clarke et al., 2020), and plasma physics (Samaddar et al., 2010(Samaddar et al., , 2019) ) to name but a few.In this work, we focus on the development of such a time-parallel method.
To solve (1.1) in parallel, one must overcome the causality principle of time: solutions at later times depend on solutions at earlier times.In recent years, a growing number of time-parallel algorithms, whereby one partitions [t 0 , T ] into J 'slices' and attempts to solve J smaller IVPs using J processors, have been developed to speed-up IVP simulations; see Gander (2015) and Ong and Schroder (2020) for comprehensive reviews.We take inspiration from the parareal algorithm (Lions et al., 2001), a multiple shooting-type (or multigrid (Gander and Vandewalle, 2007)) method that uses a predictor-corrector update rule based on two numerical integrators, one coarse-and one fine-grained in time, to iteratively locate solutions U k j to (1.1) in parallel.At any iteration k ∈ {1, . . ., J} of parareal, the 'correction' is given by the residual between fine and coarse solutions obtained during iteration k − 1 (further details are provided in Section 2).In a Markovian-like manner, all fine/coarse information about the solution obtained prior to iteration k − 1 is ignored by the predictor-corrector rule, a feature present in most parareal-type algorithms and variants (Ait-Ameur et al., 2020;Dai et al., 2013;Elwasif et al., 2011;Maday and Mula, 2020;Pentland et al., 2022).Our goal is to demonstrate that such "acquisition" data, i.e. fine and coarse solution information accumulated up to iteration k, can be exploited using a statistical emulator in order to determine a solution in faster wallclock time than parareal.Making use of acquisition data in parareal is mentioned briefly in the appendix of Maday and Mula (2020), in the context of spatial domain decomposition and high-order time-stepping, but has yet to be investigated further.
In particular, we use a Gaussian process (GP) emulator (O'Hagan, 1978;Rasmussen, 2004) to rapidly infer the (expensive-to-simulate) multi-fidelity correction term in parareal.The emulator is trained using acquisition data from all prior iterations, with data from the fine integrator having been obtained in parallel.Similar to parareal, we derive a predictor-corrector-type scheme where the coarse integrator makes rapid low-accuracy predictions about the solutions which are subsequently refined using a correction, now inferred from the GP emulator.In addition to using an emulator, the difference between this approach and parareal is that the new correction term is formed of integrated solutions values at the current iteration k, rather than k − 1. Supposing that the fine solver is of sufficient accuracy to exactly solve the IVP, the algorithm presented in this paper determines numerical solutions U k j that converge (assuming the emulator is sufficiently well trained) toward the exact solutions U j over a number of iterations.This new approach is particularly beneficial if one wishes to fully understand and evaluate the dynamics of (1.1) by simulating solutions for a range of initial values u 0 or over different time intervals.Firstly, if one can obtain additional parallel speedup, generating such a sequence of independent simulations becomes more computationally tractable in feasible time.Secondly, the "legacy" data, i.e. solution information accumulated between independent simulations, can be used to inform future simulations by increasing the size of the dataset available to the emulator.Being able to re-use (expensive) acquisition or legacy data to integrate IVPs such as (1.1) in parallel is not something, to the best of our knowledge, that existing time-parallel algorithms currently do.
In recent years, there has been a surge in interest in the field of probabilistic numerics (Hennig et al., 2022;Oates and Sullivan, 2019), where "ODE filters" have been developed to solve ODEs using GP regression techniques.Instead of calculating a numerical solution on the mesh t, as classical integration methods do, ODE filters return a probability measure over the solution at any t ∈ [t 0 , T ] (Bosch et al., 2021;Schober et al., 2019;Tronarp et al., 2019;Wenger et al., 2021).Such methods solve sequentially in time, conditioning the GP on acquisition data, i.e. solution and derivative evaluations, at competitive computational cost (compared to classical methods) (Kersting et al., 2020;Krämer et al., 2022).However, integrating IVPs with large time intervals or expensive vector fields using such filters is still a computationally intractable process.As such, our aim is to fuse aspects of time-parallelism with the Bayesian methods showcased in ODE filters-something briefly mentioned in Kersting and Hennig (2016) and Pentland et al. (2022), but not yet explored.Whereas ODE filters use GPs to explicitly model the solution to an IVP, we instead use them to model the residual between approximate solutions provided by the deterministic fine and coarse solvers, i.e. the parareal correction.While the method proposed in this paper does not return a probabilistic solution to (1.1), we believe that it constitutes a positive step in this direction.

Contributions and outline
The rest of this paper is structured as follows.In Section 2, we introduce parareal, providing an overview of the algorithm and its computational complexity for a scalar ODE.In Section 3, we present our algorithm, henceforth referred to as GParareal, in which we describe how a GP emulator, conditioned on acquisition data obtained in parallel throughout the simulation, is used to refine coarse numerical solutions to a scalar ODE.In addition, we detail the computational complexity of GParareal, provide a bound for its numerical error at a given iteration, and describe the extension for solving systems of ODEs.Numerical experiments are performed using HPC facilities in Section 4. We demonstrate good performance of GParareal against parareal in terms of convergence, wallclock time, and solution accuracy on a number of lowdimensional ODE problems using just acquisition data.Furthermore, we demonstrate how the GP emulator enables convergence in cases where the coarse solver is too inaccurate for parareal to converge and show that legacy simulation data can be used to obtain solutions even faster, retaining comparable numerical accuracy.We discuss the benefits, drawbacks, and open questions surrounding GParareal in Section 5.

Parareal
Here we briefly recall the parareal algorithm (Lions et al., 2001), first describing the fine-and coarse-grained numerical solvers it uses, then the algorithm itself, and finally some remarks on complexity, numerical speed-up, and choice of solvers.For a full mathematical derivation and exposition of parareal, refer to Gander and Vandewalle (2007).To simplify notation, we describe parareal for solving a scalar-valued autonomous ODE, i.e. f (t, u(t)) := f (u(t)) in (1.1), without loss of generality.

The solvers
To calculate a solution to (1.1), parareal uses two one-step1 numerical integrators.The first, referred to as the fine solver F ∆T , is a computationally expensive integrator that propagates an initial value at t j , over an interval of length ∆T , and returns a solution with high numerical accuracy at t j+1 .In this paper, we assume that F ∆T provides sufficient numerical accuracy to solve (1.1) for the solution to be considered 'exact', i.e.U j = u(t j ).The objective is to calculate the exact solutions U j = F ∆T (U j−1 ) for j = 1, . . ., J, where U 0 := u 0 , (2.1) without running F ∆T J times sequentially, as this calculation is assumed to be computationally intractable.To avoid this, parareal locates iteratively improved approximations U k j , where k = 0, 1, 2, . . . is the iteration number, that converge toward U j (note that U k 0 = U 0 = u 0 ∀k 0).To do this, parareal uses a second numerical integrator G ∆T , referred to as the coarse solver.G ∆T propagates an initial value at t j over an interval of length ∆T , however, it has lower numerical accuracy and is computationally inexpensive to run compared to F ∆T .This means that G ∆T can be run serially across a number of time slices to provide relatively cheap low accuracy solutions whilst F ∆T is permitted only to run in parallel over multiple slices.

The algorithm
To begin (iteration k = 0), approximate solutions to (1.1) are calculated sequentially using G ∆T , on a single processor, such that Following this, the fine solver propagates each approximation in (2.2) in parallel, on J processors, to obtain F ∆T (U 0 j ) for j = 0, . . ., J − 1.These values are then used (during iteration k = 1) in the predictor-corrector correct for j = 1, . . ., J. (2.3) Here, G ∆T is applied sequentially to predict the solution at the next time step, before being corrected by the residual between coarse and fine values found during the previous iteration (note that (2.3) cannot be calculated in parallel).This is a discretised approximation of the Newton-Raphson method for locating the true roots U j with initial guess (2.2) (Gander and Vandewalle, 2007).For a pre-defined tolerance ε > 0, the parareal solution U k j is deemed to have converged up to time t I if This criterion is standard for parareal (Gander and Hairer, 2008;Garrido et al., 2006), however, other criteria, e.g.taking the average relative error between fine solutions over a time slice (Samaddar et al., 2010(Samaddar et al., , 2019) ) or measuring the total energy of the system, could be used instead.Unconverged solution values, i.e.U k j for j > I, are updated in future iterations (k > 1) by initiating further parallel F ∆T runs on each U k j , followed by an update using (2.3).The algorithm stops once I = J, converging in k (out of J) iterations.The version of parareal described here and implemented in Section 4 does not iterate over solutions that have already converged, avoiding the waste of computational resources (Elwasif et al., 2011;Garrido et al., 2006;Pentland et al., 2022).Extending parareal to the full nonautonomous system in (1.1) is straightforward: see Gander and Vandewalle (2007) for notation and Pentland et al. (2022) for pseudocode.

Convergence and computational complexity
After k iterations, the solution states up to time t k (at minimum) have converged, as the exact initial condition (u 0 ) has been propagated by F ∆T at least k times.Therefore, if parareal converges in k = J iterations, then the solution will be equal to the one found by calculating (2.1) serially, at an even higher computational cost.Convergence2 in k J iterations is necessary if significant parallel speed-up is to be realised.Refer to Gander and Hairer (2008); Gander and Vandewalle (2007) for derivations of explicit parareal error bounds.
Without loss of generality, assume running F ∆T over any [t j , t j+1 ], j ∈ {0, . . ., J − 1}, takes wallclock time T F (denote time T G similarly for G ∆T ).Therefore, calculating (2.1) using F ∆T serially, takes approximately T serial = JT F seconds.Using parareal, the total wallclock time (in the worst case, excluding any serial overheads) can be approximated by (2.5) The approximate parallel speed-up is therefore To maximise (2.6), both the convergence rate k and the ratio T G /T F should be as small as possible.In practice, however, there is a trade-off between these two quantities as fast G ∆T solvers (with sufficient accuracy to still guarantee convergence) typically require more iterations to converge, increasing k.An illustration of the computational task scheduling during the first few iterations of parareal vs. a full serial integration is given in Figure 2.1-optimised scheduling of parareal is studied in Elwasif et al. (2011).Selecting a fast but accurate coarse solver remains a trial and error process, entirely dependent on the system being solved.Typically, G ∆T is chosen such that it has a coarser temporal resolution/lower numerical accuracy (Baffico et al., 2002;Farhat and Chandesris, 2003;Samaddar et al., 2010;Trindade and Pereira, 2006), a coarser spatial resolution (when solving PDEs) (Ruprecht, 2014;Samaddar et al., 2019), and/or uses simplified model equations (Engblom, 2009;Legoll et al., 2020;Meng et al., 2020) compared to F ∆T .In Section 3, we aim to widen the pool of choices for G ∆T by using a GP emulator to capture variability in the residual F ∆T − G ∆T and showcase its effectiveness by demonstrating that GParareal can converge to a solution in cases where parareal cannot in Section 4.

GParareal
In this section, we present the GParareal algorithm, in which a GP emulator is used in the analogue of parareal's predictor-corrector step.Suppose we seek the same high resolution numerical solutions to (1.1) as expressed in (2.1), denoted now as V j instead of U j .Furthermore, we denote the iteratively improved approximations from GParareal as V k j (as before, V k 0 = V 0 = u 0 ∀k 0).In parareal, the predictor-corrector (2.3) updates the numerical solutions at iteration k using a correction term based on information calculated during the previous iteration k − 1.We propose the following update rule, again based on a coarse prediction and multi-fidelity correction, that instead refines solutions using information from the current iteration k, rather than k − 1: If V k j−1 is known, the prediction is rapidly calculable, however the correction is not known explicitly without running F ∆T at expensive cost.We propose using a GP emulator to model this correction term, trained on all previously obtained evaluations of F ∆T and G ∆T .The emulator returns a Gaussian distribution over (F ∆T − G ∆T )(V k j−1 ) from which we can extract an explicit value and carry out the refinement in (3.1).
In Section 3.1, we present the algorithm, giving an explanation of the kernel hyperparameter optimisation process in Section 3.2 and providing error analysis in Section 3.3.In Section 3.4, we detail the computational complexity, remarking that given large enough runtimes for the fine solver, an iteration of GParareal runs in approximately the same wallclock time as parareal.Again, to simplify notation, we first detail GParareal for an autonomous scalar-valued ODE, i.e. f (t, u(t)) := f (u(t)) in (1.1).The extension to the multivariate nonautonomous case is described in Section 3.5.

Gaussian process emulator
Before solving (1.1), we define a GP prior (Rasmussen and Williams, 2006) over the unknown correction function F ∆T − G ∆T .This function maps an initial value x j ∈ U at time t j to the residual difference between F ∆T (x j ) and G ∆T (x j ) at time t j+1 .More formally, we define the GP prior with mean function m : U → R and covariance kernel κ : U × U → R. Given some vectors of initial values, x, x ∈ U J , the corresponding vector of means is denoted µ(x) = (m(x j )) j=0,...,J−1 and the covariance matrix K(x, x ) = (κ(x i , x j )) i,j=0,...,J−1 .The correction term is expected to be small, depending on the accuracy of both F ∆T and G ∆T , hence we define a zero-mean process, i.e. m(x j ) = 0. Ideally, the covariance kernel will be chosen based on any prior knowledge of the solution to (1.1), e.g.regularity/smoothness.If no information is available a priori to simulation, we are free to select any appropriate kernel.In this work, we use the square exponential (SE) kernel , for some x, x ∈ U. (3. 3) The kernel hyperparameters, denoting the output length scale σ 2 and input length scale 2 , are referred to collectively in the vector θ and need to be initialised prior to simulation.The algorithm proceeds as follows; see Appendix A for pseudocode.
Firstly, run G ∆T sequentially from the exact initial value, on a single processor, to locate the coarse solutions V 0 j = G ∆T (V 0 j−1 ) j = 1, . . ., J.
Use F ∆T to propagate the values in (3.4) on each time slice in parallel, on J processors, to obtain the following values at t j F ∆T (V 0 j−1 ) j = 1, . . ., J. (3.5) At this stage, we diverge from the parareal method.Given x, store the values of F ∆T − G ∆T , using (3.4) and (3.5), in the vector At this point, the inputs x and evaluations y are used to optimise the kernel hyperparameters θ via maximum likelihood estimation-see Section 3.2.Conditioning the prior (3.2) using the acquisition data x and y, the GP posterior over (F ∆T − G ∆T )(x ), where x ∈ U is some initial value in the state space, is given by (3.9) Now we wish to determine updated solutions V 1 j at each mesh point.Given F ∆T has been run once, the exact solution is known at time t 1 .Specifically, at t 0 we know V k 0 = V 0 ∀k 0 and at ) is unknown, hence we need to calculate its value without running F ∆T again.To do this, we re-write the exact solution using (3.1): Both terms in (3.10) are initially unknown, but the prediction can be calculated rapidly at low computational cost while the correction can be inferred using the GP posterior (3.7) with x = V 1 1 .Therefore, we obtain a Gaussian distribution over the solution with variance stemming from uncertainty in the GP emulator.Repeating this process to determine a distribution for the solution at t 3 by attempting to propagate the random variable V 1 2 using G ∆T is computationally infeasible for nonlinear IVPs.To tackle this and be able to propagate V 1 2 , we approximate the distribution by taking its mean value, This approximation is a convenient way of minimising computational cost, at the price of ignoring uncertainty in the GP emulator-see Section 5 for a discussion of possible alternatives.The update process, applying (3.1) and then approximating the Gaussian distribution by taking its expectation, is repeated sequentially for later t j , yielding the approximate solutions This process is illustrated in Figure 3.1.Finally, we impose stopping criteria (2.4), identifying which V 1 j for j I have converged.Using the same stopping criteria as parareal will allow us to compare the performance of both algorithms in Section 4.

Iteration k 2
If the stopping criteria is not met, i.e.I < J, we can iteratively update any unconverged solutions by re-applying the previous steps.This means calculating F ∆T (V k−1 j ), j = I, . . ., J −1, in parallel and then storing new evaluations ŷ = ( . Hyperparameters are then re-optimised and the GP is re-conditioned using all prior acquisition data, i.e. x = [x; x] and y = [y; ŷ], generating an updated posterior.Here, [a; b] denotes the vertical concatenation of column vectors a and b.The update rule is then applied such that we obtain Once I = J, the solution, the number of iterations k taken to converge, and the acquisition data x and y are returned.A key advantage of GParareal is that the acquisition data can be used in future GParareal simulations (as "legacy data") to provide the GP emulator with more data and therefore exploit additional speedup-this will be demonstrated in Section 4.

Kernel hyperparameter optimisation
The hyperparameters θ of the kernel κ will need to be optimised in light of the acquisition data y (and corresponding input data x).We optimise each element of θ such that it maximises its (log) marginal likelihood (Rasmussen, 2004).To do this, first define g(x) := (F ∆T − G ∆T )(x) and g := (g(x j )) j=0,...,N −1 , where N is the current length of x (and y).Given the evaluations y are noise-free, the likelihood of obtaining such data is p(y|g, x, θ) = δ(y − g), where δ(•) is the multidimensional Dirac delta function.The marginal likelihood, given x and θ, is therefore where N (y|0, K(x, x)) denotes the probability density function of a multivariate Gaussian distribution evaluated at y, with mean vector 0 and covariance matrix K(x, x) that depends on θ, see (3.3).The hyperparameters in θ can then be estimated numerically by maximising the log marginal likelihood using any gradient-based optimiser.Optimisation is carried out once per iteration (up until the hyperparameters do not change significantly between iterations) and hyperparameters from the prior iteration are used as to start the optimisation at the current iteration.

Error Analysis
In this section, we are interested in analysing the absolute error (3.13) between the exact and GParareal solution at iteration k and time t j .We show that this error has an upper bound proportional to the fill distance (defined below) of the dataset at iteration k.To do this, we now denote the input dataset at iteration k as x k rather than x (because the dataset size strictly increases with each iteration of GParareal) and, similarly, denote the output dataset y as y k .We also introduce some assumptions on the solvers F ∆T and G ∆T , and a known result on the consistency of the GP posterior mean μ (3.8) to the true correction function g = F ∆T − G ∆T .For clarity, we re-state the GParareal update rule (3.14)

Preparatory assumptions and results
First, we state some assumptions on F ∆T and G ∆T , as in Gander and Hairer (2008).
Assumption 3.1 (Exact fine solver F ∆T ).F ∆T solves (1.1) exactly such that G ∆T is a one-step numerical solver with uniform local truncation error O(∆T p+1 ), for p 1, such that for u ∈ R and continuously differentiable functions c i (u), i = 1, 2, . ... For u, v ∈ R, we can then write where C 1 > 0 is the Lipschitz constant for c 1 (u).
for u, v ∈ R and some L G > 0.
Next, we define the concepts required to state a result on the consistency of the GP posterior mean.Firstly, we define the fill distance h x k to be the largest smallest distance between any point v ∈ U and any point v i ∈ x k , i.e.
It should be clear that each data point v i ∈ x k is also contained in U. Secondly, we define the reproducing kernel Hilbert space (RKHS), a Hilbert space H κ (U) of functions g : U → R with inner product •, • Hκ(U ) .See Stuart and Teckentrup (2018) for a more formal definition and conditions on the inner product itself.We can now state the following result on the GP posterior mean consistency, adapted from Wendland (2004, Theorem 11.14).
Theorem 3.4 (GP posterior mean consistency).Suppose U ⊂ R is a bounded interval and let κ be the SE kernel.Denote the GP posterior mean, built using x k , y k , and κ (3.8) as μ and the function being emulated as g ∈ H κ (U).Then, for every τ ∈ N, there exist constants h 0 (τ ) and Hκ(U ) = g, g Hκ(U ) .See Wendland (2004, Theorem 11.14) for a more general version of this result that holds when U ⊂ R d and for derivatives of both g and μ.It should be noted that Theorem 3.4 holds when g ∈ H κ (U), i.e. the function of interest lies within the RKHS of the SE kernel.If this is not the case, convergence issues may arise (see Karvonen (2022); Karvonen and Oates (2022)) and one would need to choose an alternative kernel function.For consistency results involving Matérn kernels, see Stuart and Teckentrup (2018).

Error bound for GParareal solutions
Theorem 3.5 (GParareal error bound).Suppose the solvers used in GParareal satisfy Assumptions 3.1, 3.2, and 3.3, and that the conditions required for Theorem 3.4 hold.Then, the absolute error (3.13) of the GParareal solution to the autonomous scalar-valued ODE, i.e. f (t, u(t)) := f (u(t)) in (1.1), at iteration k and time t j satisfies where Proof.First, consider the case 0 j k J.For j = 0, recall that which we in fact know from applying F ∆T to V 0 0 during the prior iteration (i.e.k = 0).Therefore, we have that We can repeat this process iteratively up to j = J to show that Now, consider the case 1 k < j J. Using the update rule (3.14), that F ∆T is the exact solver (3.15), and adding and subtracting the terms g(V k j ) and G ∆T (V j ), we can write Applying the triangle inequality and the definition of g, we obtain On the right hand side, the first term can be bounded using (3.16), the second by (3.17), and the third using Theorem 3.4, yielding the recursion . This recursion can be solved using the initial condition e k j = 0 ∀k j to obtain the desired result.Theorem 3.5 shows that the error is proportional to the fill distance at iteration k and that GParareal will recover the exact solution at time t j after k = j iterations.
Note that this result is rather general in the sense that we consider the fill distance with respect to the entire space U ⊂ R, whereas in reality we would measure the fill distance with respect to a moderately sized compact interval V ⊂ U in which the solution u(t) lies ∀t ∈ [t 0 , T ].Essentially, the accuracy of the GP posterior mean outside of V is inconsequential to the GParareal scheme because the mean will never be evaluated outside of V. Also note, the result will generalise for GParareal applied to systems of ODEs by using norms and the generalised version of Theorem 3.4 in Wendland (2004).

Computational complexity
The complexity of GParareal can be calculated similarly to that of parareal-refer back to Section 2.3 for notation.In GParareal, an additional cost is incurred when (serially) conditioning the emulator on acquisition/legacy data and optimising the hyperparameters.During the kth iteration, up to kJ evaluations of F ∆T −G ∆T have been collected, hence standard cubic complexity GP conditioning scales like O(k 3 J 3 ) in terms of floating point operations (and O(k 2 J 2 ) per hyperparameter).Given a fixed number of time slices J, let T GP (k) represent the total wallclock time taken to condition and optimise hyperparameters of the GP (using up to kJ observations) at iteration k-note this is a strictly increasing function of k.Ignoring serial overheads, we can write down the total wallclock time for GParareal as where T GP := k i=1 T GP (i).The approximate parallel speed-up is then given by Therefore, in addition to the parareal requirements that k be as small as possible and T G T F , GParareal requires that T GP T F in order to maximise parallel speedup.If this is the case, the complexity of GParareal is approximately the same as parareal.
This suggests that if k and/or J are large, then the cost of the emulation may begin to dominate that of the fine solver, limiting the parallel speedup from GParareal, see Section 4. This, however, need not hinder the usability of GParareal for a number of reasons.Firstly, time-parallelisation is typically deployed on problems where additional parallel speedup is needed beyond that achieved by traditional domain decomposition, i.e. spatio-temporal PDEs.This means that if P processors are required for the space-parallel computations of the PDE and J processors for the time-parallel computations, then JP processors are required in total.For moderate to large values of P , only leftover HPC resources are available to exploit time-parallelism and so J typically cannot be chosen very large, somewhat limiting how large T GP can be.Secondly, in the scenario that both T GP and T F are small, one does not need to use a time-parallel method in the first place, as F ∆T can simply be run serially in this case.Thirdly, if both T GP and T F are large or of a similar order, then one can reduce T GP by reducing the number of time slices J, thereby increasing T F at the same time.
Whilst there is no way to control the final value of k obtained by either parareal or GParareal, there are ways of reducing T GP using more efficient non-cubic complexity, emulation methods.For example, one could make use of sparse GPs, parallel matrix inversion methods, or sparse approximate linear algebra techniques (Schäfer et al., 2021) to reduce the cost of evaluating the inverse kernel matrix K(x, x) −1 .One could also reduce T GP by clustering the input data points and training 'local' GPs in parallel (Snelson and Ghahramani, 2007) or instead use inducing points to average over input data points that are located close together in state space (Quiñonero Candela and Rasmussen, 2005;Snelson and Ghahramani, 2006)-see Murphy (2023) for additional methods.To reduce the, often significant, cost of hyperparameter optimisation, one may deploy parallel optimisation routines if available or, as we implement in Section 4, stop the optimisation once additional data no longer improves the hyperparameter estimates.

Generalisation to ODE systems
The methodology in Section 3.1 can be generalised to solve systems of d autonomous ODEs.Accordingly, the correction term we wish to emulate is now vector-valued, i.e.U ⊂ R d , hence we require a vector-valued (or multi-output) GP, rather than a scalar GP.
The simplest approach is to model each output of F ∆T − G ∆T independently, whereby we use d scalar GPs (sharing the same vector-valued inputs in state space) to emulate each output.This requires initialising d GP emulators, each with their own covariance kernel κ i (usually the same for consistency) and corresponding hyperparameters θ i -to be optimised independently using their own respective observation datasets y (i) , i = 1, . . ., d.In this case, the d GP emulators can be conditioned/optimised independently of one another and so we make use of the idle processors to carry out these computations in an embarrassingly parallel fashion to reduce the total GP complexity from O(dk 3 J 3 ) to O(k 3 J 3 ) each iteration-the same as the scalar case.
The more complex approach is to jointly emulate the outputs of F ∆T − G ∆T by modelling cross-covariances between outputs via the method of co-kriging (Cressie, 1993).A number of co-kriging techniques exist (see Álvarez et al. (2011) for a brief overview), one of which is the linear model of coregionalisation that models the joint, block-diagonal, covariance prior using a linear combination of the separate kernels κ i .Prior testing revealed that using this method did not improve performance enough to justify the added complexity, O(d 3 k 3 J 3 ) vs. O(dk 3 J 3 ) in the independent setting (results not reported).Some applications may require correlated output dimensions, hence we note the methodology here for any interested readers.
As a final note, to solve nonautonomous systems of equations, i.e. (1.1), there are two possible approaches.One way is to include the time variable as an extra input to each of the d scalar GPs-this requires a more carefully selected covariance kernel.The other way is to re-write the d-dimensional nonautonomous system as a system of d + 1 autonomous equations and solve as described above-this is the method we use in Section 4.

Numerical experiments
In this section, we present numerical experiments to compare the performance of GParareal and parareal on a number of low-dimensional ODE systems, namely the FitzHugh-Nagumo model, the chaotic Rössler system, a nonautonomous system, and the double pendulum system.MATLAB code for GParareal, parareal, and the GP emulator as used in the experiments of this section can be found at https://github.com/kpentland/GParareal.For simplicity, F ∆T and G ∆T are chosen to be explicit Runge-Kutta methods (RK) of order q, p ∈ {1, 2, 4, 8}, respectively (q p).Let N F and N G denote the number of time steps each solver uses over [t 0 , T ].For these experiments we built our own cubic complexity GP emulator to highlight the effectiveness of GParareal using standard out-the-box methods, postponing the implementation of more efficient and sophisticated emulation methods to a future work.In the multivariate setting (recall Section 3.5), we use a scalar output GP emulator (with isotropic SE covariance kernel) to model each output dimension of F ∆T − G ∆T and assign each one its own processor, reducing the GP emulation costs by a factor of d.Hyperparameter optimisation is carried out at each iteration, stopping when the (maximal) absolute difference between hyperparameters is larger than 10 −2 .The experiments are run on up 512 CPUs.

FitzHugh-Nagumo model
In this experiment, we consider the FitzHugh-Nagumo (FHN) model (FitzHugh, 1961;Nagumo et al., 1962) given by where we fix parameters (a, b, c) = (0.2, 0.2, 3).We integrate (4.1) over t ∈ [0, 40], dividing the interval into J = 40 slices, and set the tolerance for both GParareal and parareal to ε = 10 −6 .We use solvers G ∆T = RK2 and F ∆T = RK4 with N G = 160 and N F = 1.6 × 10 5 steps respectively.In Figure 4.1(a), we solve (4.1) with initial condition u 0 = (−1, 1) using both algorithms.Observe that the accuracy of GParareal is of approximately the same order as the solution obtained     To compare the convergence of both methods more broadly, we solve (4.1) for a range of initial values.The heatmap in Figure 4.2(a) illustrates how the convergence of parareal is highly dependent, not just on the solvers in use, but also the initial values at t = 0, taking anywhere from 10 to 15 iterations to converge.For some initial values, parareal does not converge at all, with solutions blowing up (returning NaN values) due to the low accuracy of G ∆T .In direct contrast, see Figure 4.2(b), GParareal converges more quickly and more uniformly due to the flexibility provided by the emulator, taking just five or six iterations to reach tolerance for all the initial values tested.This demonstrates how using an emulator can enable convergence even when G ∆T has poor accuracy.
Until now, GParareal simulations have been carried out using only acquisition data.In Figure 4.3, we demonstrate how GParareal can use both acquisition and legacy data to converge in fewer iterations than without the legacy data.Approximately kJ = 5 × 40 = 200 legacy data points, obtained solving (4.1) for u 0 = (−1, 1) , are stored and made available to the GP emulator when solving (4.1) for alternate initial values u 0 = (0.75, 0.25) .In Figure 4.3(a), we can see that convergence takes two fewer iterations with the legacy data than without.Accuracy of the solutions obtained from these simulations is again shown to be of the order of the parareal solution in both cases-see Figure 4.3(b).Repeating the experiment from Figure 4.2(b) with the same legacy data for a range of initial values we see that k is either unchanged or improved in all cases, see Figure 4.4.It should be noted that conditioning the GP and optimising hyperparameters using the legacy data comes at extra (serial) computational cost and checks should be made to ensure that T F T GP .These results illustrate that using GParareal (with or without legacy data) we can solve and evaluate the dynamics of the FHN model in significantly lower wallclock time than parareal.
In this experiment, rather than obtaining legacy data by solving (4.2) using alternative initial values (as we did in Section 4.1), we instead generate the data by integrating over a shorter time interval.This is particularly useful if we are unsure how long to integrate our system for, i.e. to reach some long-time equilibrium state or reveal certain dynamics of the system, as is the case in many real-world dynamical systems.For example, many dynamical systems that feature random noise may exhibit metastability, in which trajectories spend (a long) time in certain states (regions of phase space) before transitioning to another state (Grafke et al., 2017 Legoll et al., 2021).Such rare metastability may not be revealed/observed until the system has been evolved over a sufficiently large time interval.We propose integrating over a 'short' time interval, assessing the relevant characteristics of the solution obtained, and then integrating over a longer time interval (using the legacy data) if required.Note that to do this, all parameters in both simulations must remain the same, with the exception of the time step widths-to ensure the legacy data is usable in the GP emulator in the longer simulation.Suppose we solve (4.2) over t ∈ [0, 170], then we need to reduce J, N G , and N F by a factor of two, i.e. use J (2) = J/2, N and N (2) F = N F /2 in the shorter simulation.The legacy simulation, integrating over [0,170], takes nine iterations to converge using GParareal (ten for parareal), giving us approximately kJ (2) = 9 × 20 = 180 legacy evaluations of F ∆T − G ∆T (results not shown).Integrating (4.2) over the full interval [0, 340], GParareal converges in four iterations sooner with the legacy data than without-refer to Figure 4.5(c).In Figure 4.5(d) we can see that using the legacy data achieves a higher numerical speedup (3.4×) compared to parareal (1.6×).In Figure 4.5(a) we see the trajectories from each simulation converging toward the Rössler attractor and Figure 4.5(b) illustrates GParareal retaining a similar numerical accuracy to parareal with and without the legacy data.Note the steadily increasing errors for both algorithms is due to the chaotic nature of the Rössler system.

Nonautonomous system
Next, we consider a nonautonomous system of ODEs to demonstrate how GParareal handles explicit time dependence.We solve over t ∈ [−20, 500]-adapted from Trefethen et al. (2017).As described in Section 3.5, we transform this two-dimensional nonautonomous system into a three-dimensional autonomous system by introducing an additional variable u 3 (t) = t, where du 3 /dt = 1.Given that we know u 3 (t) explicitly, the third dimension of F ∆T − G ∆T need not be modelled with a GP.However, given the GPs are run in parallel anyway, this does not add to the cost of running GParareal.We select solvers G ∆T = RK1 and F ∆T = RK8 with N G = 2048 and N F = 5.12 × 10 5 steps, respectively.We use J = 32 time slices, initial condition u 0 = (0.1, 0.1, −20) , and a stopping tolerance of ε = 10 −6 .In Figure 4.6, we plot the solutions and corresponding errors generated by each of the solvers over time.Again, the results illustrate good convergence to the fine solver solution, with GParareal taking 10 iterations to locate the solution and parareal taking 20.We suspect that the superior performance of GParareal is partially due to the almost periodic nature of the solutions in Figure 4.6(a), enabling the emulator to reproduce the dynamics of F ∆T − G ∆T quite well.
Next, we run a series of simulations to measure the effect of increasing the number of time slices J (and hence processors) on convergence, wallclock times, and speedup-see Table 4.1.
To do this, we increase the number of fine time steps to N F = 5.12 × 10 10 , so that F ∆T is sufficiently expensive to observe speedup.We observe a good match between the numerical and theoretical results, presented in the top and bottom tables of Table 4.1, respectively, and visualised graphically in Figure 4.7.Firstly, notice that k para increases with J whilst k GPara remains largely unaffected, leading to speedups for GParareal being roughly 2× to 4× that of parareal, up to J = 256.For both algorithms, the cost of T G and T F decreases as J increases (due to fewer time steps per time slice), whilst T GP increases (due to increasing numbers of data points each simulation).Note the exception of T GP = 1.02E2 when J = 256 because hyperparameter optimisation converged within a few iterations and was therefore not carried out after this.Up to J = 256, T GP < T F and so we observe increasing parallel speedup for GParareal.When J = 512, the cost of the GP overtakes that of F ∆T and so parallel speedup decreases, albeit still being double that of parareal.Recall that if T GP > T F , we may not opt to use GParareal in the first place, for the reasons outlined in Section 3.4.{32, 64, 128, 256, 512}. (a) Wallclock times using the fine solver (dashed black), GParareal (dashed blue), and parareal (dashed red).Corresponding theoretical results are given by the solid lines, calculated using (2.5) and (3.18) for parareal and GParareal, respectively.(b) The corresponding speedup results using the same lines and colours.The theoretical results were calculated using (2.6) and (3.19) for parareal and GParareal, respectively.

Double pendulum system
Consider now the double pendulum system: a simple pendulum of mass m, rod length , connected to another simple pendulum of equal mass m, rod length , acting under gravity g (see Figure 4.8).Four ODEs govern the dynamics of this system: where (Danby, 1997).Note that m, , and g have been scaled out of (4.4) by letting = g.The variables u 1 and u 2 measure the angles between each pendulum and the vertical axis, while u 3 and u 4 measure the corresponding angular velocities.
For this experiment, we select solvers G ∆T = RK1 and F ∆T = RK8 with N G = 3072 and N F = 2.1504 × 10 5 steps, respectively.We integrate over t ∈ [0, 80], using J = 32 time slices with a stopping tolerance ε = 10 −6 .In Figure 4.9, we plot solutions for u 1 and u 2 over time using initial conditions u 0 = (2, 0.5, 0, 0) , i.e. the pendulums are positioned at some (positive) initial angles and released from rest.Observe how both pendulums move chaotically, with the inner pendulum oscillating within [−π, π] and the outer pendulum oscillating between odd multiples of π, "turning over" a number of times.We attain good solution accuracy from GParareal with respect to the fine solution with errors slowly increasing over time due to the chaotic nature of the system, much like what was seen in the Rössler experiments in Section 4.2.We plot k for various initial angles in Figure 4.10 to highlight the system's sensitivity to initial conditions.For small initial angles, GParareal converges sooner than parareal, but for much larger angles both algorithms use almost all of the 32 iterations to locate a solution (and in some cases, parareal does not return a solution).
In Table 4.2 and Figure 4.11, we again test the effect of increasing J on wallclock times, speedup, and convergence.To do this, we increase the number of fine time steps to N F = 2.1504 × 10 10 .We purposefully choose an initial condition (u 0 above) for which both algorithms converge in approximately the same number of iterations, so that we can directly observe how the increasing GP cost affects the performance of GParareal for large J.Under these circumstances, we can think of the wallclock time for GParareal as (approximately) the wallclock time of parareal plus the wallclock time of the GP conditioning/optimisation. For J 128, we observe that T GP < T F and so the speedup of GParareal and parareal are approximately the same.In these cases, using GParareal is no more costly than using parareal, with the additional benefit of being able to   re-use the acquisition data for a future simulation, if needed.For J 256, we begin to observe T GP ≈ T F (or larger), so the numerical speedup of GParareal begins to plateau.We should reiterate, however, that using so many processors for such a small test problem is quite excessive.Figure 4.11.Numerical results obtained solving the double pendulum system (4.4) for J ∈ {32, 64, 128, 256, 512}.(a) Wallclock times using the fine solver (dashed black), GParareal (dashed blue), and parareal (dashed red).Corresponding theoretical results are given by the solid lines, calculated using (2.5) and (3.18) for parareal and GParareal, respectively.(b) The corresponding speedup results using the same lines and colours.The theoretical results were calculated using (2.6) and (3.19) for parareal and GParareal, respectively.

Discussion
In this paper, we present a time-parallel algorithm (GParareal) that iteratively locates a numerical solution to a system of ODEs.It does so using a predictor-corrector, comprised of numerical solutions from coarse (G ∆T ) and fine (F ∆T ) integrators.However, unlike the classical parareal algorithm, it uses a Gaussian process (GP) emulator to infer the correction term F ∆T −G ∆T .The numerical experiments reported in Section 4 demonstrate that GParareal performs favourably compared to parareal, converging in fewer iterations and achieving increased parallel speedup for a number of low-dimensional ODE systems.We also demonstrate how GParareal can make use of legacy data, i.e. prior F ∆T and G ∆T data obtained during a previous simulation of the same system (using different ICs or a shorter time interval), to pre-train the emulator and converge even faster-something that existing time-parallel methods cannot do.In Section 4.1, using just the data obtained during simulation (acquisition data), GParareal achieves an almost two-fold increase in speedup over parareal when solving the FitzHugh-Nagumo model.Simulating over a range of initial values, GParareal converged in fewer than half the iterations taken by parareal and, in some cases, managed to converge when the coarse solver was too poor for parareal.When using legacy data, GParareal could converge in even fewer iterations.Similar results were illustrated for the Rössler system in Section 4.2 but with legacy data obtained from a prior simulation over a shorter time interval-beneficial when one does not know how long to integrate a system for.In Sections 4.3 and 4.4, GParareal was tested on a larger number of processors (up to 512), verifying the theoretical computational complexity results given in Section 3.4 and that the cost of the GP needs to be much smaller than the cost of the fine solver in order for speedup to be maximised.In all cases, the solutions generated by GParareal were of a numerical accuracy comparable to those found using parareal.
In its current implementation, GParareal may, however, suffer from the curse of dimensionality in two ways.First, an increasing number of data points, O(kJ), is problematic for the standard cubic complexity GP implemented here.In this case, a more sophisticated (non-cubic complexity) emulator or perhaps using neural networks could be beneficial.Second, trying to emulate a d-dimensional function F ∆T − G ∆T is difficult if the number of evaluation points is not sufficient.One option to tackle this may be to obtain more acquisition data by launching more F ∆T and G ∆T runs using the idle processors to further train the emulator at little additional computational cost.However, as shown in Section 3.3, the accuracy of the GP emulator is strongly controlled by the fill distance of the set of evaluation points, which is generally difficult to restrict when d is large.One could think about using legacy data generated by evaluating F ∆T − G ∆T at specific input locations (for example, a uniform grid) that satisfy certain fill distance requirements in the state space.
It should also be noted that GParareal may not always provide faster convergence using legacy data if such legacy evaluations of F ∆T − G ∆T lay 'far away', i.e. over one or two input length-scales away, from the initial values of interest in the current simulation.In this case, GParareal would rely more heavily on its acquisition data.There is no immediate remedy for such a situation, but using a fallback parareal correction, as suggested in the next paragraph, could be an option.
In equation (3.11), we approximate a Gaussian distribution by taking its expected value, ignoring uncertainty in the GP posterior for F ∆T − G ∆T .In this setting, the GP emulator is used to interpolate the F ∆T − G ∆T data, hence it is perfectly acceptable to swap it out for any other sufficiently accurate interpolation method, e.g.kernel ridge regression (Kanagawa et al., 2018).During early iterations of GParareal, when little acquisition data are available, the uncertainty in the GP posterior (i.e the variance) may be large at points of interest.By retaining the GP posterior uncertainty, one could (ideally) propagate the full uncertainty using the coarse solver to the next time step and then continue.While this would produce a probabilistic version of GParareal, this would be a computationally expensive process that we wish to avoid at this stage.One alternative to approximating (3.11) by its expected value could be to draw a random sample instead.A sampling-based solver such as this would return a stochastic solution to the ODE, much like the stochastic parareal algorithm presented in Pentland et al. (2022).It is unclear how this algorithm would perform vs. parareal (or even stochastic parareal), however, it could still make use of legacy data following successive independent simulations.Another possible alternative to approximating (3.11) arises if the input initial value is at least one or two length-scale distances away from any other known input value in our acquisition dataset.In this case, we then might expect the GP emulation of the mean in (3.11) to have high variance and so a fallback to the deterministic parareal correction for F ∆T − G ∆T (see (2.3)) could be built in as a next best correction to the coarse prediction in (3.11).Among others, these are two alternative formulations of GParareal that are worth investigating to account for the whole Gaussian distribution provided by the emulator and not just its mean value.
Follow-up work will focus on extending GParareal, using some of the methods suggested above, to solve higher-dimensional systems of ODEs in parallel (possibly PDEs).In the longer term, we aim to develop a truly probabilistic time-parallel numerical method that can account for the inherent uncertainty in the GP emulator, returning a probability distribution rather than point estimates over the solution.
funding from the Euratom research and training programme 2014-2018 and 2019-2020 under grant agreement No. 633053.The authors would also like to acknowledge the University of Warwick Scientific Computing Research Technology Platform for assistance in the research described in this paper, in particular Arkadiy Davydov.The views and opinions expressed herein do not necessarily reflect those of any of the above-named institutions or funding agencies.For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising.

Figure 2 . 1 .
Figure 2.1.Computational task scheduling during three iterations of parareal compared with a full serial integration.The coloured blocks represent the wallclock time any given processor spends on a task.Coarse runs are shown in yellow, fine runs in blue, and any idle time in white.The wallclock time is given on the axis at the bottom, indicating both T para and T serial .

Figure 3 . 1 .
Figure 3.1.Schematic of the first iteration of GParareal.The 'exact' solution over [t 0 , t 3 ] is shown in black, with the first coarse and fine (parallel) runs given in yellow and blue respectively.Solid bars represent the residual between these solutions (3.6).The predictions, i.e. the second coarse runs, are shown in red and the corresponding corrections from the GP emulator are represented by the dashed bars.The updated solutions (3.12) at the end of the iteration are represented by the red dots.Note the black and blue lines in [t 0 .t 1 ] should overlap but have been made not to for clarity.
Figure 4.1.Numerical results obtained solving the FHN model (4.1) for u 0 = (−1, 1) .(a) Time-dependent solutions using the fine solver, GParareal, and parareal-both GParareal and parareal plotted only at times t for clarity.(b) The corresponding absolute errors between solutions from GParareal and parareal vs. the fine solution.(c) Maximum absolute errors (2.4) of each algorithm at successive iterations k until tolerance ε = 10 −6 is met.(d) Median wallclock times (taken over 5 runs) of both algorithms against the number of processors (up to 40).The inset plot displays the corresponding parallel speedup.

Figure 4 . 2 .
Figure 4.2.Heat maps displaying the number of iterations taken until convergence k of (a) parareal and (b) GParareal when solving the FHN model (4.1) for different initial values u 0 ∈ [−1.25, 1.25] 2 .Black boxes indicate where parareal returned a NaN value during simulation.

Figure 4 . 5 .
Figure 4.5.Numerical results obtained solving the Rössler system (4.2) over t ∈ [0, 340].(a) Solutions from the fine solver, GParareal (with legacy data), and parareal plotted in the (u 1 , u 2 )-plane-both GParareal and parareal plotted only at times t for clarity.(b) The corresponding absolute errors between solutions from GParareal and parareal vs. the fine solution.(c) Maximum absolute errors (2.4) of each algorithm at successive iterations k until tolerance ε = 10 −6 is met.(d) Median wallclock times (taken over 5 runs) of each simulation against the number of processors (up to 40).The inset plot displays the corresponding parallel speedup.
Figure 4.6.Numerical results obtained solving the nonautonomous system (4.3).(a) Timedependent solutions using the fine solver, GParareal, and parareal-both GParareal and parareal plotted only at times t on [−20, 150] for clarity.(b) The corresponding absolute errors between solutions from GParareal and parareal vs. the fine solution, having converged after 10 and 20 iterations, respectively.

Figure 4
Figure 4.8.A schematic of the double pendulum system.
Figure 4.9.Numerical results obtained solving the double pendulum system (4.4).(a) Timedependent solutions for u 1 and u 2 using the fine solver, GParareal, and pararealboth GParareal and parareal plotted only at times t for clarity.Dashed lines indicate "turning over" angles, at which either pendulum passes through an odd multiple of π.(b) The corresponding absolute errors between solutions from GParareal and parareal vs. the fine solution, having converged after 23 and 22 iterations, respectively.