GPU acceleration of splitting schemes applied to differential matrix equations

We consider differential Lyapunov and Riccati equations, and generalized versions thereof. Such equations arise in many different areas and are especially important within the field of optimal control. In order to approximate their solution, one may use several different kinds of numerical methods. Of these, splitting schemes are often a very competitive choice. In this article, we investigate the use of graphical processing units (GPUs) to parallelize such schemes and thereby further increase their effectiveness. According to our numerical experiments, large speed-ups are often observed for sufficiently large matrices. We also provide a comparison between different splitting strategies, demonstrating that splitting the equations into a moderate number of subproblems is generally optimal.


Introduction
We are interested in differential matrix equations of Lyapunov or Riccati type, or generalized versions of these.They are all of the form Ṗ = A T P + P A + Q + G(P ), where A ∈ R n×n and Q ∈ R n×n are given matrices and G is a matrix-valued function of P .For differential Lyapunov equations (DLE) we have G(P ) = 0 and for differential Riccati equations (DRE) we have G(P ) = −P BR −1 B T P with two given matrices B ∈ R n×m and R ∈ R m×m .Such equations occur frequently in many different areas, such as in optimal/robust control, optimal filtering, spectral factorizations, H ∞ -control, differential games, etc. [1,5,27,37].Perhaps the most relevant setting is the linear quadratic regulator (LQR) problem.There, the aim is to optimize a finite-time cost function of the form J(u) = T 0 y(t) 2 + u(t) 2 dt, T ≥ 0, under the constraints that ẋ = Ax + Bu (state equation) and y = Cx (output equation, with C ∈ R p×n ).In this case, the solution to the DLE with Q = C T C gives the observability Gramian of the system, which characterizes the relevant states x for the input-output mapping u → y.The solution of the DRE, on the other hand, provides the optimal input that minimizes J, in state feedback form.In fact, if P solves the DRE with Q = C T C then the optimal input u opt is given by u opt (t) = −R −1 B T P (T − t)x(t).
For the generalized DLE and DRE versions, an additional linear term SP S T appears in G(P ), where S ∈ R n×n is a given matrix.Such equations also arise in the LQR setting, when a stochastic perturbation of multiplicative type is included in the state equation.
In recent years, a number of numerical methods have been suggested for largescale DLEs, DREs and related equations.The classic ones, low-rank versions of BDF and Rosenbrock schemes [14,15,30] are usually outperformed by more modern methods such as Krylov-based projection schemes [28], peer methods [29] or splitting schemes [31,41,42].In this paper, we focus on splitting schemes.These methods lower the computational cost by dividing the problem into simpler subproblems such as Ṗ = A T P + P A and Ṗ = Q and then solve these separately, in sequence.While the splitting of course introduces an additional error, this is generally compensated by the decreased computational cost and leads to large speed-ups.
The hypothesis to be investigated in this paper is that utilizing a graphical processing unit (GPU) to parallelize the schemes may further greatly increase the efficiency.Such speed-ups have already been observed for other related methods for DREs [9,10,11] as well as for their steady-state versions; the algebraic Lyapunov and Riccati equations [11,12].In the just mentioned cases, the basic building block of the schemes is the computation of the matrix sign function, which requires the inversion of a large dense matrix.In a splitting scheme, the basic building block is instead the computation of the action of a matrix exponential on a skinny matrix.Speed-ups have previously been observed for applications where matrix exponentials are multiplied by vectors [3,21], see also [22].In these works, a speedup is generally not observed for "small" matrices (n 1000), and the speed-up is of limited size when the matrices are sparse rather than dense.As we are typically interested in at least medium-sized problems (1000 n 10000) we do expect to see a significant speed-up.Moreover, while we are necessarily considering the sparse case, we are not simply computing the action of the matrix exponential on vectors, but on skinny block matrices.This increases the parallelizability of the problem and makes the sparsity issues noted in e.g.[23,7,21] less relevant.
Since the relevant methods are mainly implemented in Matlab, we restrict ourselves to utilizing its built-in GPU support [38] via NVIDIA's CUDA [34] parallel programming interface.We do not claim that this approach leads to the best possible performance.The point is rather to demonstrate that quite simple changes to the implementations of the splitting schemes may lead to much better performance, when one has access to a GPU.Our results already show a remarkable improvement in efficiency, and this can only increase with further optimisations and the use of more advanced techniques tailored to specific problems.
In addition, we provide comparisons between different splitting strategies for DLEs and DREs.We particularly address questions that naturally arise while solving these equations by splitting methods.E.g., should the DLE be split at all?Should the DRE be split into two or three subproblems?Our results in this direction demonstrate that it is usually beneficial to use the smallest number of splits.However, when Q is sufficiently small it is beneficial to split it too, since the extra error is similarly small and the subproblems Ṗ = A T P + P A and Ṗ = Q are very cheap to compute compared to Ṗ = A T P + P A + Q.
An outline of the article is as follows.In Section 2 we review the idea behind splitting schemes and apply them to all the mentioned equation types.Then we consider implementation details and the simple changes necessary for GPU utilization in Section 3. We present the results of several numerical experiments in Section 4, and summarize our conclusions in Section 5.

Splitting schemes
Splitting schemes are numerical methods that are applicable to differential equations that have a natural decomposition into two (or more) parts; Ṗ = F (P ) = F 1 (P ) + F 2 (P ), P (0) = P 0 .
With "natural decomposition" we mean that the subproblems Ṗ = F 1 (P ) and Ṗ = F 2 (P ) are either simpler or cheaper to solve than the full problem Ṗ = F (P ).This is the case in many problems, with the most common example being reaction-diffusion equations ẋ = ∆x + f (x).In this case, there are highly optimized methods for the pure diffusion problem ẋ = ∆x, while the subproblem ẋ = f (x) often turns into a local rather than global problem -i.e. it is enough to solve ẋi = f (x i ) for every discretization point x i .In the following, we denote the solution to Ṗ = F k (P ), P (0) = P 0 , by P (t) =: T k (t)P 0 .
The most basic and commonly used (exponential) splitting schemes are the Lie and Strang splittings.They are given by the time stepping operators respectively, where h is the time step.Of course, the roles of T 1 and T 2 might be interchanged.The schemes are then defined by with P L 0 = P S 0 = P 0 .Here, P L k and P S k both approximate P (kh).The Lie splitting is first-order accurate while Strang splitting is second-order accurate under certain conditions on F 1 , F 2 and F , see e.g.[26].For simplicity, we restrict ourselves to the Strang splitting scheme in this paper, but one might also consider higher-order schemes [20,24,42], or schemes where the subproblems are not solved exactly, see e.g.[26,25].
Clearly, one might continue the splitting procedure if the system is naturally decomposed into more than two parts.If Ṗ = F 1 (P ) + F 2 (P ) + F 3 (P ) then applying the Lie and Strang splitting schemes twice leads to the schemes Again, the roles of T 1 , T 2 and T 3 might be interchanged.Different compositions with a possibly higher number of operators might also be considered, in order to optimize the structure of the error.We refer to [4] but do not consider such methods here.Like essentially every other method for solving differential matrix equations, the splitting schemes need to make use of low-rank structure in order to be competitive in the large-scale setting.This means that we can expect the singular values of the symmetric, positive semi-definite solution P to decay rapidly, see e.g.[2,6,8,36,40], and thus we can factorize P ≈ LDL T for L ∈ R n×r , D ∈ R r×r with r ≪ n.By formulating the methods to only operate on L and D and never explicitly form the product LDL T , we drastically lower both the memory requirements and the computational cost.
In the following, we outline different splitting strategies for all the matrix equations mentioned so far, and also review how to low-rank-factorize each arising subproblem.
2.1.Differential Lyapunov equations.As a first example, we consider the differential Lyapunov equation Ṗ = A T P + P A + Q, P (0 Here we may choose F 1 as the linear part and F 2 as the constant term, i.e.F 1 (P ) = A T P + P A and F 2 (P ) = Q.
These subproblems can be solved explicitly and the solutions at time h are given by T 1 (h)P 0 = e hA T P 0 e hA , T 2 (h)P 0 = P 0 + hQ.
It is easily seen that if we have the LDL T -factorizations Q , then we can also factorize these solutions as We could also note that the exact solution to the full problem is given by ( 4) where the integral term may be approximated by high-order quadrature as in [41].While this does not result in a splitting scheme of the form described above, we still include it in our experiments due to its similarity and efficiency.

Differential Riccati equations.
A second example is given by the differential Riccati equation: Ṗ = A T P + P A + Q − P BR −1 B T P, P (0) = P 0 .
In this case, we can either split in three terms; F 1 (P ) = A T P + P A, F 2 (P ) = Q, and F 3 (P ) = P BR −1 B T P, or two terms 1 ; F 12 (P ) = A T P + P A + Q, and F 3 (P ) = P BR −1 B T P.
The latter was advocated in [41,42] because (experimentally) the error constant in the three-term splitting is much larger.However, the three-term splitting does not need to approximate the integral term, and thus the larger error might be compensated by a lower computational cost.
In either case, we note that the solution at time h to the problem Ṗ = F 3 (P ), P (0) = P 0 , is given explicitly by (6) T 3 (h A low-rank factorization is given by Note that the I in this equation is not the same identity matrix as in the previous equation, because the DL T -part of P 0 has moved.We thus only need to solve a small linear equation system.
2.3.Generalized Lyapunov equations.We further consider a generalized Lyapunov equation of the form We again split the equation and obtain three subproblems defined by2 F 1 (P ) = A T P + P A, F 2 (P ) = Q, and F 4 (P ) = SP S T .
The first two subproblems are handled as before, whereas we approximate T 4 (h)(P ) by the midpoint rule, analogously to what is done in [19]: Given P 0 = L 0 D 0 L T 0 , we get T 4 (h)P 0 ≈ LDL T , where where blkdiag is the block diagonal operator that puts its block arguments on the diagonal of an otherwise zero matrix.We note that when using a second-order splitting scheme like the Strang splitting, it is necessary to use a second-order method like the midpoint rule in order to preserve the overall convergence order.If we use instead a first-order scheme like the Lie splitting, it is sufficient to approximate Ṗ = F 4 (P ) by e.g. the explicit Euler method.
2.4.Generalized Riccati equations.Moreover, we study a generalized Riccati equation given by Ṗ = A T P + P A + Q + SP S T − P BR −1 B T P, P (0) = P 0 , (8) and split this equation into three subproblems of the form F 12 (P ) = A T P + P A + Q, F 3 (P ) = −P BR −1 B T P, and F 4 (P ) = SP S T .
These subproblems are solved similarly as in the previous subsections.We do not consider a four term splitting since the extra error due to the splitting would become prohibitively large.

GPU considerations
In this section we briefly describe the GPU implementations of the Strang splitting applied to the differential matrix equations discussed in Section 2. All the algorithms are implemented in Matlab 2017a R using the Parallel Computing Toolbox.As there exist many built-in GPU-based functions, implementing algorithms on the GPU is quite user-friendly in recent releases of Matlab, see e.g.[38].In order to avoid unnecessary communication between the CPU and the GPU, we first move all the data to the GPU, solve the equations on the GPU and transfer the results back to the CPU.These two steps are accomplished by the Matlab commands gpuArray and gather, respectively.
3.1.Action of the matrix exponential.In all the considered equations, the most demanding part is to compute the action of the matrix exponential in (2) efficiently.In [17,18] the authors considered an algorithm based on Leja interpolation and showed that applying the algorithm to a matrix derived from a spatial discretization of a differential operator is very efficient.We will thus use this method to compute e hA L for different skinny matrices L. In the following, we denote this algorithm by expleja.
We briefly sketch the implementation of the method and refer to [17,18] for details.As a first step, the spectrum of the matrix A is estimated by the Gershgorin disk theorem.After moving A to the GPU by use of gpuArray, we take advantage of GPU-based built-in functions like diag, sum and abs and implement this algorithm equivalently to the CPU version.We proceed similarly with the computation of the parameters for the exponential interpolation, see [17].The most time consuming part of the algorithm is the Newton interpolation that one has to perform at every Leja point.Here we profit from the efficient matrix-vector and matrix-matrix multiplication for large-scale matrices on the GPU, and the property that the multiplication of two GPU arrays is again stored on the GPU.Thus, the GPU implementation pays off particularly in this section of the code and we never have to copy the matrices from the GPU to the CPU and vice versa.

3.2.
Computing the matrix equations on the GPU.In this subsection we will explain the methods used to solve the differential matrix equations on the GPU.First, we consider the DLE solved by the Strang splitting approach.We assume that the matrices are stored on the CPU, hence as a first step we copy these matrices to the GPU.Then, we can apply the steps from the CPU implementation as described in [42], computing the actions of the matrix exponential by Leja interpolation.A detailed description is given in Algorithm 3.1.

Algorithm 3.1 Solving DLE by Strang splitting on the GPU
Q and P 0 = LDL T and copy matrices to GPU using gpuArray.On the other hand, as mentioned in Subsection 2.1 it is possible to derive an explicit form of the solution of the DLE given by (4).Hence, following [42] we compute an approximation of the solution as given in Algorithm 3.2 using a quadrature rule to estimate the integral.
We note that in both Algorithm 3.1 and Algorithm 3.2 there is a so-called column compression step.This refers to the procedure of discarding (almost) linearly dependent columns from L, and serves to keep the number of columns in the approximations small.Without such a step, each iteration of Algorithm 3.1 (for example) would add the columns in L Q to L, while the rank would likely stay similar.The Algorithm 3.2 Solving DLE by Quadrature Rule on the GPU 1: Given: A, Q, P 0 , T , N t , h = T Nt .2: Repeat Steps 2 and 3 from Algorithm 3.1.3: Approximate integral: • Compute n nodes s k and weights w k of quadrature formula; compression can be performed in various ways, usually by computing either a reduced rank-revealing QR factorization or a reduced SVD [30].Here, we employ a reduced SVD factorization, followed by a diagonalization of the small resulting system.It is cheap as long as the rank of the solution stays low, which is the case in many applications.
As noted in Section 2, we also want to approximate the solutions to DREs and generalized DLEs and DREs.Therefore, we further have to solve the subproblems given by F 3 and F 4 .Pseudo-code for these computations, based on the low-rank factorizations given in Section 2.2-2.3, is shown in Algorithms 3.3 -3.4.
Further, we consider the three-term splitting where we compute the two terms T 1 (h)P 0 and T 2 (h)P 0 as in Algorithm 3.1.Finally we reverse the order of the three-term splitting Due to the additional splitting term, further errors are introduced, but since the integral does not have to be computed the three-term splitting codes are less computationally demanding.
The generalized DLE can be solved by the same three approaches.Using Algorithm 3.4, T 3 is replaced by T 4 in the previous three formulas.Finally, we apply a three-term Strang splitting to the generalized DRE, given by is applied.

Numerical experiments
The aim of this section is to show the different splitting strategies applied to the matrix equations, using the algorithms described in the previous section.We will first show the accuracy of the splitting schemes on a small-scale example, where we can compute an accurate reference solution by a Matlab built-in solver.Then, we show the speed-up of the code of the GPU implementation in comparison to the CPU implementation and compare the efficiencies of the various methods.Finally, we consider two real-world medium-to large-scale examples for DLE and DRE, respectively, and demonstrate that GPU acceleration is similarly advantageous there.
4.1.Small-scale accuracy verification.We show the results of the algorithms on a small example.Consider the Laplacian on the unit square with homogeneous Dirichlet boundary conditions.By discretizing it using central second-order finite differences with n grid points in each space dimension, we acquire a matrix A ∈ R n 2 ×n 2 .We let Q and P 0 be randomly chosen matrices of rank 2 and rank 5, respectively.The tolerance for both the column-compression and the Leja interpolation were set to 10 −16 .
Figure 1 shows an order plot for the Strang splitting applied to the DLE (1).The reference solution is computed by the Matlab routine ode15s with relative tolerance 2.22 • 10 −14 and absolute tolerance 10 −20 .We take as final time T = 1 2 and n = 5.Next, we apply the Strang splitting to the DRE with A, Q and P 0 as defined previously, B as a randomly chosen matrix of size n 2 × 1 and either R −1 = 1 or R −1 = 10 −3 , see Figure 2. The DRE is solved by the three approaches introduced in the previous section.In the following, we will denote these by "2 term", "3 term F 1 F 2 F 3 " and "3 term F 1 F 3 F 2 ".We see from Figure 2  F 1 F 3 F 2 is less accurate, whereas the errors of the two remaining splitting schemes behave similarly.Thus, the error due to splitting away the part F 3 is more severe than splitting F 1 and F 2 .However, using R −1 = 10 −3 leads to a different result.
The three-term splittings now yield roughly equally large errors, but the two-term splitting is about 10 times more accurate than the other schemes.Here we clearly observe the additional error introduced by the third splitting term.Finally, we consider for the generalized matrix equations an example introduced in [13], where the matrix A denotes again the discretized 2D Laplacian on the unit square with homogeneous Dirichlet boundary conditions on two edges.On the third edge, we use the fixed boundary condition given by x = u, and on the final edge a Robin boundary condition n∇x = 0.5(0.5 + dW )x is applied.This leads to a matrix B ∈ R n 2 ×1 and a matrix S ∈ R n 2 ×n 2 .The matrix Q = CC T is defined by letting C = 1 n 2 (1, . . ., 1) be the matrix representation of the mean.We then solve the generalized Lyapunov equation (7) and show the corresponding error plot in Figure 3 (left).Moreover, we take R = 1 and compute also the relative error of the solution of the generalized Riccati equation, see Figure 3 (right).We again see that the two-term splitting of the generalized DLE is approximately 10 times more accurate than the other two splitting schemes.As in all previous examples, we observe that the error of the generalized DRE behaves as expected, i.e. it converges with order 2 and remains small for all step sizes.4.2.GPU speed-up.We compare the computational costs of the GPU based algorithm described above running on a Tesla K80 with 2 × 12 GB RAM, with the algorithm which operates only on the CPU (Intel Xeon E5-2630).We compute the solution until T = 1 2 with step size h = 0.005 for different sizes of the matrix A and repeat the calculation 50 times to get an accurate result.We build the matrices as described in Subsection 4.1.All of the experiments are performed on Matlab on the same platform.In order to have a fair comparison between the different implementations, we run the Matlab codes on a single core by using the command -singleCompThread.Moreover, we deactivate the Java Virtual Machine by -nojvm.The computing time of the CPU algorithm is estimated by the command tictoc.For the GPU algorithm we use the wait function with a GPUDevice object as input and measure the time by tic -toc and wait.time in seconds  In Figure 4 (left) the differential Lyapunov equation is solved with the splitting scheme as described in Section 2.1, see also Algorithm 3.1, and by the non-splitting scheme, where we compute the integral by a high-order Gauss quadrature rule as in Algorithm 3.2.The computing time is given as a function of the size of the matrix A. We observe that for small matrices the CPU implementation is less time consuming than the GPU implementation.However, if the size of the problem exceeds 10 3 × 10 3 , it pays off to use the GPU implementation.We observe a speedup of over a factor 280 for matrices of size 10 4 × 10 4 , for both splitting schemes.We further note that the time complexity of the splitting scheme is comparable to the non-splitting part.A similar behaviour can be seen in Figure 4 (right), where we plot the measured time of the splitting schemes of the DRE.Again, a major speedup of the implementation on the GPU is detected for these medium-scale problems.(For matrices of size 10 4 × 10 4 the GPU implementation is approximately 300 times faster).We expect an even larger performance gain for truly large-scale problems.
We see no difference between the two three-term splitting schemes regarding time measurements, but the two-term splitting is slightly slower (as expected).The lack of speed is often more than compensated by a higher accuracy, however, as shown in the efficiency plot in Figure 5.This plots the relative error against the required computation times, and thus the further left and down in the plot, the "better".We observe that the three-term schemes are most efficient for all error levels when R = 1, while the two-term splitting is more efficient for small error levels when R = 10 −3 .This makes sense, as the former case means that Q is relatively small and splitting it away therefore yields a small splitting error.In the latter case, Q is relatively big, and should not be split.(Because we are considering a very small-scale example here, the CPU versions of the code outperform the GPU in all cases.)relative error relative error The GPU-based codes also exhibit better performance for the generalized matrix equations, see Figure 6.We consider here a two-term splitting of the generalized DLE as well as two different three-term splittings, and compare the computational costs of these schemes as well as the difference between CPU and GPU implementation.We use the matrices for the equations that were introduced in the previous subsection.As mentioned in Section 2, we employ a three-term splitting for the generalized DRE.We do not consider a four-term splitting here, since the error constants can be expected to be prohibitively large.4.3.Real-world example: Simulation of El Niño.As a second example for DLEs, we consider the weather phenomenon El Niño.This is characterized by an time in seconds  3. Time measurements for DLE for different matrix sizes.
unusual warming of the sea surface temperature in the Indo-Pacific ocean.It can be modeled by a stochastic advection equation driven by additive noise [35] and its covariance is given by a DLE of the form Ṗ (t) = AP (t) + P (t)A T + Q, see [31,32] for details.The matrix A arises from a centered finite difference approximation of the advection operator and Q is the discretized covariance operator of the random noise.We consider here the different discretization resolutions corresponding to n = 624, 1740, 3900, 7800 and 15600, and approximate the solution either by Algorithm 3.1 or by directly approximating the integral term in (4) as outlined in section 2.1.Figure 7 (left) shows the relative error for the splitting schemes for different step sizes.The reference solution is computed by the non-splitting scheme with a 16 times smaller step size.We show only the results for n = 1740, but the error behaves similarly for the other problem sizes.We observe the second-order convergence of the Strang splitting scheme and note that the relative error is small even for larger step sizes.Similar to the academic example from the previous section, the time complexities of the non-splitting scheme and the splitting scheme are comparable, as shown in Figure 7 (right).The problem size where the GPU parallelization starts to pay off is also similar: at n = 624 the CPU and GPU costs are comparable but at n = 1740 we already observe a speed-up of a factor 6. For the finest resolution, the speed-up is roughly a factor 100. Figure 8 displays the efficiency of the numerical schemes for the El Niño example.As indicated by Figure 7 (right), the GPU and CPU versions of the 2-term splitting are roughly as efficient for the smallest problem.The 1-term "splitting" performs much better, and yields the same error independent of the step size.This happens because we are essentially computing the exact solution to the problem, and the only error is the quadrature error arising from the integral approximation.The same behaviour can also be observed for n = 3900, but here the GPU implementation clearly outperforms the CPU implementation.The 1-term method is again most efficient, but we note that the shape of the curves indicates that the 2-term splitting will become advantageous for either smaller errors or larger problems.4.4.Real-world example: Steel cooling.As a second example for DREs, we consider the optimal cooling of steel profiles.This problem has been widely studied in the literature, for details see [39,16].It gives rise to a DRE of the form The matrices M and A are the mass and stiffness matrices resulting from a finite element discretization of the Laplacian on a non-convex polygonal domain (the steel profile).Q is chosen as CC T , where C is the discretization of an operator that measures temperature differences between different points in the domain.We note that we would normally never explicitly compute the (generally dense) matrices involving M −1 .In the CPU code, we form and reuse an incomplete LU decomposition of M to cheaply solve a linear equation system whenever the action of M −1 or M −T is required.In the GPU code, the issue is unexpectedly complicated by the fact that Matlab's CUDA interface does not support solving equation systems with sparse system matrices and (dense) block right-hand sides.This is supported in the cuSPARSE library of CUDA itself, so until the Matlab interface is extended one might theoretically implement this capability by a MEX extension.In order to demonstrate performance gains by rather easy means, however, we do not do this.Instead, we compute and store a dense LU factorization.This is clearly not viable for truly large-scale problems, but problems of up to size n ≈ 3 • 10 4 are easily possible on our available hardware, and up to n ≈ 5.5 • 10 4 if AM −1 is explicitly formed at a slightly higher initial cost.Despite the heavy additional memory requirement, the GPU parallelization leads to a significant speed-up.
An additional issue related to the mass matrix is the original Leja point interpolation method for the computation of matrix exponential actions.One of the main steps of this algorithm computes an estimate of the spectrum of A by the use of Gershgorin discs.Since computing these require direct access to the elements in A, it is not directly applicable to AM −1 without explicitly forming the matrix.To get around this issue and still acquire a cheap estimate, we utilized the results of [33] which extends the Gershgorin approach to generalized eigenvalue problems.In our experience, this method overestimates the imaginary part of the spectrum but otherwise works well.We note that if the GPU code utilizes dense matrices, we may of course simply compute AM −1 and apply the original Leja point method.Since we expect to be able to work with sparse matrices in the near future, however, we follow the approach outlined above in both the CPU and GPU codes.
In the following we consider discretizations corresponding to n = 371, 1357, 5177 and 20209.We take R −1 = I, T = 1 2 and P (0) = 0 and measure the computation times and relative errors of the different implementations.In Figure 9 (left) the relative error is shown for the size n = 1357 and different temporal step sizes.The reference solution is computed by the two-term splitting scheme with a step size that is 32 times smaller.We see that the two-term splitting performs considerably better than the other schemes, and the additional error due to a third splitting term has a huge impact on the approximation.In Figure 9 (right) we show the measured computation times for the algorithms when 100 time steps are taken.For the medium scale problem (n = 1357), the GPU and the CPU implementations have roughly the same computational costs but for the larger scale problems, the GPU implementation pays off.
We observe nearly no difference in the time measurements of the three different splitting schemes.This is likely because the integral term is only computed once, and the cost of this computation is minor in comparison to the remaining computations.Since the two-term splitting scheme is much more accurate here, it is therefore also more efficient, as shown in Figure 10.If only low accuracy is desired, the three-term splitting schemes are the best choice, but in all other cases the two-term splitting is most effective.

Conclusions
We have considered several different splitting schemes based on Leja point interpolation for the computation of matrix exponential actions.Since the matrix exponentials act on skinny block-matrices (the low-rank factors) rather than only vectors, we expected that these computations would be highly parallelizable and that GPU acceleration would therefore be beneficial.The latter was verified by several numerical experiments.In the considered problems of academical nature, the GPU code was faster than the pure CPU code by approximately a factor 10 already for matrices of size 2 • 10 3 , and by a factor 10 3 for matrices of size 10 4 .The break-even point was around size 650, which is well below what would be considered large-scale today.In the tested real-world applications, the gains in the El Niño example were in accordance with the more academic examples.For the steel example, they were more modest, but still around a factor 10 for the relevant problem sizes.As there is no difference in the size of the numerical errors, this clearly shows that GPU acceleration can lead to large gains in efficiency and should be considered for matrix equations of these type.The efficiency could additionally be further increased by considering more advanced parallelization techniques.An obvious such candidate is to investigate the use of single-precision computations when the desired level of accuracy is low.
We have also presented comparisons of different splitting strategies, mainly investigating whether one should split away the constant term Q or not, and in which order the subproblems should be solved.For the latter question, we observe that the ordering has minimal influence on the error, and we may thus choose the order such that the computational cost is minimized.(E.g., take the most expensive subproblem as the "middle" term.)For the first question, we expected that it would not be beneficial to split away Q, since the extra integral term which arises only has to be approximated once.This was verified by our experiments, except in the case when Q was relatively small -then, of course, the extra splitting error is similarly small.We note that these results are for the autonomous case.When the matrices that define the equations also depend on time, the situation likely changes as the integral term would need to be recomputed in each step.However, as the modified methods would still rely on matrix exponential actions as their basic building blocks, we still expect that GPU acceleration would significantly increase the efficiency.

2 Figure 1 .
Figure 1.Relative error of the Strang splitting scheme applied to the differential Lyapunov equation.

Figure 3 .
Figure 3. Relative errors of the different splitting schemes applied to the generalized DLE (left) and the generalized DRE (right).

Figure 4 .
Figure 4. Computational costs of the algorithms for the DLE (left) and DRE (right) are given in a log-log plot for different sizes of the matrix A.

Figure 6 .
Figure 6.Computational costs of the algorithm for the generalized DLE (left) and the generalized DRE (right) are given in a log-log plot for different sizes of the matrix A.

Figure 7 .
Figure 7.The relative error for n = 1740 for the splitting scheme of the DLE for the El Niño example (left) and the computational costs for different resolutions (right)

Figure 8 .
Figure 8. Efficiency of the splitting algorithms for the El Niño example for n = 624 (left) and n = 3900 (right).
Finally, the matrix B is the discretization of the operator that implements the Neumann boundary conditions of the Laplacian -this results in a boundary control application.Cancelling M T and M leads to the equation Ṗ = M −T A T P + P AM −1 + M −T QM −1 − P BR −1 B T P, which we can treat like outlined in Section 2.2 after replacing A by M −1 A and Q by M −T QM −1 .

Table 2 .
Time measurements for DRE for different matrix sizes.