A fast and oblivious matrix compression algorithm for Volterra integral operators

The numerical solution of dynamical systems with memory requires the efficient evaluation of Volterra integral operators in an evolutionary manner. After appropriate discretisation, the basic problem can be represented as a matrix-vector product with a lower triangular but densely populated matrix. For typical applications, like fractional diffusion or large-scale dynamical systems with delay, the memory cost for storing the matrix approximations and the complete history of the data would then become prohibitive for an accurate numerical approximation. For Volterra integral operators of convolution type, the \emph{fast and oblivious convolution quadrature} method of Sch\"adle, Lopez-Fernandez, and Lubich allows one to compute the discretized evaluation with $N$ time steps in $O(N \log N)$ complexity, requiring only $O(\log N)$ active memory to store a compressed version of the complete history of the data. We will show that this algorithm can be interpreted as an $\mathcal{H}$-matrix approximation of the underlying integral operator and, consequently, that a further improvement can be achieved, in principle, by resorting to $\mathcal{H}^2$-matrix compression techniques. We formulate a variant of the $\mathcal{H}^2$-matrix vector product for discretized Volterra integral operators that can be performed in an evolutionary and oblivious manner and requires only $O(N)$ operations and $O(\log N)$ active memory. In addition to the acceleration, more general asymptotically smooth kernels can be treated, and the algorithm does not require a priori knowledge of the number of time steps. The efficiency of the proposed method is demonstrated by application to some typical test problems.


Introduction
We study the numerical solution of dynamical systems with memory, which can be modelled by abstract Volterra integro-differential equations of the form
$$\alpha(t)\, y'(t) + A(t)\, y(t) = \int_0^t k(t,s)\, f(s, y(s))\, ds. \qquad (1)$$
Such problems arise in a variety of applications, e.g., in anomalous diffusion [32], the neural sciences [2], transparent boundary conditions [1,20,22,23], wave propagation [1,12,20], field-circuit coupling [13], and many more; see also [7,8,29,34] and the references therein. The simplest model problem which already shares the essential difficulties stemming from the non-locality of the right-hand side in (1) is the evaluation of the integral operator
$$y(t) = \int_0^t k(t,s)\, f(s)\, ds \qquad (2)$$
with kernel function $k$, data $f$, and result function $y$. Let us emphasize that, in order to allow the application in the context of integro-differential problems (1), the parameter-dependent integrals (2) have to be evaluated in an evolutionary manner, i.e., for successively increasing time. The results obtained for (2) then extend quite naturally to (1). We hence focus on the evaluation of Volterra integral operators (2) in the following and return to more general problems in Section 5.

Discretisation and related work
After applying some appropriate discretisation procedure, see [7] for a survey, problem (2) can be phrased as a simple matrix-vector multiplication
$$y_n = (Kf)_n, \qquad 1 \le n \le N. \qquad (3)$$
The evolutionary character and the nonlocal interactions are reflected by the fact that the matrix $K \in \mathbb{R}^{N \times N}$ is lower block triangular but densely populated. The straightforward computation of the result vector $y$ requires $O(N^2)$ algebraic operations. The evolutionary character of problem (2) can be preserved by computing the entries $y_n$ for $n = 1, \ldots, N$ recursively, i.e., by traversing the matrix $K$ from top to bottom. If, on the other hand, the matrix $K$ is traversed from left to right, then the algorithm becomes oblivious, i.e., the data $f_n$ is only required in the $n$th step of the algorithm, but the execution of (3) then requires $O(N)$ active memory to store the partial sums for every row. Although the evaluation can then still be organized in an evolutionary manner, see Section 2.2, the number of time steps $N$ needs to be fixed a priori in order to store the intermediate results.
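The two traversal orders can be contrasted in a small sketch (our own illustration with scalar entries and zero-based indexing, not tied to any particular discretisation of (2)):

```python
import numpy as np

def matvec_evolutionary(K, f):
    """Row-by-row (top to bottom): y_n is finished in step n,
    but all data f_0..f_n must be available -- not oblivious."""
    N = len(f)
    y = np.zeros(N)
    for n in range(N):
        y[n] = K[n, :n + 1] @ f[:n + 1]
    return y

def matvec_oblivious(K, f):
    """Column-by-column (left to right): f_n is touched only in step n,
    but O(N) partial sums must be kept and N must be fixed in advance."""
    N = len(f)
    y = np.zeros(N)                  # partial sums for every row
    for n in range(N):
        y[n:] += K[n:, n] * f[n]
    return y
```

Both variants return $Kf$; they differ only in which data must be available, and for how long.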
For the particular case that the integral kernel in (2) is of convolution type, i.e., $k(t,s) = k(t-s)$, a careful discretisation of (5) gives rise to an algebraic system (3) with block Toeplitz matrix $K$, and the discrete solution $y$ can be computed in $O(N \log N)$ operations using fast Fourier transforms. As shown in [21], an evolutionary version of the matrix-vector product can be realized in $O(N \log^2 N)$ complexity while requiring $O(N)$ active memory. The convolution quadrature methods of [27,28,30] treat the case that only the Laplace transform $\hat k(s)$ of the convolution kernel (4) is available. The fast and oblivious convolution quadrature method introduced in [26,31] allows the efficient evaluation of Volterra integrals with convolution kernel in an evolutionary and oblivious manner with $O(N \log N)$ operations, requiring only $O(\log N)$ active memory and $O(\log N)$ evaluations of the Laplace transform $\hat k(s)$. This method is close to optimal concerning complexity and memory requirements and has been applied successfully to the numerical solution of partial differential equations with transparent boundary conditions [20], the efficient realization of boundary element methods for the wave equation [34], and fractional diffusion [9]. For integral operators (2) with general kernels $k(t,s)$, the above-mentioned methods cannot be applied directly. Alternative approaches, like the fast multipole method [14,16,33], the panel clustering technique [19], $\mathcal{H}$- and $\mathcal{H}^2$-matrices [5,17], multilevel techniques [6,15], or wavelet algorithms [10], which were developed and applied successfully in the context of molecular dynamics and boundary integral equations, are however still applicable. These methods are based on certain hierarchical approximations of the kernel function $k(t,s)$, whose error can be controlled under appropriate smoothness assumptions, e.g., if the kernel is asymptotically smooth; see (27) for details.
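The classical $O(N \log N)$ FFT-based evaluation mentioned above can be sketched via circulant embedding (a non-evolutionary illustration; the function name and sizes are ours):

```python
import numpy as np

def toeplitz_lower_matvec(c, f):
    """y = K f for the lower-triangular Toeplitz matrix K with first
    column c (K[m, n] = c[m - n] for m >= n), computed in O(N log N)
    by embedding K into a circulant matrix of size 2N and using the FFT."""
    N = len(c)
    c_ext = np.concatenate([c, np.zeros(N)])   # first column of the circulant
    f_ext = np.concatenate([f, np.zeros(N)])   # zero-padded data
    y = np.fft.ifft(np.fft.fft(c_ext) * np.fft.fft(f_ext)).real
    return y[:N]                               # discard the wrap-around part
```

The zero padding prevents the circular convolution from wrapping the history back into the result.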
If the data $f$ is independent of the solution $y$, the numerical evaluation of the Volterra integral operator (2) can then be realized with $O(N \log^\alpha N)$ computational cost for some $\alpha \ge 0$, with $N$ again denoting the dimension of the underlying discretisation. Moreover, data-sparse approximations of the matrix $K$ for asymptotically smooth kernels $k(t,s)$ can be stored efficiently with only $O(N \log^\alpha N)$ memory, and for convolution kernels $k(t-s)$ even with $O(\log N)$ memory; we refer to [5,18] for details and an extensive list of references. Unfortunately, the algorithms mentioned in the literature are not evolutionary and, therefore, cannot be applied directly to more complex problems like (1).

A fast and oblivious evolutionary algorithm
In this paper, we propose an algorithm for the efficient evaluation of Volterra integrals (2), or corresponding matrix-vector products (3), which shares the benefits and overcomes the drawbacks of the approaches mentioned above, i.e., it is
- evolutionary: the approximations $y_n$ can be computed one after another, and the number of time steps $N$ does not need to be known in advance,
- oblivious: the entry $f_n$ of the right-hand side is only required in the $n$th step,
- fast: the evaluation of all $y_n$, $1 \le n \le N$, requires only $O(N)$ operations, and
- memory efficient: the storage of the system matrix requires only $O(N)$ memory for general kernels and $O(\log N)$ memory for convolution kernels. The matrix entries can also be computed on the fly, so that only $O(\log N)$ storage is required for a compressed history of the data $f$.
Our strategy is based on the ideas of polynomial-based $\mathcal{H}^2$-compression algorithms for finding hierarchical low-rank approximations of the kernel function $k(t,s)$, leading to a block-structured hierarchical approximation of the matrix $K$ in (3). The accuracy of the underlying approximation can thus be guaranteed by well-known approximation results; see [5,18] for instance. A key ingredient for our considerations is the one-dimensional nature of the integration domain, which allows us to characterize the block structure of the approximating hierarchical matrix explicitly. This enables us to formulate an algorithm which traverses the compressed matrix $K$ from top to bottom, in accordance with the evolutionary structure of the underlying problem. The hierarchical approximation of the kernel also yields a compression strategy for the history of the data $f$. In this sense, our algorithm can be considered a generalisation of [1,4,22,23], where a fast multipole expansion was employed to accelerate the sum-of-exponentials approach, of [24], where a polynomial approximation on growing time steps was employed for the compression of the data, and of [25], where an evolutionary $\mathcal{H}$-matrix approximation with a special low-rank structure was constructed. As a further result, we show that our algorithm integrates seamlessly into the convolution quadrature framework of [27,28] when the kernel $k(t-s)$ is of convolution type and only accessible via its Laplace transform. In analogy to the treatment of nearfield contributions in the fast boundary element method, we utilize standard convolution quadrature to compute the entries of the convolution matrix close to the diagonal, while numerical inverse Laplace transforms [11] are used to set up an $\mathcal{H}^2$-approximation of the remaining off-diagonal parts of the convolution matrix in the time domain.
This approach has strong similarities to the fast and oblivious convolution quadrature method [31,35], but we will reveal some important differences. In particular, we illustrate that the methods of [31,35] can actually be interpreted as $\mathcal{H}$-matrix approximations with a particular organisation of the matrix-vector product in (3), which shows that the $O(N \log N)$ complexity cannot be improved further. Moreover, the convolution matrix must be applied from left to right to allow for an oblivious evaluation, and the number of time steps $N$ must be known in advance.
In contrast, our new algorithm is based on an $\mathcal{H}^2$-approximation of the matrix $K$, and the evolutionary, fast, and oblivious evaluation of the matrix-vector product can be realized by traversing the matrix from top to bottom in $O(N)$ complexity, without needing to know the number of time steps $N$ in advance. Finally, our algorithm naturally extends to general integral kernels $k(t,s)$, increasing the field of applications substantially.

Outline
In Section 2 we recall some general approximation results, introduce our basic notation, and state a slightly modified algorithm for the dense evaluation of Volterra integral operators to illustrate some basic principles that we exploit later on. Section 3 is concerned with a geometric partitioning of the domain of integration, the multilevel hierarchy used for the $\mathcal{H}^2$-compression, and the description and analysis of our new algorithm. In Section 4 we consider convolution kernels that are only given via their Laplace transform $\hat k(s)$ and discuss the relation of our algorithm to Lubich's convolution quadrature and the connections to the fast and oblivious algorithm of [31,35]. To support our theoretical considerations, some numerical results are provided in Section 5.

Preliminary results
Let us start by summarizing some basic results on typical discretisation strategies for Volterra integral operators, which form the basis for the efficient and reliable numerical evaluation later on. For simplicity, all functions $y$, $f$, and $k$ are assumed to be scalar valued. We will demonstrate the application to more general problems of the form (1) in Section 5.

A general approximation result
For the discretisation of the integral operator (5), we consider methods of the form
$$y_h(t) = \int_0^t k_h(t,s)\, f_h(s)\, ds, \qquad (6)$$
where $k_h$ and $f_h$ are suitable approximations of $k$ and $f$. The subscript $h$ will be used to designate approximations throughout. The following result may serve as a theoretical justification for a wide variety of particular discretisation schemes.
Lemma 1 Let $T > 0$, kernels $k, k_h \in L^\infty(0,T; L^r(0,T))$, and $f, f_h \in L^{r'}(0,T)$ be given with $1 \le r, r' \le \infty$ and $1/r + 1/r' = 1$. Further assume that the perturbation bounds (7) hold. Then the functions $y, y_h$ defined by (5) and (6) satisfy the estimate (8), i.e., the error in the result can be bounded uniformly by the perturbations in the data.
Proof From Hölder's inequality, we can deduce a bound for $y(t) - y_h(t)$ in terms of the perturbations in the data. The result then follows by estimating $\|f_h\|_{L^{r'}} \le \|f\|_{L^{r'}} + \|f - f_h\|_{L^{r'}}$, using the assumed estimates for the differences in the data, and taking the supremum over all $0 < t < T$.
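Written out, the Hölder step takes the following form (a sketch consistent with the norms of Lemma 1 and with the splitting suggested by the bound on $\|f_h\|$ in the proof):

```latex
y(t) - y_h(t)
  = \int_0^t (k - k_h)(t,s)\, f_h(s)\, ds
  + \int_0^t k(t,s)\, (f - f_h)(s)\, ds,
\qquad\text{whence}\qquad
|y(t) - y_h(t)|
  \le \|k(t,\cdot) - k_h(t,\cdot)\|_{L^r}\, \|f_h\|_{L^{r'}}
  + \|k(t,\cdot)\|_{L^r}\, \|f - f_h\|_{L^{r'}}.
```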

Remark 1
The constant $C$ in the estimate (8) depends on the kernel $k$, but is independent of $T$. The result can therefore be applied to time intervals of arbitrary size. Without substantially changing the argument, it is possible to obtain similar estimates also in other norms. In many cases, $\tilde y_h$ only serves as an intermediate result and the final approximation is given by $y_h(t) = (P_h \tilde y_h)(t)$, where $P_h$ is some projection or interpolation operator; a particular case will be discussed in more detail below. Estimates for the error $y - y_h$ can then be obtained by additionally taking into account the projection errors.

Piecewise polynomial approximations
Many discretisation methods for integral or integro-differential equations, e.g., collocation or Galerkin methods [7], are based on piecewise polynomial approximations and fit into the abstract form mentioned above. As a particular example and for later reference, we consider such approximations in a bit more detail now. Let $h > 0$ be given, define $t_n = nh$, $n \ge 0$, and set $T = t_N = Nh$. We set $I_n = [t_{n-1}, t_n]$ for $1 \le n \le N$ and denote by $\mathcal{T}_h = \{I_n : 1 \le n \le N\}$ the resulting uniform mesh of the interval $[0,T]$. We write $P_p(a,b)$ for the space of polynomials of degree at most $p$ over the interval $(a,b)$, and $P_{q,q}$ for the space of polynomials in two variables of degree at most $q$ in each variable. We further define piecewise polynomial spaces $P_p(\mathcal{T}_h)$ and $P_{q,q}(\mathcal{T}_h \times \mathcal{T}_h)$ over the grid $\mathcal{T}_h$ and the tensor-product grid $\mathcal{T}_h \times \mathcal{T}_h$.
For sufficiently regular functions $f$ and $k$ over the mesh $\mathcal{T}_h$ and $\mathcal{T}_h \times \mathcal{T}_h$, piecewise polynomial approximations $k_h$, $f_h$ satisfying (7) can be found by appropriate interpolation and choosing the mesh size $h$ small enough. Without further structural assumptions on the data, it seems natural to use uniform grids $\mathcal{T}_h$, which can be obtained, e.g., by uniform refinement of some reference grid. In Figure 1, we depict the resulting uniform partitions for the approximation of the kernel function $k$.

Fig. 1: Uniformly refined grids $\mathcal{T}_h \times \mathcal{T}_h$ for the approximation of $k$. Only the elements required for approximating $k(t,s)$ for $s \le t$ are depicted. The grid cells near the diagonal $t = s$, i.e., the nearfield, play a special role and are thus coloured in gray.
The evaluation of $y_h$ defined by (6) can be split into two contributions, with $w_h$ corresponding to the integrals over the farfield cells and $z_h$ to those over the nearfield cells, which are depicted in white and gray in Figure 1, respectively. As discussed in the following subsection, the numerical treatment of these contributions differs slightly.

Practical realisation
From equation (6) and the choice of $f_h \in P_p(\mathcal{T}_h)$ and $k_h \in P_{q,q}(\mathcal{T}_h \times \mathcal{T}_h)$, one can see that the function defined by (6), denoted $\tilde y_h$ in the following, is a piecewise polynomial of degree $\le p + q + 1$ over the grid $\mathcal{T}_h$. It is often convenient to replace $\tilde y_h$ by a piecewise polynomial $y_h \in P_p(\mathcal{T}_h)$ of the same degree as the data. For this purpose, we simply choose a set of distinct points $0 \le \gamma_j \le 1$, $0 \le j \le p$, and define $y_h \in P_p(\mathcal{T}_h)$ by collocation on every time interval $I_n \in \mathcal{T}_h$, with collocation points $t_j^n = t_{n-1} + \gamma_j h$, $j = 0, \ldots, p$. Now let $\psi_j^n$, $j = 0, \ldots, p$, denote the Lagrangian basis of $P_p(I_n)$ with respect to the interpolation points $t_j^n$. As a consequence of the uniformity of the mesh $\mathcal{T}_h$, one may deduce that the basis $\{\psi_i^n\}_{0 \le i \le p}$ is invariant under translation, which will become an important ingredient for our algorithm below. The approximate data $f_h$ and the discrete solution $y_h$ can now be expanded in these bases, see (15). In a similar manner, the approximate kernel function $k_h$ can be expanded with respect to a set of bases $\{\varphi_i^n\}_{i=0,\ldots,q}$ for the spaces $P_q(I_n)$, which leads to (16). We will again assume translation invariance of this second basis. Note that we allow for different polynomial degrees $q \ne p$ in the approximations $y_h$, $f_h$, and $k_h$, and hence two different sets of basis functions are required. Evaluating (6) at time $t = t_j^m$ and utilizing (12), we obtain (18), where we used the splitting of the integration domain $(0, t_j^m)$ into subintervals of the mesh $\mathcal{T}_h$ and an additional separation of farfield and nearfield contributions; see Figure 1. By inserting the basis representations (15) and (16) for $y_h$, $f_h$, and $k_h$, the integrals in the farfield contribution can be expressed in terms of coupling matrices. For convenience of presentation, let us introduce short-hand notations $P$ and $Q$ for these matrices, and note that they are independent of the time steps $m, n$, due to the translation invariance conditions (14) and (17).
The result vector $y_m$ containing the entries $y_{m,j} = y_h(t_j^m)$ from (18) can then be expressed with farfield contribution $w_m$ given in terms of the matrices $k^{m,n}$ containing the entries $k^{m,n}_{i,j}$. Further introducing the symbols $K^{m,n} = P\, k^{m,n}\, Q$, this may be stated equivalently as $w_m = \sum_{n=0}^{m-2} K^{m,n} f_n$. In a similar manner, we may represent the nearfield contribution $z_m$. We denote by $y, f \in \mathbb{R}^{N(p+1)}$ the global vectors that are obtained by stacking the element contributions $y_m$, $f_n$ together. The computation of $y$ can then be written as a matrix-vector product $y = Kf$ with block matrix $K \in \mathbb{R}^{N(p+1) \times N(p+1)}$ consisting of the blocks $K^{m,n}$ defined above. A possible block-based implementation of this matrix-vector product can be realized as follows.
Algorithm 1 Evaluation of Volterra integral operators for uniform meshes.
Remark 2 At first sight, this algorithm may seem more complicated than actually required. In fact, after generating the matrix blocks $K^{m,n} = P\, k^{m,n}\, Q$, the $m$th component of the result vector could also be computed as $y_m = \sum_{n=1}^{m} K^{m,n} f_n$. The above version, however, is closer to the algorithm developed in the next section. Moreover, it is evolutionary, i.e., the entries of the vector $y$ are computed one after another, and oblivious in the sense that only the blocks $f_{m-1}$ and $f_m$ are needed for the computation of $y_m$. Note, however, that all auxiliary values $g_n$, $n = 1, \ldots, m-2$, are required to compute the block $y_m$ and therefore have to be kept in memory. This will be substantially improved in Section 3.2.
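A scalar sketch of this oblivious but memory-hungry evaluation (our own illustration with $p = 0$, so that the blocks reduce to numbers and the auxiliary values $g_n$ coincide with the data $f_n$; zero-based indexing):

```python
import numpy as np

def evaluate_evolutionary(K, f_stream):
    """Evolutionary and oblivious evaluation in the spirit of Remark 2:
    step m consumes only the new datum f_m, the two most recent data form
    the nearfield, and the archived history g_0..g_{m-2} (O(N) memory)
    is used for the farfield sum."""
    y, g, f_prev = [], [], None
    for m, f_m in enumerate(f_stream):       # f_m arrives in step m only
        farfield = sum(K[m, n] * g[n] for n in range(m - 1))  # n <= m - 2
        nearfield = K[m, m] * f_m
        if m >= 1:
            nearfield += K[m, m - 1] * f_prev
        y.append(farfield + nearfield)
        if m >= 1:
            g.append(f_prev)                 # archive the history
        f_prev = f_m
    return np.array(y)
```

The result agrees with the plain lower-triangular matrix-vector product; the point is the access pattern, not the arithmetic.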

Computational complexity
As indicated above, the computation of $y_h$ according to (18) can be phrased in algebraic form as a matrix-vector product with $y$ and $f$ denoting the coefficient vectors for $y_h$ and $f_h$, and a lower block triangular matrix $K \in \mathbb{R}^{N(p+1) \times N(p+1)}$. Note that the pattern of the matrix $K$ is structurally the same as that of the tensor-product grid underlying the approximation of the kernel function $k$, see Figure 1, with each cell corresponding to a block of size $(p+1) \times (p+1)$. Thus, the computation of $y = Kf$ will in general require $O(p^2 N^2)$ operations and $O(p^2 N^2)$ memory to store the matrix $K$. In addition, we require $O(pN)$ active memory to store two values of $f_n$ and the history of $g$.

A fast and oblivious algorithm
The aim of this section is to introduce a novel algorithm which allows for a simultaneous compression of the matrix $K$ used for the evaluation of (24) and of the history of the data stored in the vectors $g_n$, $n \ge 1$. The underlying approximation is that of $\mathcal{H}^2$-matrix compression techniques [5,18]. We first collect some results about these hierarchical approximations and then state and analyze our algorithm.

Multilevel partitioning
For ease of presentation, we will assume that the number of time steps is given as $N = 2^L$ with $L \in \mathbb{N}$ and define $h = T/N$. Now let $I^{(n;1)} = I_n$ and introduce a hierarchy of nested partitions into subintervals
$$I^{(n;\ell)} = I^{(2n-1;\ell-1)} \cup I^{(2n;\ell-1)} = \bigl[t_{2^{\ell-1}(n-1)},\, t_{2^{\ell-1} n}\bigr], \qquad \ell > 1,$$
of length $2^{\ell-1} h$, by recursive coarsening of the intervals; see Figure 2 for a sketch.
where $\lceil r \rceil$ denotes the smallest integer larger than or equal to $r$, as usual.
In a similar spirit to [5,14,16,18], we next introduce a multilevel partitioning of the support of the kernel $k$, leading to adaptive hierarchical meshes $\mathcal{AT}_h$. Note that every element of $\mathcal{AT}_h$ is square and thus corresponds to one element of a, possibly coarser, uniform mesh $\mathcal{T}_h \times \mathcal{T}_h$. Moreover, any element in $\mathcal{AT}_h$ is the union of elements of the underlying uniform mesh $\mathcal{T}_h \times \mathcal{T}_h$ and can be constructed by recursive agglomeration or coarsening. As illustrated in Figure 3, the hierarchic adaptive mesh $\mathcal{AT}_h$ can again be split into nearfield elements adjacent to the diagonal and the remaining farfield elements. Let us remark that the resulting partitioning and its splitting coincide with most of the classical partitioning strategies developed in the context of panel clustering and $\mathcal{H}$- and $\mathcal{H}^2$-matrices; see [18] and the references therein.
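The row-wise block structure induced by such a partitioning can be made concrete in a small sketch (an illustration with the admissibility condition "block length $\le$ distance to the diagonal" and two nearfield cells per row; these conventions are ours, not the paper's exact definitions):

```python
def farfield_blocks(m):
    """Dyadic farfield partition of row m (columns 0..m): the two cells
    m-1, m form the nearfield; the remaining columns 0..m-2 are covered
    by aligned dyadic blocks [a, a + 2^j - 1] (a divisible by the block
    length), each subject to the admissibility condition
        block length <= number of cells between the block and column m."""
    blocks, n = [], 0
    while n <= m - 2:
        ell = 1
        # grow the block while it stays aligned, in range, and admissible
        while n % (2 * ell) == 0 and n + 2 * ell - 1 <= m - 2 \
                and 2 * ell <= m - n - 2 * ell:
            ell *= 2
        blocks.append((n, n + ell - 1))
        n += ell
    return blocks
```

For example, `farfield_blocks(10)` yields the blocks `(0, 3), (4, 5), (6, 7), (8, 8)`: large blocks far from the diagonal, small ones close to it, exactly the pattern of Figure 3.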

Adaptive data sparse approximation
Let $P_{q,q}(\mathcal{AT}_h)$ be the space of piecewise polynomials of degree $\le q$ in each variable over the mesh $\mathcal{AT}_h$. Since the adaptive hierarchical mesh is obtained by coarsening of the underlying uniform grid $\mathcal{T}_h \times \mathcal{T}_h$, we certainly have $P_{q,q}(\mathcal{AT}_h) \subset P_{q,q}(\mathcal{T}_h \times \mathcal{T}_h)$. Instead of a uniform approximation as used in Section 2.2, we now consider adaptive approximations $k_h \in P_{q,q}(\mathcal{AT}_h)$ for the evaluation of (6) or (18).
Remark 3 Let us assume for the moment that the kernel $k$ in (2) is asymptotically smooth, i.e., there exist constants $c_1, c_2 > 0$ and $r \in \mathbb{R}$ such that
$$|\partial_t^\alpha \partial_s^\beta k(t,s)| \le c_1\, (\alpha + \beta)!\, c_2^{\alpha+\beta}\, |t - s|^{r - \alpha - \beta} \qquad (27)$$
for all $\alpha, \beta \ge 0$ and all $t \ne s$. As shown in [5,18], adaptive approximations $k_h \in P_{q,q}(\mathcal{AT}_h)$ can be constructed for asymptotically smooth kernels which converge exponentially in $q$ in the farfield. As a consequence, the level of accuracy required in Lemma 1 can be achieved by adaptive approximations with far fewer degrees of freedom than by uniform approximations.
Remark 4 It is not difficult to see that $\dim(P_{q,q}(\mathcal{T}_h \times \mathcal{T}_h)) = O(N^2 q^2)$, while $\dim(P_{q,q}(\mathcal{AT}_h)) = O(N q^2)$. The adaptive hierarchical approximation thus is data-sparse and leads to substantial savings in the memory required for storing the kernel approximation or its matrix representation (24); compare with Lemma 3 at the end of this section. In addition, an appropriate reorganisation of the operations required for the matrix-vector product (24) leads to a substantial reduction of the computational complexity. Moreover, the fast evaluation also induces an automatic compression of the history of the data.
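The exponential convergence in $q$ on admissible blocks can be observed numerically; the following sketch interpolates the asymptotically smooth model kernel $k(t,s) = 1/(t-s)$ on the well-separated block $[2,3] \times [0,1]$ (setup, node choice, and names are our own):

```python
import numpy as np

def cheb_nodes(a, b, q):
    """q + 1 Chebyshev nodes on the interval [a, b]."""
    k = np.arange(q + 1)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * q + 2))

def lagrange_matrix(nodes, x):
    """L[m, j] = j-th Lagrange polynomial for `nodes`, evaluated at x[m]."""
    L = np.ones((len(x), len(nodes)))
    for j, nj in enumerate(nodes):
        for i, ni in enumerate(nodes):
            if i != j:
                L[:, j] *= (x - ni) / (nj - ni)
    return L

def farfield_interp_error(q):
    """Max error of degree-q tensor interpolation of k(t, s) = 1/(t - s)
    on the well separated (admissible) block [2, 3] x [0, 1]."""
    k = lambda t, s: 1.0 / (t - s)
    tn, sn = cheb_nodes(2.0, 3.0, q), cheb_nodes(0.0, 1.0, q)
    coeffs = k(tn[:, None], sn[None, :])          # kernel at the tensor nodes
    t, s = np.linspace(2.0, 3.0, 40), np.linspace(0.0, 1.0, 40)
    approx = lagrange_matrix(tn, t) @ coeffs @ lagrange_matrix(sn, s).T
    exact = k(t[:, None], s[None, :])
    return float(np.max(np.abs(approx - exact)))
```

The measured error drops by several orders of magnitude between $q = 4$ and $q = 10$, in line with the exponential convergence results of [5,18].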

Multilevel hierarchical basis
In order to obtain an algorithm for the matrix-vector multiplication (3) of quasi-optimal complexity, we require the following second fundamental ingredient. Based on the multilevel hierarchy $I^{(n;\ell)}$ of time intervals and the translation invariance of the basis functions $\varphi_i^n =: \varphi_i^{(n;1)}$, we define a multilevel basis $\varphi_i^{(n;\ell)}$ for the spaces $P_q(I^{(n;\ell)})$, $\ell > 1$, appearing in the farfield blocks of the approximation $k_h \in P_{q,q}(\mathcal{AT}_h)$. Let us note that, by translation invariance, the coefficients $A_{i,j}$ relating the bases on consecutive levels are independent of $n$ and $\ell$. For each of the elements of the adaptive partition $\mathcal{AT}_h$, we expand the kernel function accordingly. We can now further insert the expansions (29) for the kernel $k_h$ into the farfield integrals over $I^{(P(m,n;\ell);\ell)}$, and, by the recursive definition of $\varphi_i^{(n;\ell)}$, the resulting expression can be rewritten in terms of the coefficient vectors on the next coarser level.
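For a Lagrange basis on Chebyshev points, the transfer matrices of the upward pass are simply the parent basis functions sampled at the child nodes, since a degree-$q$ polynomial is reproduced exactly by interpolation at $q+1$ points. A small sketch (the node choice and the names $A^{(1)}, A^{(2)}$ for the two transfer matrices are our own illustration):

```python
import numpy as np

def lagrange_basis(nodes, x):
    """B[i, m] = i-th Lagrange polynomial for `nodes`, evaluated at x[m]."""
    B = np.ones((len(nodes), len(x)))
    for i, ni in enumerate(nodes):
        for m, nm in enumerate(nodes):
            if m != i:
                B[i, :] *= (x - nm) / (ni - nm)
    return B

q = 3
ref = 0.5 * (1 + np.cos((2 * np.arange(q + 1) + 1) * np.pi / (2 * q + 2)))
parent = ref                # nodes on the parent interval [0, 1]
child1 = 0.5 * ref          # the same nodes mapped to the left child [0, 1/2]
child2 = 0.5 + 0.5 * ref    # ... and to the right child [1/2, 1]

# Transfer matrices: parent basis sampled at the child nodes; by polynomial
# exactness, phi_i^{parent}(x) = sum_j A[i, j] * phi_j^{child}(x) on the child.
A1 = lagrange_basis(parent, child1)
A2 = lagrange_basis(parent, child2)
```

By translation and scaling invariance of the node pattern, the same two matrices serve on every level, which is what makes the upward pass of the algorithm uniform across the hierarchy.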

The result of the integral (18) is then finally obtained as $y_m = P u^{(1)} + z_m$, with nearfield contribution $z_m$ and projections $P_{j,k}$ as given in (20) and (22). The above derivations are summarized in Algorithm 2; its key steps are the upward pass $g^{(1;\ell)} = A^{(1)} g^{(1;\ell-1)} + A^{(2)} g^{(2;\ell-1)}$, executed for every level $\ell > 1$ that is coarsened in the current time step, followed by the farfield interactions with a downward pass, and the final assembly $y_m = P u^{(1)} + z_m$.
Remark 5 The Matlab function bitxor(a, b) returns the integer generated by a bitwise xor comparison of the binary representations of $a$ and $b$ in $O(1)$ complexity. This allows one to determine $L_{\text{coarse}} = \arg\max_k \{B(m;k) \ne B(m-1;k)\}$ in $O(1)$ complexity in each step, which is required for achieving a theoretical runtime of $O(N)$. In actual implementations, one may just set $L_{\text{coarse}} = L(m)$ without any notable difference in computation times. Further note that only one value $u^{(n;\ell)}$ and two values of $g^{(n;\ell)}$ are required on each level $\ell$. Moreover, at most two values of $f_n$ are required at any time step. The required buffers are denoted by $u^{(\ell)}$, $f^{(i)}$, and $g^{(i;\ell)}$, $i = 1, 2$, $\ell = 1, 2, 3, \ldots$. The complexity of the overall algorithm is analyzed in detail in the next section.
Remark 6 Readers familiar with $\mathcal{H}^2$-matrices or the fast multipole method may notice that our algorithm is similar to the corresponding matrix-vector multiplications, but with rearranged computations. In each time step, the algorithm checks for changes in the matrix partitioning structure compared to the previous time step. Then, starting from the coarsest level, an upward pass of one level is executed for all coarsened intervals of the farfield by applying the transfer or multipole-to-multipole matrices. Quantities from the coarsened intervals are overwritten. Thereafter, the farfield interactions are computed by applying the kernel or multipole-to-local matrices corresponding to the changed intervals, directly including the contributions of the next coarser level with a downward pass by applying the transfer or local-to-local matrices. Finally, the nearfield contributions are applied.
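The bitxor trick of Remark 5 can be sketched as follows (a Python transcription; the identification of $L_{\text{coarse}}$ with the highest changed bit of the step counter is our reading of the remark):

```python
def coarsest_changed_level(m):
    """Highest level on which the dyadic interval index changes between
    steps m-1 and m: the binary representations of m-1 and m differ
    exactly in the trailing bits up to the lowest set bit of m, so the
    level is the bit length of bitxor(m, m-1) -- an O(1) operation."""
    return (m ^ (m - 1)).bit_length()
```

Equivalently, this is one plus the number of trailing zero bits of `m`: coarsening up to level 3 happens whenever `m` is divisible by 4, and so on.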

Complexity estimates
In the following, we consider Algorithm 2 for the evaluation of (18) with approximate kernel $k_h \in P_{q,q}(\mathcal{AT}_h)$ and data $f_h \in P_p(\mathcal{T}_h)$, and with $N = 2^L$ denoting the number of time intervals in $\mathcal{T}_h$. The assertions of the following lemmas are well known, see e.g. [5], but their reasoning is simple and illustrative, so we repeat it for the convenience of the reader. For the operation count, one sums the corresponding contributions over all levels of the hierarchy; since $N = 2^L$, Young's inequality yields the assertion.

Lemma 3
The $\mathcal{H}^2$-matrix representation $K$ of the adaptive hierarchic approximation $k_h \in P_{q,q}(\mathcal{AT}_h)$ can be stored in $O(N(p^2 + q^2))$ memory. If the kernel is of convolution type (4), then the memory cost reduces to $O(p^2 + \log_2(N)\, q^2)$. Proof The proof for the adaptive approximation is similar to that of the previous lemma, with the $p^2$-related term arising from the nearfield and the $q^2$-related term from the farfield. For a kernel of convolution type, the hierarchical approximation inherits a block Toeplitz structure, such that we only have to store $O(1)$ coefficient matrices per level for the farfield and $O(1)$ coefficient matrices for the nearfield.
Let us finally also remark on the additional memory required during execution. Lemma 4 The active memory required for storing the data history in Algorithm 2 is bounded by $O(q \log_2 N + p)$.
Proof We require $O(1)$ vectors of length $p$ for the nearfield and at most two vectors $g^{(n;\ell)}$ of length $q$ on each of the $L = \log_2(N)$ levels for the farfield contributions.

Summary
In this section, we discussed the adaptive hierarchical data-sparse approximation for the dense system matrix $K$ in (3) stemming from a uniform polynomial-based discretization of the Volterra integral operators (2). This approximation amounts to an $\mathcal{H}^2$-matrix compression of the system matrix, leading to $O(N)$ storage complexity for general and $O(\log(N))$ storage complexity for convolution kernels. Using a multilevel basis representation on the hierarchy of the time intervals, we formulated a fast and oblivious evolutionary algorithm for the numerical evaluation of Volterra integrals (2). The overall complexity for computing the matrix-vector product in (3) is $O(N)$, and only $O(\log(N))$ memory is required to store the compressed history of the data. The algorithm is executed in an oblivious and evolutionary manner and can therefore be generalized immediately to integro-differential equations of the form (1). Moreover, knowledge of the number of time steps $N$ is not required prior to execution.

Approximation of convolution operators
While our algorithm for the fast evaluation of Volterra integral operators is based on a time-domain approximation, in many interesting applications, see [29] for examples and references, the kernel function in (5) is of convolution type and only accessible indirectly via its Laplace transform, i.e., the transfer function
$$\hat k(\lambda) = \int_0^\infty e^{-\lambda t}\, k(t)\, dt.$$
Let us note that, at least formally, the evaluation of the kernel function in the time domain can be achieved by the inverse Laplace transform
$$k(t) = \frac{1}{2\pi i} \int_\Gamma e^{\lambda t}\, \hat k(\lambda)\, d\lambda, \qquad (32)$$
where $\Gamma$ is an appropriate contour connecting $-i\infty$ with $i\infty$; see [3] for details. To ensure the existence of the inverse Laplace transform, we require that $\hat k(\lambda)$ is analytic in a sector $|\arg(\lambda - c)| < \varphi$, (33), and that
$$|\hat k(\lambda)| \le M |\lambda|^{-\mu} \quad \text{for some fixed } M, \mu > 0, \qquad (34)$$
and we tacitly assume that the contour $\Gamma$ lies within the domain of analyticity of the function $\hat k$. In this section, we show that, with some minor modifications, Algorithm 2 and our analysis are applicable also in this setting, and we compare our method with the fast and oblivious convolution quadrature of [31,35].

Approximation and numerical realization
As a first step, we show that the convolution kernel k given implicitly by (32) indeed satisfies the assumption (27) on asymptotic smoothness. Thus an accurate adaptive hierarchical approximation as discussed in Section 3 is feasible.
For the construction of the adaptive approximation $k_h$, we can now proceed in complete analogy to (11), i.e., we split the convolution integral into a farfield and a nearfield contribution. The latter can be computed stably and efficiently with Lubich's convolution quadrature; see [27,28] for details. For the farfield contributions, we utilise the adaptive hierarchical approximation discussed in the previous section. If direct access to the kernel $k(t,s) = k(t-s)$ is available, Algorithm 2 can be applied directly. If, on the other hand, only the transfer function $\hat k(s)$ is accessible, the values of $k_h(t,s) = k(t-s)$ can be computed by fast numerical Laplace inversion; see [11,26,37]. Here we follow the approach of [26,35], which is based on hyperbolic contours of the form
$$\gamma(\theta) = \mu\,\bigl(1 - \sin(\alpha + i\theta)\bigr) + \sigma, \qquad \theta \in \mathbb{R}, \qquad (36)$$
with $\mu > 0$, $0 < \alpha < \pi/2 - \varphi$, and $\sigma \in \mathbb{R}$, such that the contour remains in the sector of analyticity (33) of $\hat k$. The discretisation of the contour integral (32) by the trapezoidal rule with uniform step width $\tau$ yields
$$k(t) \approx \frac{\tau}{2\pi i} \sum_{r=-R}^{R} e^{\gamma(\theta_r) t}\, \hat k(\gamma(\theta_r))\, \gamma'(\theta_r), \qquad \theta_r = \tau r.$$
Given that we are interested in $k(t)$ for $t \in [t_{\min}, t_{\max}]$ and have fixed values for $\alpha$ and $\sigma$, suitable parameters $\tau$ and $\mu$ can be determined by balancing the truncation and discretisation errors of this quadrature; the resulting formulas involve the machine precision $\varepsilon$ and are given in [26,35]. In our examples in Section 5 we chose $\alpha = 3\pi/16$ and $\sigma = 0$. For error bounds concerning this approach we refer to [26,35].

Comparison with fast and oblivious convolution quadrature
Similar to Algorithm 2, the fast and oblivious convolution quadrature (FOCQ) method of [31,35] is also based on a splitting (35) of the convolution integral into nearfield and farfield contributions, and the former can again be computed stably and efficiently with Lubich's convolution quadrature [27,28]. A different adaptive hierarchic approximation, based on L-shaped cells, is now used for the approximation in the farfield; see Figure 5. The farfield part of the convolution is approximated via the inverse Laplace transform: choosing an appropriate contour $\Gamma$, see (36), and corresponding quadrature points, the farfield contributions can be computed by solving linear differential equations with appropriate values $s = b^{(\ell)}$ and $\gamma = \gamma(\theta_r^{(\ell)})$. Thus, the fast and oblivious convolution quadrature provides an approximation of the convolution matrix by solving an auxiliary set of $(2R+1)L$ differential equations. In order to obtain an oblivious algorithm, it is crucial that the solution of each differential equation is updated in each time step, i.e., the compressed convolution matrix must be evaluated from left to right; see [31,35] for details.
The connection to our approach is revealed by noticing that the compression underlying the fast and oblivious convolution quadrature actually implements a low-rank approximation in each of the L-shaped farfield blocks. The corresponding farfield approximation (38) can thus be understood as a low-rank matrix-vector product realized implicitly by the numerical solution of a differential equation. Since the partitioning depicted in Figure 5 can easily be refined to an adaptive partitioning as in Figure 3, the fast and oblivious convolution quadrature can be interpreted as a particular case of an $\mathcal{H}$-matrix approximation with a particular realisation of the $\mathcal{H}$-matrix-vector product.
The $O(\log N)$ memory cost and $O(N \log N)$ complexity of the algorithm can then be immediately deduced from the $\mathcal{H}$-matrix literature [5,18]. As mentioned in the introduction, the algorithm proposed in Section 3.2, with the modifications discussed above, is based on an $\mathcal{H}^2$-matrix approximation and leads to a better complexity of $O(N)$. It is also clear that the number of quadrature points for the numerical Laplace inversion determines the ranks of the farfield blocks in the $\mathcal{H}$-matrix approximation, which allows for an improved understanding of the approximation error in terms of the approximation order.

Numerical examples
In the following we present a series of numerical examples to illustrate and verify the capabilities of our new algorithms. The experiments are performed in Matlab, in which we implemented our new algorithm as well as a version of the fast and oblivious convolution quadrature algorithm of [26,31] for reference. Although our new algorithm performed considerably faster in all of the following examples, we do not present a detailed comparison.
In accordance with the $\mathcal{H}$- and $\mathcal{H}^2$-matrix literature, we require the farfield cells in our implementation to consist of at least $n_{\min} \times n_{\min} = 16 \times 16$ nearfield cells; the cells in Figure 3 thus represent $16 \times 16$ blocks of fine-grid cells. Following [7, Chapter 2], we choose Radau-IIA collocation methods with $p = 1, 2, 3$ stages for the discretization of the Volterra integral operators, which is exactly the scheme used for the approximation as outlined in Section 2.2 and the beginning of Section 2.3. This fixes the approximation spaces $P_p(T_h)$ for the solution $y_h$ and the data $f_h$. The error of this discretization scheme is given by (40), which predicts convergence of order $2p-1$ for smooth data $f \in C^{2p-1}([0,T])$ and kernel $k \in C^{2p-1}(\{(t,s) : 0 \le s \le t \le T\})$; see [7, Chapter 2] for details. For convolution kernels $k(t-s)$, we utilize Lubich's convolution quadrature [27,28] in the nearfield and the same strategy as above in the farfield. Again, the Radau-IIA method is used as the underlying integration scheme in the nearfield. This allows to estimate the approximation error by (41), valid for transfer functions satisfying (33)-(34) and data $f \in C^{2p-1}([0,T])$; see [30] for details. The parameter $\mu$ appearing there is related to the singularity of the kernel $k(t-s)$ at $t = s$.
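As a cross-check of the discretisation order, the following sketch applies the 2-stage Radau-IIA method (order $2p-1 = 3$) to the scalar test equation $y' = -y$; the observed error ratio under step halving should approach $2^3 = 8$. The scalar ODE test problem is our own choice; the Butcher tableau is the standard one:

```python
import numpy as np

# 2-stage Radau-IIA Butcher tableau (order 3, stiffly accurate)
A = np.array([[5/12, -1/12],
              [3/4,   1/4]])
b = np.array([3/4, 1/4])

def radau2_linear(lam, y0, T, n):
    """Integrate y' = lam*y on [0, T] with n steps of 2-stage Radau IIA."""
    h, y = T / n, y0
    I = np.eye(2)
    for _ in range(n):
        # stage values g solve (I - h*lam*A) g = y * ones(2)
        g = np.linalg.solve(I - h * lam * A, y * np.ones(2))
        y = y + h * lam * (b @ g)
    return y

e1 = abs(radau2_linear(-1.0, 1.0, 1.0, 50) - np.exp(-1.0))
e2 = abs(radau2_linear(-1.0, 1.0, 1.0, 100) - np.exp(-1.0))
```

Halving the step size reduces the error by a factor close to $8$, consistent with the order $2p-1 = 3$.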

Variation of constants formula
The first example is dedicated to the solution of the differential equation
$$y'(t) + 2t\, y(t) = f(t), \quad t \in (0,T], \qquad (42)$$
$$y(0) = y_0. \qquad (43)$$
By the variation of constants formula, the solution can be expressed as
$$y(t) = e^{-t^2} y_0 + \int_0^t e^{s^2-t^2} f(s)\, ds. \qquad (44)$$
Let us note that the integral kernel $k(t,s) = e^{s^2-t^2}$ satisfies the asymptotic smoothness assumption (27), but is not of convolution type $k(t-s)$. To obtain a reference solution, we solve (42)-(43) numerically with a 3-stage Radau-IIA method and with $N_\infty = 2^{19}$ time steps. For the numerical solution of (44), we then employ Algorithm 2 with polynomial degree $q = 16$ for the kernel approximation and various degrees $p$ for the approximation of the data $f$ and the solution $y$. The left plot of Figure 6 illustrates that we indeed reach the theoretical convergence rates predicted by (40), up to a tolerance of around $10^{-10}$ at which numerical noise begins to dominate. From the right plot one can immediately deduce the linear complexity of the algorithm.
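As an independent sanity check for this example, one can evaluate the variation-of-constants integral with kernel $e^{s^2-t^2}$ by direct quadrature for a manufactured right-hand side. The choice $f(s) = \cos(s)\,e^{-s^2}$, for which the integral has the exact value $e^{-t^2}\sin(t)$, is our own test construction:

```python
import numpy as np

def volterra_eval(t, n=2000):
    """Evaluate y(t) = int_0^t exp(s^2 - t^2) f(s) ds by the composite
    trapezoidal rule, with manufactured data f(s) = cos(s)*exp(-s^2)."""
    s = np.linspace(0.0, t, n + 1)
    g = np.exp(s**2 - t**2) * np.cos(s) * np.exp(-s**2)
    h = t / n
    return h * (g.sum() - 0.5 * (g[0] + g[-1]))
```

Of course, this direct evaluation costs $O(n)$ per time point and hence $O(n^2)$ overall, which is precisely the cost that Algorithm 2 avoids.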

Nonlinear Volterra integral equation
We continue with a second test example taken from [35], in which we consider a nonlinear Volterra integral equation with convolution kernel $k(t-s) = 1/\sqrt{\pi(t-s)}$, whose Laplace transform is given by $\hat k(\lambda) = 1/\sqrt{\lambda}$. The evaluation of the integral kernel $k(t-s)$ is realized via numerical inverse Laplace transforms with $R = 15$ quadrature points, and the kernel is approximated by piecewise polynomials of degree $q = 8$ in the farfield; see Section 4. Since the data $f(t,u) = (u - \sin(t))^3$ here depends on $u$, a nonlinear equation must be solved in each time step, for which we employ a Newton method with tolerance $10^{-12}$. As a reference solution for computing the errors, we simply take the numerical solution computed on a finer grid. The results in Figure 7 again clearly show the predicted convergence rates, see (41) with $\mu = 1/2$, up to the point where numerical noise begins to dominate, as well as the linear complexity of Algorithm 2.
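The per-step Newton solve can be illustrated on a simpler model with a known solution: $u(t) = 1 - \int_0^t u(s)^3\, ds$, whose exact solution is $u(t) = 1/\sqrt{1+2t}$. The sketch below, which uses plain trapezoidal quadrature instead of the Radau-IIA/convolution-quadrature scheme of the paper, solves the implicit equation in each step by a scalar Newton iteration:

```python
import numpy as np

def solve_volterra_cubed(T=1.0, n=1000, tol=1e-12):
    """Solve u(t) = 1 - int_0^t u(s)^3 ds by trapezoidal quadrature with a
    scalar Newton iteration per step (exact solution: 1/sqrt(1+2t))."""
    h = T / n
    u = np.empty(n + 1)
    u[0] = 1.0
    for k in range(1, n + 1):
        # history part of the trapezoidal rule (fixed during Newton)
        hist = 0.5 * u[0]**3 + np.sum(u[1:k]**3)
        c = 1.0 - h * hist
        x = u[k - 1]                       # warm start from previous step
        for _ in range(20):                # Newton for x + (h/2) x^3 = c
            r = x + 0.5 * h * x**3 - c
            if abs(r) < tol:
                break
            x -= r / (1.0 + 1.5 * h * x**2)
        u[k] = x
    return u
```

For the actual example, the data $(u - \sin t)^3$ and the weakly singular kernel require convolution quadrature weights in place of the trapezoidal ones, but the structure of each step, a history sum plus a scalar Newton solve, is the same.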

Fractional diffusion with transparent boundary conditions
As a last example, which is taken from [9,35], we consider the one-dimensional fractional diffusion equation
$$\partial_t u(x,t) = \int_0^t \frac{(t-\tau)^{\alpha-1}}{\Gamma(\alpha)}\, \Delta_x u(x,\tau)\, d\tau + g(x,t), \qquad x \in \mathbb{R},\ t \in \mathbb{R}_{>0},$$
with $\alpha = 2/3$, $u(x,\cdot) \to 0$ for $|x| \to \infty$, and $g(x,0) = 0$. For the computations, we restrict the spatial domain to $x \in (-a,a)$, assume that $u_0$ and $g$ have support in $[-a,a]$, and impose transparent boundary conditions at $x = \pm a$, which involve a fractional time derivative of order $\alpha/2$ of the boundary values; we refer to [20,35] for further details on the model. The Laplace transform of the convolution kernel $k(t-s) = (t-s)^{\alpha-1}/\Gamma(\alpha)$ is here given by $\hat k(\lambda) = \lambda^{-\alpha}$. For the spatial discretisation we employ a finite difference scheme on an equidistant mesh $x_i = i\tau$, $\tau = a/M$, $i = -M, \dots, M$, and use second-order finite differences within the domain and central differences to approximate the normal derivative on the boundary; see [9,35]. For the time discretisation we employ the frequency domain version of our algorithm with $R = 30$ quadrature points for the inverse Laplace transform and polynomial degree $q = 16$ for the farfield interpolation. We note that two different convolution quadratures are required: one for the fractional derivative in $(-a,a)$, involving $\alpha$, and one for the fractional derivative of the boundary values, involving $\alpha/2$.
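A minimal sketch of the interior spatial discretisation: the standard second-order three-point stencil for $\Delta_x$ on the equidistant mesh, without the transparent boundary closure, which would additionally couple the boundary rows to the fractional history of the boundary values:

```python
import numpy as np

a, M = 1.0, 8                      # domain (-a, a), mesh width a/M
x = np.linspace(-a, a, 2 * M + 1)
h = a / M

# three-point Laplacian: (u_{i-1} - 2 u_i + u_{i+1}) / h^2
D2 = (np.diag(np.ones(2 * M), -1)
      - 2.0 * np.eye(2 * M + 1)
      + np.diag(np.ones(2 * M), 1)) / h**2

u = x**2                           # the stencil is exact for quadratics
interior = (D2 @ u)[1:-1]          # boundary rows need the TBC closure
```

Applying the stencil to $u(x) = x^2$ reproduces $u'' = 2$ exactly at all interior nodes, confirming the second-order construction.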
For the space discretisation we consider a fixed mesh with $M = 10^4$, which is fine enough for the error of the time discretisation to dominate. As a reference solution we take the method with order $p = 3$ on a finer mesh. Let us note that, due to the lack of temporal smoothness of the solution at $t = 0$, we cannot expect the full order of convergence predicted by (41); we refer to [9, Section 1] for further discussion. In Figure 8 we nevertheless observe a very good approximation in the pre-asymptotic phase and a substantial improvement in accuracy when using higher approximation orders $p$, until the numerical noise level is reached. As predicted by the theory, the computation times again increase linearly with the number of time steps. In our numerical tests, identical results were obtained for direct evaluation of $k(t,s)$ and for evaluation of the kernel via inverse Laplace transforms, which indicates that the main approximation error stems from the adaptive hierarchical approximation.
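The order reduction caused by missing smoothness at $t = 0$ can be reproduced in a one-line experiment: the composite trapezoidal rule, second order for smooth integrands, drops to order $3/2$ for $\int_0^1 \sqrt{t}\, dt$. This generic model computation is ours, not the fractional diffusion example itself:

```python
import numpy as np

def trap_sqrt(n):
    """Composite trapezoidal rule for int_0^1 sqrt(t) dt = 2/3."""
    t = np.linspace(0.0, 1.0, n + 1)
    g = np.sqrt(t)
    return (1.0 / n) * (g.sum() - 0.5 * (g[0] + g[-1]))

e1 = abs(trap_sqrt(512) - 2.0 / 3.0)
e2 = abs(trap_sqrt(1024) - 2.0 / 3.0)
rate = np.log2(e1 / e2)   # approx 1.5 instead of the smooth-case 2
```

The same mechanism, solutions behaving like fractional powers of $t$ near $t = 0$, is responsible for the reduced rates observed in Figure 8; see [9, Section 1].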