Abstract
In the present paper, a parallel-in-time discretization of linear systems of Volterra equations of type
is addressed. Concerning the analytical solution, a sufficiently general functional setting is first stated. Concerning the numerical solution, a parallel numerical scheme based on the Non-Stationary Wave Relaxation (NSWR) method for the time discretization is proposed, and its convergence is studied as well. A CUDA parallel implementation of the method is carried out in order to exploit Graphics Processing Units (GPUs), which are nowadays widely employed to reduce the computational time of many general-purpose applications. The performance of these methods is compared with that of a sequential implementation. The good performance of the parallel approach is revealed through several experiments of special interest in practical applications.
1 Introduction
Time discretization of convolution equations typically gives rise to discrete convolutions whose practical computation turns out to be very costly, in particular when compared to time discretizations of ordinary differential equations. Moreover, if one considers a system of convolution equations instead, the problem gets even worse; this is why in this context, more than ever, fast and efficient algorithms are critical.
This paper is concerned with parallel implementations of time discretization for \(D\times D\) linear systems of Volterra equations, \(D\in \mathbb {Z}^+\), of the form
where \(\bar{u}_0\in X^D\) stands for the initial data, \(\bar{f}:[0,T]\mapsto X^D\) is a given vector-valued function, X typically stands for \(\mathbb {R}\), and \(\{\textbf{K}(t)\}_{t\ge 0}\) represents a time-dependent family of \(D\times D\) matrix-valued operators.
Besides the computational cost that the time discretization itself carries, rather frequently the system (1) turns out to be very large, that is \(D\gg 0\), which makes the practical evaluation even harder. We might mention as a typical example within this framework those linear systems of equations (1) arising from the spatial discretization of linear Volterra equations
where \(x\in \mathbb {R}^d\), \(d=1,2\), or 3, and \(\{K(t)\}_{t\ge 0}\) stands for a time-dependent family of linear operators defined in a suitable functional setting. Certainly, regarding the spatial discretization of those equations, the finer the spatial mesh, the larger the system, and hence the larger the computational cost.
As a prototype equation arising in the context of (2), perhaps the simplest one may consider, take the linear equation whose convolution kernel adopts the form
where k(t) stands for a scalar function, and A is a linear operator. Very often in practical instances the operator A in (3) turns out to be the Laplacian operator \(A=\Delta \) in \(\mathbb {R}^n\), \(n=1,2\) or 3, or a fractional Laplacian \((-\Delta )^\beta \), \(0<\beta <1\) (see [1]), both of them together with suitable boundary conditions, or merely \(A\in \mathcal{M}_{D\times D}(\mathbb {R})\) (find more examples in [2]). Regarding the scalar functions k(t), let us mention those defining time fractional integrals/derivatives. Recall for instance, among many other definitions of fractional integrals/derivatives, the Riemann–Liouville time fractional integral of order \(\alpha >0\), whose associated kernel reads
$$\begin{aligned} k(t)=\frac{t^{\alpha -1}}{\Gamma (\alpha )},\qquad t>0. \end{aligned}$$(4)
This kind of equation is nowadays attracting the interest of many researchers (see [2,3,4] and references therein, among many others).
To go more in-depth into that definition, its associated definition of fractional derivative, and other definitions of fractional integrals/derivatives, there is a vast literature, among which we refer the reader to [4,5,6,7] and references therein. In this regard, let us also mention those definitions extending the non-integer fractional integrals of constant order \(\alpha >0\) to ones whose fractional order depends on time, i.e., \(\alpha (t)\) instead of a constant \(\alpha \) [8, 9]. These kinds of integrals/derivatives are discussed more precisely in Section 2.
Let us highlight that the prototype equations we are considering show a singular behavior as t tends to \(0^+\), which gives rise to singularities in the solution of the corresponding equations. Notice that in many practical cases singular Volterra equations reflect the memory effect much more accurately than regular ones. Unfortunately, their solutions require a more careful study, and this fact has to be taken into account in the numerical analysis.
On the other hand, the numerical solution of (1) has been investigated by a large number of authors. Of particular interest are the discretizations based on convolution quadratures, mainly due to their good stability properties [10,11,12,13, 15,16,17]; collocation methods, which also provide a good balance between stability and accuracy [18,19,20,21,22]; or the methods based on the numerical inversion of the Laplace transform [23], among many others.
However, the computational cost these implementations carry when (1) stands for a large system of equations represents even today a great difficulty in practical instances; therefore, devising faster and more efficient implementations is nowadays a great challenge. This issue is usually addressed in at least two different ways. On the one hand, faster and more efficient numerical methods have been proposed in the literature, see, e.g., [22, 24,25,26,27] in the framework of convolution quadratures. The second way consists of performing faster and more efficient implementations in the context of parallel computing. In the present paper we focus on the latter.
Parallel computing has emerged in recent decades as a very helpful tool to speed up computational processes, in particular those involved in numerical analysis, once adapted to this architecture. However, classical parallel architectures are very expensive even at present; instead, modern hardware technologies, namely Graphics Processing Units (GPUs), allow parallel computing at a much lower cost. GPU-based hardware was initially designed for graphical purposes; however, over the last few years it has turned into a very useful tool in the framework of scientific computing due to its low cost and computational capabilities [28]. Parallel computing in the GPU context has allowed speeding up algorithms for systems of Ordinary Differential Equations (ODEs) [29,30,31]; Partial Differential Equations (PDEs) [32,33,34]; and the ones we are interested in in the present paper, that is, systems of Volterra Integral Equations (VIEs) [35,36,37,38,39,40]; among others. At this point we have to highlight that most of the citations above related to VIEs are concerned with convolution integrals of fractional type.
The aim of this paper consists of extending some of the previous approaches within the parallel computing framework from systems of linear integral equations of fractional type to a more general class of systems of linear convolution equations. In other words, beyond systems of fractional integral equations, in the present paper we consider systems of nonlocal-in-time equations whose convolution kernels merely satisfy the existence of their Laplace transform in a certain half of the complex plane. That includes a large list of kernels and equations, some of them more precisely described in Section 2. The parallel approach considered in this paper is based on the Non-Stationary Wave Relaxation (NSWR) methods introduced in [41]. In this regard, since several truncation errors are involved in the iterative method, the study is accompanied by an accurate convergence analysis which allows one to ensure the good behavior of such algorithms in practical instances. The performance of these methods is illustrated with several numerical experiments applied to some of the systems of equations proposed in Section 2.
The paper is organized as follows. Section 2 is devoted to stating the framework of the present work, both from the continuous and the discrete point of view. In Section 3 we describe precisely the notation and the proposed method, and in Section 4 we present the main contributions of this paper, that is, the theoretical analysis related to the convergence of the proposed algorithms. Section 5 is devoted to explaining the technical resources involved in the implementation of the algorithms, and in Section 6 several numerical experiments show their good performance. Section 7 includes the most relevant conclusions and lines of future work.
2 Framework statement
2.1 Time continuous setting
First of all we state the notation we will use throughout the following sections. Let us consider \(X=\mathbb {R}\), and the system of linear Volterra equations (1) where \(\bar{u},\bar{f}:[0,T]\mapsto \mathbb {R}^D\), \(D\in \mathbb {Z}^+\) (which is expected to be large), \(\mathbb {R}^D\) is normed by \(\Vert \cdot \Vert \) (we specify below what \(\Vert \cdot \Vert \) stands for: \(L_2\)-norm, sup norm, ...), \(\{\textbf{K}(t)\}_{t\ge 0}\subset \mathcal{M}_{D\times D}(\mathbb {R})\) is a time-dependent family of matrices, and \(\bar{u}_0\in \mathbb {R}^D\) represents the initial data. In order to keep the initial data \(\bar{u}_0\) apart from the source term \(\bar{f}\), one may assume without loss of generality that \(\bar{f}(0)=\bar{0}\).
Before going further we state conditions for the well-posedness of (1). In this regard, consider the wide class of time-dependent operators \(\{\textbf{K}(t)\}_{t\ge 0}\) elementwise of exponential growth, that is, those for which there exist \(b,\beta >0\) such that if \(\textbf{K}(t)=(k_{i,j}(t))_{1\le i,j\le D}\), then
$$\begin{aligned} |k_{i,j}(t)|\le b\, e^{\beta t},\qquad t\ge 0,\quad 1\le i,j\le D. \end{aligned}$$(5)
Notice that for the sake of the simplicity of the notation, and without loss of generality, we assume the same constants b and \(\beta >0\) for each entry of \(\textbf{K}(t)\). This means that \(\textbf{K}(t)\), for \(t>0\), admits an absolutely convergent elementwise Laplace transform in \(\mathcal{D}_{\textbf{K}}:=\{z\in \mathbb {C}: \text{ Re }(z)\ge \beta \}\), denoted now and hereafter by \(\widetilde{\textbf{K}}(z) = (\widetilde{k}_{i,j}(z))_{1\le i,j\le D}\), \(z\in \mathcal{D}_{\textbf{K}}\), being \(\widetilde{k}_{i,j}(z) = \mathcal{L}(k_{i,j})(z)\). This fits time-dependent operators of the form \(\textbf{K}(t)=k(t) A\) where k(t) is a function of exponential growth, and A stands for a matrix, e.g., the one corresponding to some classical spatial discretization of the Laplacian. If in addition \(\bar{f}:(0,+\infty )\rightarrow \mathbb {R}^D\) is continuous, then (1) is well-posed [2]. Some assumptions might be lightened; however, this framework is general enough for our purposes.
Apart from the model commented on in Section 1 based on the Riemann–Liouville fractional integral, let us consider other examples fitting this context, whose implementations will serve to illustrate the performance of our methods.

1.
Denote by \(\alpha :[0,T]\rightarrow \mathbb {R}^+\) a positive function, \(\alpha (t)>0\), whose Laplace transform exists and is denoted by \(\widetilde{\alpha }(z)\), for \(z\in \mathcal{D}\supseteq \mathcal{D}_{\textbf{K}}\). Define \(k:[0,T]\mapsto \mathbb {R}\), the kernel associated to the fractional integral with order varying in time \(\alpha (t)>0\), according to the following definition
$$\begin{aligned} \partial ^{\alpha (t)} f(t) := (k*f)(t) = \int _0^t k(t-s)f(s)\,\text{ d }s,\quad t>0, \end{aligned}$$(6)where
$$\begin{aligned} k(t):=\mathcal{L}^{-1}(K)(t)\quad \text{ with }\quad K(z):= \frac{1}{z^{z\tilde{\alpha }(z)}},\qquad z\in \mathcal{D}. \end{aligned}$$(7)Notice that if \(\alpha (t)\) is constant, then definitions (4) and (6)–(7) completely agree. Consider now the equation (2) where \(K(t)=k(t)\Delta \), and \(\Delta \) stands for the 1-dimensional Laplacian operator in [0, L], with homogeneous Neumann boundary conditions. Given \(D>0\), and the mesh grid \(x_0=0<x_1<x_2<\ldots <x_{D}=L\), \(x_j=jh\), \(0\le j\le D\) and \(h=L/D\), the spatial discretization of (2) by means of a classical second-order finite difference scheme gives rise to a finite-dimensional operator A which is nothing but a \((D+1)\times (D+1)\) tridiagonal matrix. The Volterra equation (1) arises now with \(\textbf{K}(t)=k(t)A\), and \(\bar{f}\) standing for the source function f sampled on the spatial grid. Moreover, \(\bar{u}\) stands for the approximation to the time continuous solution over the spatial grid, that is \(\bar{u}(t)=(u_j(t))_{0\le j\le D}\) where \(u_j(t)\approx u(x_j,t)\), \(0\le j \le D\). Other definitions of fractional derivatives/integrals with order depending on time have been proposed in the literature. Some of these definitions have been discussed in [9], concluding that (6)–(7) might be considered the most convenient in many practical instances. This is why we adopt such a definition in this paper.

2.
A kind of generalization of (3)–(4) comes out in the framework of image processing [3]. In fact, consider the two-dimensional Laplacian in \([0,L]\times [0,L]\), denoted again (if not confusing) by \(\Delta \), with homogeneous Neumann boundary conditions again. In [3] the Laplacian applies to u(x, y, t) in the variables x and y, and \(u:[0,L]\times [0,L]\times [0,T]\mapsto \mathbb {R}\) represents an original gray scale image \(u_0(x,y)\) evolved up to the time level \(t>0\) to be u(x, y, t). Once one discretizes in space by a classical second-order finite difference scheme with \(D+1\) nodes along each coordinate axis, we have the mesh grid \(\{(x_i,y_j)\}_{0\le i,j \le D} = \{(ih,jh)\}_{0\le i,j \le D}\), \(h=L/D\), and u(x, y, t), \(0\le x,y\le L\), becomes a \((D+1)\times (D+1)\) matrix \((u_{i,j}(t))_{0\le i,j\le D}\) whose entries represent the approximations to the exact solution, that is \(u_{i,j}(t)\approx u(x_i,y_j,t)\), for \(t>0\), \(0\le i,j \le D\). Therefore, if one reshapes such a matrix as a vector by stacking columns, then the discrete Laplacian, denoted again by A, reads as a \((D+1)^2\times (D+1)^2\) five-diagonal matrix, and \((u_{i,j}(t))_{0\le i,j\le D}\) as a \((D+1)^2\times 1\) vector. One of the novelties of this model is that it relates image processing and fractional calculus and allows applying a different diffusion rate over every single pixel \((x_i,y_j)\). Such rates are handled by a particular fractional kernel \(k_{i,j}(t)\) of type (4) with order \(\alpha _{i,j}>0\), \(0\le i,j\le D\). In [3] the authors also consider the case of fractional orders depending on time, that is orders \(\alpha _{i,j}(t)\), \(0\le i,j\le D\), \(t>0\), according to the definition (6)–(7).
In any case, that approach leads to a system of linear equations of type (1) where \(\textbf{K}(t)=I(t)A\), \(I(t)=\)diag\((k_j(t))_{0\le j\le D^2}\), \(t\ge 0\), the \(\alpha _j(t)\) are defined according to (6)–(7), A stands for the \((D+1)^2\times (D+1)^2\) five-diagonal matrix mentioned above, and \(\bar{u}_0\) stands for the initial data u(x, y, 0) sampled on the spatial mesh, that is, the original image to be restored.
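The two spatial discretizations above can be assembled in a few lines. The following is a minimal sketch (the mirrored-ghost-node treatment of the Neumann conditions, the grid size, and the function names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def laplacian_1d_neumann(D, L=1.0):
    """(D+1)x(D+1) tridiagonal second-order Laplacian on [0, L] with
    homogeneous Neumann conditions imposed via mirrored ghost nodes."""
    h = L / D
    A = (np.diag(-2.0 * np.ones(D + 1))
         + np.diag(np.ones(D), 1) + np.diag(np.ones(D), -1))
    A[0, 1] = 2.0      # ghost node u_{-1} = u_1
    A[D, D - 1] = 2.0  # ghost node u_{D+1} = u_{D-1}
    return A / h**2

def laplacian_2d_neumann(D, L=1.0):
    """(D+1)^2 x (D+1)^2 five-diagonal 2D Laplacian, obtained as a
    Kronecker sum after stacking the unknowns column by column."""
    A1 = laplacian_1d_neumann(D, L)
    I = np.eye(D + 1)
    return np.kron(I, A1) + np.kron(A1, I)

A1 = laplacian_1d_neumann(8)
A2 = laplacian_2d_neumann(8)
# constants lie in the kernel of the Neumann Laplacian: zero row sums
assert np.allclose(A1.sum(axis=1), 0.0)
assert np.allclose(A2.sum(axis=1), 0.0)
```

With column stacking, `np.kron(I, A1)` contributes the tridiagonal blocks and `np.kron(A1, I)` the two outer diagonals, which yields the five-diagonal structure described above.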
2.2 Time discrete setting
In Section 1 we said that a large variety of numerical schemes for the time discretization of Volterra equations (1) exists in the literature. However, in this paper we focus on the convolution quadrature based methods introduced by Ch. Lubich in [15].
In particular, a first-order numerical scheme is considered here, and the reason is twofold. On the one hand, the lack of time regularity observed in the solutions of most singular Volterra equations has proved to be a barrier to considering higher-order time discretizations. In particular, for fractional equations of type (2)–(4), it is very well known that second-order convergence is reachable if \(1<\alpha <2\), but it is not obvious at all if the equation involves some nonlinear term. The case \(0<\alpha <1\) is even worse, for which second order is hardly reached [13]. The same applies to the solutions of equations with fractional order depending on time (6)–(7), whose regularity is precisely studied in [9]. Apart from that, the choice of a first-order numerical scheme does not yield a loss of generality in our study; on the contrary, the study might be straightforwardly extended to higher-order numerical schemes if the regularity of the corresponding continuous solutions is high enough.
Therefore we adopt the backward Euler based convolution quadrature method as the time discretization for (1). Since its stability, convergence, and other properties have already been studied [14,15,16,17,18,19,20,21,22,23,24,25,26,27], we merely recall its formulation here. In fact, let \(\tau >0\) be the time step of the discretization, \(\tau =T/N\), and \(t_n=n\tau \), for \(0\le n \le N\). Moreover, denote by q(z) the quotient of the generating polynomials of the associated linear multistep method; in the case of the backward Euler method, simply \(q(z)=1-z\). According to the above, the time discretization for (1) reads
where \(\bar{u}_n\in \mathbb {R}^D\) stands for the approximation to \(\bar{u}(t_n)\), \(\bar{f}_n:=\bar{f}(t_n)\in \mathbb {R}^D\), \(0\le n \le N\), and the convolution weights \(\textbf{K}_j\in \mathcal{M}_{D\times D}(\mathbb {R})\) are elementwise given by the expansion
For example, if \(\textbf{K}(t)=k(t) A\) with \(k(t)=t^{\alpha -1}/\Gamma (\alpha )\), \(\alpha >0\), and \(A\in \mathcal{M}_{D\times D}(\mathbb {R})\), then the convolution weights take the form
$$\begin{aligned} \textbf{K}_j = c_j\, \tau ^{\alpha } A,\qquad j\ge 0, \end{aligned}$$(10)where
$$\begin{aligned} \sum _{j=0}^{\infty } c_j z^j = (1-z)^{-\alpha },\quad \text{ that } \text{ is },\quad c_j = (-1)^j \left( {\begin{array}{c}-\alpha \\ j\end{array}}\right) ,\quad j\ge 0. \end{aligned}$$(11)
Let us highlight that in spite of the fact that the computation of the convolution weights (9) may be done directly by means of the theoretical expression (10), this task carries a very high computational cost. Instead, in practical instances that computation is performed by means of the Fast Fourier Transform (FFT), not only for kernels of fractional type but also for any kernel whose Laplace transform is available, as the ones we consider in this paper.
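In the scalar case, for instance \(\widetilde{k}(z)=z^{-\alpha }\), the FFT-based computation of the weights can be sketched as follows. The contour radius and the oversampling factor are common but illustrative choices, and `cq_weights_be` is our name, not one from the paper:

```python
import numpy as np

def cq_weights_be(k_hat, tau, N, eps=1e-12):
    """First N backward-Euler CQ weights, i.e., the Taylor coefficients of
    k_hat(q(z)/tau) with q(z) = 1 - z, via an FFT on a circle of radius rho."""
    M = 2 * N                        # oversampling reduces aliasing
    rho = eps ** (1.0 / M)           # classical radius choice: rho^M = eps
    z = rho * np.exp(2j * np.pi * np.arange(M) / M)
    samples = k_hat((1.0 - z) / tau)
    w = np.fft.fft(samples) / M      # trapezoidal rule on the circle
    return (w[:N] * rho ** (-np.arange(N))).real

# Riemann-Liouville kernel of order alpha: k_hat(s) = s^(-alpha), so the exact
# weights are tau^alpha times the coefficients of (1 - z)^(-alpha)
alpha, tau, N = 0.5, 0.01, 64
w = cq_weights_be(lambda s: s ** (-alpha), tau, N)

c = np.empty(N); c[0] = 1.0
for j in range(1, N):
    c[j] = c[j - 1] * (j - 1 + alpha) / j    # binomial recursion
assert np.allclose(w, tau ** alpha * c, atol=1e-9)
```

The same routine applies verbatim to any kernel whose Laplace transform is available, e.g., the variable-order transform \(K(z)\) of (7), by changing `k_hat`.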
For the convenience of readers we recall a result related to the convolution weights (9) which will be very useful in the proofs of our results (see Corollary 3.2 in [16]). In fact, there exists \(C>0\) depending on \(\textbf{K}\) and T but independent of \(\tau \) and n, such that the convolution weights in (9) satisfy the bound
Finally notice that, without loss of generality, now and hereafter we consider \(\mathcal{M}_{D\times D}(\mathbb {R})\) normed by the matrix sup norm, denoted, if not confusing, merely by \(\Vert \cdot \Vert \).
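Gathering (8)–(9), a plausible sequential driver for the backward Euler CQ scheme reads as follows; we assume the quadrature runs over \(j=0,\ldots ,n\) with the implicit \(j=n\) term moved to the left-hand side, and all names are ours rather than the paper's:

```python
import numpy as np

def be_cq_solve(Kw, u0, f):
    """Sequential backward-Euler CQ stepping for
    u_n = u_0 + sum_{j=0}^{n} K_{n-j} u_j + f_n.

    Kw : (N+1, D, D) convolution weights, Kw[j] ~ K_j in (9);
    u0 : (D,) initial data;  f : (N+1, D) source samples with f[0] = 0."""
    N1, D, _ = Kw.shape
    u = np.zeros((N1, D))
    u[0] = u0
    lhs = np.eye(D) - Kw[0]                 # implicit j = n term
    for n in range(1, N1):
        mem = sum(Kw[n - j] @ u[j] for j in range(n))   # memory, j = 0..n-1
        u[n] = np.linalg.solve(lhs, u0 + f[n] + mem)
    return u

# toy check: D = 1, only K_0 = 1/2 nonzero, so u_n = 2 (u_0 + f_n)
Kw = np.zeros((3, 1, 1)); Kw[0, 0, 0] = 0.5
u = be_cq_solve(Kw, np.array([1.0]), np.array([[0.0], [1.0], [2.0]]))
assert np.allclose(u[:, 0], [1.0, 4.0, 6.0])
```

The growing memory sum is precisely the cost the parallel approach of the next sections is designed to alleviate.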
3 Discrete NSWR methods for linear Volterra equations
The Non-Stationary Wave Relaxation (NSWR) methods for integral equations of type
have been introduced in [41] and consist of iterative methods that provide a timedependent sequence \(\{y^{\eta }(t)\}_{\eta \ge 0}\), throughout the difference equation
where the set of functions \(\mathcal{F}^{(\eta )}:\mathbb {R}^+\times \mathbb {R}^+ \times \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) satisfy the equality
The following facts have already been noticed in [36]. Firstly, it has been observed that if one applies the iterative process over the whole time interval, then the convergence of the method slows down noticeably as the number of time nodes grows. To overcome this problem, the solution adopted in [36], and the one we adopt here, consists of splitting the whole interval [0, T] into subintervals/windows of length \(b\tau \), \(b\in \mathbb {Z}^+\), so that
The iterative method (14) reads now, for \(t\in (rb\tau ,rb\tau +b\tau ]\),
In the case of the system (1), the method (17) reads
for \(\eta =1,2,\ldots \), where, for the simplicity of the notation, and having in mind that the equations we are considering in the present paper are of convolution type, we have replaced \(\mathcal{F}^{(\eta )} (t,s,\bar{u},\bar{v}) \) by \(\mathcal{F}^{(\eta )} (t-s,\bar{u},\bar{v})\), that is, \(\mathcal{F}^{(\eta )}\) is now a function \(\mathcal{F}^{(\eta )}:\mathbb {R}^+ \times \mathbb {R}^D \times \mathbb {R}^D \rightarrow \mathbb {R}^D\). The condition (15) then reads
Therefore we discretize (17) according to the numerical scheme (8)–(9) in the time mesh \(\{t_n\}_{1\le n\le N}\), giving rise to the discrete NSWR method
for \(t_n\in (rb\tau ,rb\tau +b\tau ]\). Once again, for the sake of the simplicity of the notation, we have replaced \(\mathcal{F}^{(\eta )}(t_n-t_j, \bar{u}_j^{(\eta -1)},\bar{u}_j^{(\eta )})\) by \(\mathcal{F}_{n-j}^{(\eta )}(\bar{u}_j^{(\eta -1)},\bar{u}_j^{(\eta )})\).
The iterative method (20) provides at every single time step \(t_n=n\tau \) a sequence of vectors \(\{\bar{u}_n^{(\eta )}\}_{\eta \ge 0}\), for which the number of iterations required to achieve the stated accuracy at time level \(t_n\) will depend solely on the current window, and not on the time step \(t_n\) itself. That is, for the rth window, \(0\le r\le R-1\), there will be a threshold \(m_r>0\) so that the iterative method reaches the required accuracy within at most \(m_r\) iterations. Therefore we adopt \(\bar{u}_n^{(m_r)}\) as the numerical approximation to \(\bar{u}(t_n)\), that is \(\bar{u}_n^{(m_r)}\approx \bar{u}(t_n)\), for every \(t_n\in (t_{(r-1)b},t_{rb}]\), \(1\le n\le N\), and \(1\le r \le R\).
A convenient choice of the operators \(\mathcal{F}_j^{(\eta )}:\mathbb {R}^D\times \mathbb {R}^D\mapsto \mathbb {R}^D\), for \(0\le j\le N\), and \(\eta \ge 0\), will allow the parallelization of the method by decoupling the system at every single time level \(t_n\) into independent systems. These operators will be required to comply with two hypotheses:

[H1] All operators are Lipschitz in both variables, that is, there exists \(L>0\) such that
$$\begin{aligned} \Vert \mathcal{F}_j^{(\eta )}(\bar{u},\cdot )-\mathcal{F}_j^{(\eta )}(\bar{v},\cdot )\Vert \le L \Vert \bar{u}-\bar{v}\Vert ,\quad \Vert \mathcal{F}_j^{(\eta )}(\cdot ,\bar{u})-\mathcal{F}_j^{(\eta )}(\cdot ,\bar{v})\Vert \le L \Vert \bar{u}-\bar{v}\Vert , \end{aligned}$$(21)$$\begin{aligned} \forall \bar{u},\bar{v}\in \mathbb {R}^D, \end{aligned}$$for \(0\le j\le N\) and \(\eta \ge 0\). Without loss of generality, we assume a common Lipschitz constant \(L>0\) for both variables.

[H2] For \(0\le j\le N\), and \(\eta \ge 0\), there holds the equality
$$\begin{aligned} \mathcal{F}_j^{(\eta )} (\bar{u},\bar{u}) = \textbf{K}_j \bar{u},\qquad \forall \bar{u}\in \mathbb {R}^D,\quad 0\le j\le N, \quad \eta \ge 0, \end{aligned}$$(22)where \(\{\textbf{K}_j\}_{j\ge 0}\) stand for the quadrature weights defined in (9).
For instance, the family of linear operators \(\{\mathcal{F}_j^{(\eta )}\}_{1\le j \le N, \eta \ge 0}\), defined by
is probably the simplest one satisfying [H1], and for a convenient choice of \(N_j^{(\eta )}\) and \(M_j^{(\eta )}\) also [H2]. In particular, we focus our attention on the discrete NSWR (20) where
with
for certain values \(\{\mu ^{(\eta )}\}_{\eta \ge 0}\) to be determined later, and where I is the identity matrix in \(\mathbb {R}^D\). In our approach we avoid the dependence on j of \(\mu _j^{(\eta )}\) so we have \(\mu ^{(\eta )}\) instead of \(\mu _j^{(\eta )}\), and we might write \(M^{(\eta )}\) instead of \(M_j^{(\eta )}\).
Finally, consider a window length equal to \(\tau \), that is \(b=1\). In this case the discrete NSWR method is the so-called discrete time-point (or time-step) NSWR method and reads
Observe that, since only \(\mathcal{F}_0^{(\eta )}\) appears and it does not in fact depend on the subindex j, we denote this operator simply by \(\mathcal{F}^{(\eta )}\). Moreover, in the same manner, \(N_j^{(\eta )}\) and \(M_j^{(\eta )}\) do not depend on j and are denoted by \(N^{(\eta )}\) and \(M^{(\eta )}\), respectively. The time-step NSWR we propose reads as
The class of iterative methods defined by (23)–(26) is known as Richardson discrete time-point NSWR methods, and they are the subject of our study in this paper.
Before ending this section several comments have to be made. On the one hand, if \(\mu ^{(\eta )}=0\), for all \(\eta \), then the iterative method (23)–(26) reduces to the functional iteration method.
Moreover, although we consider here a time-point NSWR method, the results and the ideas of the proofs straightforwardly extend to larger windows, \(b>1\).
Finally observe that the choice (24) allows decoupling the \(D\times D\) system (26) into D linear equations which can be solved independently; in other words, this allows the parallelization of the algorithm.
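To make the decoupling concrete, the following single-time-level sketch performs the Richardson sweeps with the \(\mu ^{(r)}\) taken as the eigenvalues of \(\textbf{K}_0\); the diagonal test matrix stands in for a discrete Laplacian and, like the right-hand side, is purely illustrative:

```python
import numpy as np

# One time level (n = 1, no memory term): solve (I - K0) u = g by the sweeps
# u <- ((K0 - mu I) u + g) / (1 - mu); each sweep costs one matvec with the
# PREVIOUS iterate plus D independent scalar divisions, hence it parallelizes.
rng = np.random.default_rng(0)
D = 8
K0 = -np.diag(rng.uniform(0.1, 1.0, D))     # illustrative stand-in, sigma(K0) < 0
g = rng.standard_normal(D)                  # plays the role of u_0 + f_1
u_exact = np.linalg.solve(np.eye(D) - K0, g)

mus = np.diag(K0)                           # mu^(r) = eigenvalues of K0
u = np.zeros(D)
for mu in mus:
    u = ((K0 - mu * np.eye(D)) @ u + g) / (1.0 - mu)

# after sweeping the whole spectrum the error operator vanishes
assert np.allclose(u, u_exact)
```

Each sweep multiplies the error by \((\textbf{K}_0-\mu ^{(r)}I)/(1-\mu ^{(r)})\), which is the mechanism analyzed in Section 4.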
4 Main results
The first issue to be observed is that, as typically occurs with iterative methods, the method (23)–(26) does not allow one to reach the exact solution \(\bar{u}_n\) at any single step \(t_n\) within a finite number of iterations, even in cases where the exact solution could theoretically be achieved. This is mainly due to round-off errors. But round-off errors are not the only issue: even if the exact solution can theoretically be achieved in a finite number of iterations, such a number may be so large that it makes the computational effort unaffordable; in other words, the required number of iterations might be so large as to make the computational advantages of the parallelization approach negligible. Keeping this in mind, assume that, for \(1\le n \le N\), we will merely have an approximation \(\bar{u}_n^{(m_n)}\) instead of the exact value \(\bar{u}_n\). The numerical solution \(\bar{u}_n^{(m_n)}\) will stand for the true numerical approximation to \(\bar{u}(t_n)\), \(1\le n\le N\), under a previously stated tolerance.
Observe that in the case of the Richardson time-step NSWR methods the number of iterations in each window obviously coincides with the number of iterations at every single time level, which is precisely our case.
The first result of this section concerns the form the error takes for the numerical scheme (23)–(26), so first denote the error
Denote also
$$\begin{aligned} P_{\eta }(x):=\prod _{r=0}^{\eta -1}\left( x-\mu ^{(r)}\right) ,\qquad \eta \ge 1, \end{aligned}$$(29)
where \(\{\mu ^{(r)}\}_{r\ge 0}\) is the set of values involved in (24). These values are discussed below. Admit (with a slight abuse of notation) that the evaluation of \(P_{\eta }(x)\) may be done matrixwise. Therefore we have the following lemma.
Lemma 1
The error (28) yielded by the numerical scheme (23)–(26) writes as
where
Proof
The proof straightforwardly follows by solving some difference equations. We consider first the case \(n=1\), for \(\eta \ge 0\). From (28) and (24)–(25) we have
Definition (25), and adding and subtracting \(\textbf{K}_0 \bar{u}_1^{(\eta )}\), lead to
Finally, adding and subtracting \( N^{(\eta )} \bar{u}_1\) leads to
Therefore, recursively and by definitions (25) it follows that
For \(n>1\) and \(\eta \ge 0\), in a similar manner as for \(n=1\), we have again from (28) and (24)–(25)
Now, adding and subtracting \(\textbf{K}_0 \bar{u}_n^{(\eta )}\) first, then adding and subtracting \( N^{(\eta )} \bar{u}_n\), together with definition (25), lead to
where \(\bar{e}_j^{(m_j)}\) stands for the error of the iterative process after \(m_j\) iterations, once the threshold is reached, at time step \(t_j\), \(1\le j\le N\). Therefore from (32) we have
The solution of the difference equation (33) straightforwardly writes
Therefore the statement of the lemma follows.
In view of the error expression (30)–(31), it can be observed that the operator \(\textbf{K}_0\) plays a key role; note also that \(\textbf{K}_0\) expresses as
in particular if \(\{\textbf{K}(t)\}_{t\ge 0}\) are the operators defined in (10), then \(\textbf{K}_0=\tilde{k}(1/\tau ) A = \tau ^\alpha A\).
This suggests that the choice of the parameters \(\{\mu ^{(r)}\}_{r\ge 0}\) in (29) must be related to the eigenvalues of \(\textbf{K}_0\) as in fact does occur in our approach. In this regard denote the spectrum of \(\textbf{K}_0\) as
and assume that the eigenvalues are sorted in increasing order of absolute value. For the sake of simplicity of notation, and without loss of generality, assume all of them have multiplicity one, that is \( 0 \le \Vert \lambda _1 \Vert< \Vert \lambda _2 \Vert< \Vert \lambda _3 \Vert< \ldots < \Vert \lambda _D \Vert \). Therefore let us consider in (29)
Notice that in that case the iterative method (26) reaches the exact solution, at least theoretically, in a finite number of iterations since
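Indeed, with this choice the error recursion behind Lemma 1 telescopes over the eigenvalues. Assuming \(\textbf{K}_0\) diagonalizable (a reconstruction of the standard argument, stated here for completeness),
$$\begin{aligned} \bar{e}^{(D)} = \prod _{r} \frac{\textbf{K}_0-\mu ^{(r)} I}{1-\mu ^{(r)}}\, \bar{e}^{(0)} = \frac{P_D(\textbf{K}_0)}{P_D(1)}\, \bar{e}^{(0)} = \bar{0}, \end{aligned}$$
since \(P_D\) vanishes on the whole spectrum \(\sigma (\textbf{K}_0)\) and hence \(P_D(\textbf{K}_0)=0\).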
However, as we have discussed above, in spite of the fact that the exact solution may theoretically be achieved after D iterations, since we expect D to be quite large, that property will be useless from the practical point of view due to the high computational cost carried by the large number of iterations required. And even if the number of iterations is acceptable from the computational point of view, the exact solution will hardly ever be reached due to cumulative round-off errors.
That is why a compromise solution is typically adopted: we give up the exact solution of the iterative method and instead opt for a strategy to obtain an approximate solution within an acceptable computational effort. In this regard we fix a threshold expected to be reached at every time step \(t_n\) within a number of iterations, say \(m_n\), expected to be much smaller than D, giving rise to the numerical approximation we are looking for.
Let us highlight that some authors made use of Chebyshev polynomials and the Minimax property they enjoy [35, 36, 40, 41]. This approach consists of taking the set \(\{\mu ^{(r)}\}_{r\ge 0}\) as the roots of the Chebyshev polynomial rescaled to the interval bounded by the eigenvalues of \(\textbf{K}_0\) with minimum and maximum absolute value, say \(\lambda _1\) and \(\lambda _D\), that is
Our approach does not allow the use of the Minimax property if this choice is made. Moreover, it has a disadvantage: the choice (36) requires the computation of the whole spectrum of \(\textbf{K}_0\), which for a large D may be highly costly; in spite of this, such a computation is certainly done once and for all, and in the particular case of the one-dimensional Laplacian the spectrum is explicitly known. On the contrary, the choice (38) only requires the eigenvalues \(\lambda _1\) and \(\lambda _D\), which is much less expensive.
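For completeness, the rescaled Chebyshev roots behind the choice (38) can be generated as follows (the endpoints below are illustrative stand-ins for \(\lambda _1\) and \(\lambda _D\)):

```python
import numpy as np

def chebyshev_mus(lam_min, lam_max, m):
    """Roots of the degree-m Chebyshev polynomial rescaled from [-1, 1]
    to the interval [lam_min, lam_max]."""
    theta = (2.0 * np.arange(1, m + 1) - 1.0) * np.pi / (2.0 * m)
    return 0.5 * (lam_min + lam_max) + 0.5 * (lam_max - lam_min) * np.cos(theta)

mus = chebyshev_mus(-4.0, -0.1, 5)   # endpoints mimic lambda_1 and lambda_D
assert np.all((mus > -4.0) & (mus < -0.1))
```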
Since the proof of the main result follows the same ideas for both choices, for the sake of simplicity of presentation we opted in this paper for (36). In any case, the presence of a memory term in (30) makes new assumptions necessary in both cases in order to achieve error stability. In fact, if the choice (36) is made, then the new assumption is related to \(\sigma (\textbf{K}_0)\) and reads

[H3] The spectrum of \(\textbf{K}_0\) is located in \(\mathbb {C}^-:=\{z\in \mathbb {C}: \text{ Re }(z)<0\}\), except possibly the eigenvalue 0, and there exists \(0<\rho <1\) such that
$$\begin{aligned} \frac{\Vert \lambda _j-\lambda _r\Vert }{\Vert 1-\lambda _r\Vert }\le \rho , \qquad \lambda _j,\lambda _r\in \sigma (\textbf{K}_0). \end{aligned}$$(39)
Two comments must be made on Hypothesis [H3]. First of all, [H3] is in general not too demanding in view of the form \(\textbf{K}_0\) takes in many cases, for instance in (10), where \(\textbf{K}_0\) writes as \(\textbf{K}_0 = \tau ^\alpha A\), \(\alpha >0\). In that case [H3] reduces to taking the time step \(\tau \) small enough.
Moreover, in case \(\sigma (\textbf{K}_0)\not \subset \mathbb {C}^-\), a new family of operators whose spectrum is located in \(\mathbb {C}^-\) may be obtained by conveniently shifting the original one. Notice that although this hypothesis may look strange, it is not, since the spectrum of \(\textbf{K}_0\) is closely related to the spectrum of \(\textbf{K}\) and consequently to the conditions for the well-posedness of (1).
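A quick numerical check of [H3] in the setting \(\textbf{K}_0=\tau ^\alpha A\) can be done with a matrix whose spectrum is known in closed form, e.g., the 1D Dirichlet Laplacian; the grid size and the values of \(\tau \) below are illustrative:

```python
import numpy as np

def h3_rho(K0_eigs):
    """Max of |lam_j - lam_r| / |1 - lam_r| over the spectrum, cf. (39)."""
    lam = np.asarray(K0_eigs)
    return max(abs(lj - lr) / abs(1.0 - lr) for lj in lam for lr in lam)

# 1D Dirichlet Laplacian on (0, 1) with D interior nodes: closed-form spectrum
D = 50
h = 1.0 / (D + 1)
eigs_A = -4.0 / h**2 * np.sin(np.arange(1, D + 1) * np.pi / (2 * (D + 1))) ** 2

alpha = 0.5
rhos = [h3_rho(tau**alpha * eigs_A) for tau in (1e-6, 1e-8, 1e-10)]
assert rhos[0] > rhos[1] > rhos[2]   # shrinking tau shrinks the ratio
assert rhos[-1] < 1.0                # [H3] holds once tau is small enough
```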
On the other hand if the choice (38) is made, one simply has to reformulate the Hypothesis [H3] in the following form

[H3’] The spectrum of \(\textbf{K}_0\) is located in \(\mathbb {C}^-\), except possibly the eigenvalue 0, and there exists \(0<\rho <1\) such that
$$\begin{aligned} \frac{\Vert \lambda _j-\mu ^{(r)}\Vert }{\Vert 1-\mu ^{(r)}\Vert }\le \rho , \qquad \lambda _j\in \sigma (\textbf{K}_0),\quad 1\le r\le D, \end{aligned}$$(40)where \(\mu ^{(r)}\) are defined according to (38).
In that case the proof of the convergence follows the same steps as in Theorem 1 below.
Before stating the following result, recall that \(m_n\) denotes the maximum number of iterations for the time step \(t_n\), which, according to the above discussion, is expected to be much smaller than D.
Theorem 1
Consider the iterative method (24)–(27), together with (36), under Hypothesis [H3]. Assume also that the bound (12) satisfies
for \(\tau \) small enough.
Then, given a tolerance \(TOL>0\), there exist \(m_n>0\), \(1\le n\le N\), such that
Notice that (41) is straightforwardly satisfied by merely taking a small enough time step; however, it can certainly be relaxed if a maximum number of iterations \(m<<D\) is fixed in advance for each time step \(t_n\), that is, \(m_n\le m\) for \(1\le n \le N\). If so, (41) may be replaced by the less demanding condition \(C\tau <1/m\).
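Since the iteration error contracts geometrically with ratio \(\rho\) (as established in the proof of Theorem 1 below), the number of inner iterations needed for a prescribed tolerance admits a rough a priori estimate. A hypothetical helper (the names and the geometric-decay reading are ours):

```python
import math

def iterations_for_tol(rho, e0, tol):
    """Smallest m with e0 * rho**m <= tol, assuming the bound e^(m) <= rho**m * e^(0)."""
    if e0 <= tol:
        return 0
    return math.ceil(math.log(tol / e0) / math.log(rho))
```

With \(\rho=0.5\) and a unit initial error, a tolerance of \(10^{-6}\) requires 20 iterations, independently of D; this is the sense in which \(m_n\) stays far below D.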
Theorem 1 states sufficient conditions to keep the error of the proposed iterative method under a given tolerance. This result is reflected in the numerical experiments, which show that reaching accurate numerical solutions is feasible with much lower computational effort than with classical sequential approaches.
Proof
The proof makes use of the error expression (30), where \(A^{(\eta )}\) and \(B^{(\eta )}\) are given by (31). According to this we have to prove that
at every time level \(t_n\).
To prove that \(\Vert A^{(\eta )}\Vert \) tends to zero as \(\eta \rightarrow +\infty \), observe that
whose spectrum consists of
Note that 0 turns out to be an eigenvalue of \(P_\eta (\textbf{K}_0)/P_\eta (1)\) of multiplicity \(\eta \), and by Hypothesis [H3], there exists \(0<\rho <1\) such that
Therefore, from (44) we straightforwardly have that
In particular, once the maximum number of iterations \(\eta =m_n\) is reached,
The first part of the proof then follows.
Concerning the term \(A^{(\eta )} \displaystyle B^{(\eta )}\sum _{j=1}^{n-1}\textbf{K}_{n-j}\bar{e}_j^{(m_j)}\), let us first show the boundedness of \(A^{(\eta )} B^{(\eta )}\). To this end we rewrite this term as follows
and in the same manner as in (46) we have
Therefore it easily follows that, for \(\eta \ge 1\),
in fact,
Proceeding recursively, and according to (12), (47), and (51), we easily arrive at
for \(1\le n\le N\). For simplicity of notation we have set \(\displaystyle \prod _{r=n}^{n-1} (1+C\tau C_r\rho )=1\).
Finally, since \(C_n\le m_n\), for \(1\le n\le N\), and applying (41), if at every single time step \(t_n\) the initial error of the iterative stage satisfies
for a given tolerance \(TOL>0\), then
and the proof ends.
It is worth noting the following:

The condition (52) may be slightly relaxed by replacing \((1+\rho )^N\) by \((1+\rho )^{n-j-2}\).

Moreover, by the convergence of the original numerical solution \(\{\bar{u}_n\}_{1\le n\le N}\), we have \(\Vert \bar{u}_{n}-\bar{u}_{n-1}\Vert =O(\tau )\). By the convergence of the iterative method \(\{\bar{u}_n^{(m_n)}\}_{1\le n\le N}\), and for a small enough time step \(\tau \), \(\bar{u}_n^{(0)}=\bar{u}_{n-1}^{(m_{n-1})}\) is therefore certainly a suitable candidate for the starting value of the iterative method at time level \(t_n\) satisfying (52).
5 CUDA implementation
The implementation of the parallel numerical scheme proposed in Section 3 for the problem (1), and theoretically analyzed in Section 4, is discussed in more depth in the present section. In particular, details related to the algorithm and the technical resources involved in the practical implementation are given below.
Instead of a MIMD (Multiple Instruction, Multiple Data) strategy, based on the cooperation of several Central Processing Units (CPUs), the proposed approach exploits a SIMD (Single Instruction, Multiple Data) methodology according to Flynn's taxonomy [42]. This methodology takes advantage of Graphics Processing Units (GPUs) and allows using them for general-purpose computing through CUDA (Compute Unified Device Architecture), introduced by NVIDIA in 2006. In fact, GPUs were originally designed for graphical tasks such as the computation of geometrical image transformations (translations, rotations, scaling, projections), rendering operations for image finishing, or rasterization.
The General-Purpose GPU (GPGPU) paradigm, instead, aims to exploit the Arithmetic Logic Units (ALUs) of GPUs. In this context, CUDA provides a general-purpose architecture based on an instruction set called Parallel Thread eXecution (PTX), and an extension set for various programming languages. In the CUDA architecture, the GPU (device) acts as a coprocessor of the CPU (host), which can invoke kernels, i.e., functions executed by the device through threads.
Threads are the fundamental elements of the CUDA parallel design, since a thread executes a set of instructions on a data subset. Moreover, they have no activation/deactivation cost, and a large number of them can be used to achieve full efficiency. Threads are organized in blocks, and blocks are arranged in a grid. Grid and blocks can be one-dimensional, two-dimensional, or three-dimensional according to the specific problem (see Fig. 1). This structure based on grid, blocks, and threads simplifies the parallel implementation and provides a coordinate set accessible through the instructions gridDim, blockDim, blockIdx, and threadIdx.
The CUDA implementation of the proposed method is based on a one-dimensional grid and one-dimensional blocks. Numerical results were obtained through Google Colab, which makes available a Tesla T4 GPU, as shown in Fig. 2. The Tesla T4 Compute Capability (CC) is 7.5; the CC category describes the GPU hardware structure and determines limits for the implementation phase.
First of all let us recall the time discretization (27) of the problem (1)
where \(F_n\) stands for
Let \(0=t_0<t_1<\cdots <t_N=T\), with \(t_n = t_0 + n\tau \), be the mesh for the time discretization. Algorithm 1 then describes the implementation of the numerical method (27). The prefix ’h’ indicates that a variable is stored by the host, whereas the prefix ’d’ precedes variables stored by the device. Initially, lines 1 and 2 fix the dimensions of the blocks and the grid, which allow dividing the operations into kernels through the provided structure. Then the weights (9) and the parameters (38) are computed according to line 3. In particular, the weights are stored in a multidimensional array d\(\_\)K \(\in \mathbb {R}^{D \times D \times N}\) containing N matrices of dimension \(D \times D\), whereas the parameters are stored in the vector mu \(= \left( \mu ^{\left( 1 \right) }, \dots , \mu ^{\left( D \right) } \right) ^T \in \mathbb {R}^{D} \).
Line 4 of the pseudocode summarizes the initialization of the starting term \(\bar{u}_0 + \bar{f} \left( t_0 \right) \), stored in the variable d\(\_\)unew. Line 5 then describes the kernel that copies the vector d\(\_\)unew into the first column of the matrix d\(\_\)un \(= \left( \bar{u}_0, \dots , \bar{u}_N \right) \in \mathbb {R}^{D \times \left( N+1\right) }\).
The proposed Richardson time-point NSWR method is based on iterations on each subinterval between two consecutive discretization points. The term (55) is then computed through the kernel at line 8 and stored in the vector d\(\_ F_n\).
The iteration on the subinterval continues while the error exceeds the tolerance and the maximum number of iterations has not been reached. Within the loop, the method is applied according to (54), and afterwards lines 13–15 compute the error estimation at the point \(t_n\) at iteration \(\eta =\)Niter:
In particular, at line 13 the kernel error stores the componentwise difference between \(\bar{u}_n^{\left( \eta \right) }\) and \(\bar{u}_n^{\left( \eta -1\right) }\) in the vector d_err. The function thrust::device_ptr provides a device pointer to this vector, and thrust::reduce then returns the error estimation (56).
Afterwards, the approximate solution d_unew related to the discretization point \(t_n\) is copied into d_uold at line 16. Finally, at line 19, the new approximation related to the last iteration \(m_n\) is stored in the matrix d_un.
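For readers without a GPU at hand, the control flow of Algorithm 1 can be mimicked sequentially in NumPy. The inner update below is a hypothetical diagonal Richardson (Jacobi-type) splitting for \(\bar u_n=\textbf{K}_0\bar u_n+F_n\), standing in for (54), with the parameters mu read off the diagonal of \(\textbf{K}_0\) as a stand-in for (38); it is a sketch under these assumptions, not the paper's exact scheme:

```python
import numpy as np

def nswr_sequential(K, f, u0, tau, tol=1e-10, max_iter=50):
    """Sequential analogue of Algorithm 1.

    K  : (N+1, D, D) array of convolution weights K_0, ..., K_N
    f  : callable t -> R^D source term
    u0 : initial data in R^D
    """
    N = K.shape[0] - 1
    U = np.zeros((N + 1, u0.size))
    U[0] = u0 + f(0.0)                      # line 4: starting term
    mu = np.diag(K[0]).copy()               # stand-in for the parameters (38)
    for n in range(1, N + 1):
        # lagged memory term F_n, cf. (55)
        Fn = u0 + f(n * tau) + sum(K[n - j] @ U[j] for j in range(1, n))
        u_old = U[n - 1].copy()
        for _ in range(max_iter):           # inner iteration on the subinterval
            u_new = u_old + (Fn + K[0] @ u_old - u_old) / (1.0 - mu)
            if np.max(np.abs(u_new - u_old)) < tol:   # error estimate (56)
                break
            u_old = u_new
        U[n] = u_new
    return U
```

Each kernel of Algorithm 1 (weights, \(F_n\), update, error reduction) corresponds to one vectorized line here; on the GPU these operations are distributed over D threads.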
Table 1 summarizes the variables related to Algorithm 1.
6 Numerical results
As a performance metric for the parallel implementations we consider the SpeedUp, i.e., the ratio between the runtimes (in milliseconds) of the sequential and parallel implementations.
For the SpeedUp assessment, the following subsections present several tests applied to practical problems. The numerical results aim to evaluate the ability of parallel computing to speed up the solution of the problem. Moreover, they also allow assessing the accuracy of the proposed approach and empirically evaluating the convergence order of the method.
Results will be presented through tables that report, in addition to the SpeedUp, the dimension D of the system (1), the sequential (CPU time) and parallel (GPU time) implementation times measured in milliseconds, and the error E with respect to the exact solution \( \bar{u}\left( t \right) \)
where \(m_n\), defined in (42), represents the iteration that allows the tolerance to be satisfied at the point \(t_n\). Finally, the tables show the Error Estimation EE, i.e., the maximal value of the errors EE\(_n^{\left( m_n\right) }\) defined in (56)
Notice that E stands for the true error yielded by the numerical method, while EE is merely an estimation of it, which is really useful as a stopping criterion in cases where the analytical solution is not available. In our experiments the analytical solution is actually known; nevertheless, we show the values of EE.
We will also show the mean number of iterations over the time steps \(t_n\), i.e.,
Moreover, let us notice that throughout the present section, the error will be measured in the sup norm.
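The quantities reported in the tables can be summarized as follows; the formulas are read directly from the text (E in the sup norm against the exact solution, EE as the maximum of the per-step estimates (56), and the mean iteration count), while the function and argument names are ours:

```python
import numpy as np

def table_metrics(U, u_exact, t, ee_per_step, iters_per_step):
    """E, EE, and mean iteration count as reported in the result tables."""
    # E: sup-norm distance to the exact solution over the mesh points
    E = max(float(np.max(np.abs(U[n] - u_exact(t[n])))) for n in range(1, len(t)))
    # EE: maximal per-step error estimation (56)
    EE = max(ee_per_step)
    mean_iters = sum(iters_per_step) / len(iters_per_step)
    return E, EE, mean_iters
```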
6.1 Test problem 1
The first test problem consists of the system (1) with \(T=1\), where
with \(1< \alpha < 2\), and A the tridiagonal matrix
representing the one-dimensional Laplacian discretization by means of the classical second-order finite differences scheme (without rescaling by the spatial mesh length).
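The tridiagonal matrix (63) is, up to the omitted \(1/h^2\) rescaling, the classical three-point stencil; a minimal constructor (the sign convention, consistent with [H3], is an assumption on our part):

```python
import numpy as np

def laplacian_1d(D):
    """Second-order finite-difference 1-D Laplacian on D points, unit mesh width."""
    return (np.diag(-2.0 * np.ones(D))
            + np.diag(np.ones(D - 1), 1)
            + np.diag(np.ones(D - 1), -1))
```

Its eigenvalues \(-4\sin^2(k\pi/(2(D+1)))\) are real and negative, which places \(\sigma(\tau^\alpha A)\) in \(\mathbb{C}^-\) as [H3] requires.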
The exact solution to test problem 1 is
Tables 2 and 3 present the results provided by sequential and parallel implementation of the numerical method for \(N=50\) and \(N=100\), respectively. Notice that the values of SpeedUp increase as the system dimension D increases. In particular, the values of SpeedUp provided by the case \(N=100\) are better than the ones provided by the case \(N=50\).
6.2 Test problem 2
The second test problem is more complex than the previous one: instead of a single value of \(\alpha \), a different value is associated with each integral equation of the system. It consists of the system (1) in which
and the vector \(\bar{f}\left( t \right) \) has the following elements:
Then the convolution operator has the form
where the matrix A is defined in (63), and \(I(t-s)\) is a diagonal matrix whose diagonal elements are
With these choices the analytical solution is the same as in test problem 1, that is, (64).
Tables 4 and 5 summarize the results obtained for \(N=50\) and \(N=100\), respectively. We also report in these tables the mean number of iterations (60) for each system dimension, because the selection of \(\alpha _1,\dots ,\alpha _D\) through pseudorandom numbers influences the total number of iterations.
Tables 4 and 5 show the increase of the SpeedUp as the system dimension D grows. In particular, in the case N\(=100\) and D\(=4096\) we obtain a SpeedUp equal to 148.7127. We also notice that the obtained SpeedUps are better than those of the previous numerical experiment.
6.3 Test problem 3
The last problem considered to test the proposed approach has the following specifications
where \(I(t-s)\) is the diagonal matrix defined in (70), and A now represents the approximation matrix obtained by a second-order finite difference scheme for the two-dimensional Laplacian operator with Neumann boundary conditions. In particular, considering \(D_x\) discretization points in the x-axis direction and \(D_y\) discretization points in the y-axis direction, the matrix A belongs to \(\mathcal{M}_{D \times D}(\mathbb {R})\) with \(D = D_x D_y\) and is defined by
for \(i=1,\dots , D_x, \quad j=1,\dots ,D_y\). Once again the spatial mesh length does not play any role in this approach, which is why it is taken as 1, with the numbers of discretization points \(D_x\) and \(D_y\) chosen as in Table 6.
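A standard way to assemble such a matrix A is as a Kronecker sum of one-dimensional operators, with the boundary rows modified for homogeneous Neumann conditions; since the exact entries of the formula above are not reproduced here, the closure used below is an assumption:

```python
import numpy as np

def laplacian_1d_neumann(D):
    """1-D second-difference matrix with homogeneous Neumann closure, unit mesh."""
    A = (np.diag(-2.0 * np.ones(D))
         + np.diag(np.ones(D - 1), 1)
         + np.diag(np.ones(D - 1), -1))
    A[0, 0] = A[-1, -1] = -1.0
    return A

def laplacian_2d_neumann(Dx, Dy):
    """Two-dimensional Laplacian on a Dx x Dy grid via the Kronecker sum."""
    return (np.kron(np.eye(Dy), laplacian_1d_neumann(Dx))
            + np.kron(laplacian_1d_neumann(Dy), np.eye(Dx)))
```

Zero row sums (constant vectors in the kernel) are the telltale of the Neumann closure.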
In this problem, in order to derive the same analytical solution of the previous test problems
the source term \(\bar{f}(t) = (f_1(t), f_2(t),\ldots ,f_D(t))^T\) is defined by
where \(a_{ij}\) stand for the entries of the matrix A, and as done in test problem 2, the values \(\alpha _1,\dots ,\alpha _D\) are generated through pseudorandom numbers and are related to each integral equation of the system.
Tables 7 and 8 present the results related to test problem 3 with N\(=50\) and N\(=100\), respectively.
Also in this case, the increase of the SpeedUp as the system dimension grows confirms the benefits of the parallelization. In fact, for D\(=4096\), the sequential times are more than a hundred times slower than their parallel counterparts for both N\(=50\) and N\(=100\).
6.4 Numerical results assessments
As mentioned above, another aim of the numerical tests is the experimental assessment of the convergence order. The analysis of the results arising from the three test problems is shown in Table 9, which presents the results obtained for a system of dimension D\(=32\) as the step size of the time discretization decreases by a factor of 2. The increase of the number of discretization points N gives rise to a decrease of the error.
The convergence, theoretically analyzed in Section 4, is thus experimentally confirmed. Moreover, Fig. 3 shows the errors in logarithmic scale and allows deducing that the error points of Table 9 follow a slope of order 1, the only exception being the second test problem in the case \(N=3200\), where a slight deviation appears.
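The observed order in Table 9 can be extracted mechanically from consecutive error values: since the step is halved at each refinement, \(E(\tau)\approx C\tau^p\) gives \(p\approx\log_2\!\big(E(\tau)/E(\tau/2)\big)\). A small helper (the names are ours):

```python
import math

def empirical_orders(errors):
    """Observed convergence orders between successive step halvings."""
    return [math.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)]
```

Errors decaying like 1/N, as in Table 9, yield orders close to 1.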
According to the results presented in the previous subsections and summarized in Fig. 4, the three test problems provide a considerable improvement in performance from sequential to parallel execution. The runtime saved allows the application of the proposed method in fields where the response time represents a key point.
In fact, a widespread application field of the analyzed problem nowadays is image and video processing. Systems of type (1) rather frequently appear in problems related to image and video processing (image filtering, clustering, video restoration). In this framework one of the main drawbacks is the size of the images and/or frames to be processed, which gives rise to computations with very large systems of equations (linear or even nonlinear). What is harder, in many instances results are required in real time: restoration of damaged patches, or disocclusion of hidden parts of frames in videos. In this regard see for instance the recent paper [43], where linear Volterra equations of type (1) are proposed for video restoration. In this context the response time is crucial, and for the related algorithms the possibility of reducing the runtime represents a great challenge, so any advantage is welcome. Moreover, the convergence analysis confirms the soundness of the proposed approach.
7 Conclusions and future works
NSWR methods and their parallel implementation by means of GPUs have shown very good performance, particularly when compared to sequential implementations. The good performance extends beyond classical fractional integral equations to more general linear systems of Volterra equations, such as the ones shown in Section 6.
What is more, the numerical results reached in Section 6 are perfectly in line with the theoretical convergence results of Section 4.
The statement of Theorem 1 may be extended without significant differences to a wider class of equations of type (1). In fact, one may assume that \(\textbf{K}(t)\) admits an elementwise Laplace transform \(\widetilde{\textbf{K}}(z) =(\tilde{k}_{i,j}(z))_{1\le i,j\le D}\) whose entries are analytic in the complex sector \(S_\varphi :=\{z\in \mathbb {C}: |\arg (z)| < \varphi \}\), \(0<\varphi <\pi /2\), and for which there exist \(M,\nu >0\) such that
In that case Corollary 3.2 in [16] extends (12) and states that there exists \(C>0\) such that
Under this hypothesis Theorem 1 follows in similar fashion.
We finally highlight that this paper focuses on the accuracy and SpeedUp analysis of the proposed parallelized approach for general purposes, that is, for linear systems of general Volterra equations rather than for particular applications. The application of this approach in the field of image processing represents a future development of the present work.
Availability of supporting data
Not applicable
References
Stinga, P.R.: User’s guide to the fractional Laplacian and the method of semigroups. Fractional Differential Equations, vol. 2, pp. 235–266. De Gruyter. Anatoly Kochubei and Yuri Luchko Edts., (2019). 10.1515/9783110571660012
Prüss, J.: Evolutionary Integral Equations and Applications, 1st edn. Modern Birkhäuser Classics. Birkhäuser, Basel, (2012). 10.1007/9783034804998
Cuesta, E., Durán, A., Kirane, M.: On evolutionary integral models for image restoration. In: Tavares J., Natal Jorge R. (Edts) Developments in Medical Image Processing and Computational Vision. Lecture Notes in Computational Vision and Biomechanics, vol. 19, pp. 241–260. Springer, (2015). 10.1007/9783319134079 15
Hilfer, R.: Applications of Fractional Calculus in Physics. World Scientific, Singapore, (2000). https://doi.org/10.1142/3779
Podlubny, I.: Fractional Differential Equations. Mathematics in Science and Engineering. Academic Press, (1999)
Kilbas, A.A., Srivastava, H.M., Trujillo, J.J.: Theory and Applications of Fractional Differential Equations. North Holland Mathematics Studies 204. Elsevier B. V., (2006). https://doi.org/10.1016/S03040208(06)800010
Das, S.: Functional Fractional Calculus, 2nd edition Springer, (2011). https://doi.org/10.1007/9783642205453
Cuesta, E., Kirane, M., Alsaedi, A., Ahmad, B.: On the sub–diffusion fractional initial value problem with time variable order. Adv. Nonlinear Anal. 10(1), 1301–1315 (2021). anona20200182
Cuesta, E., Ponce, R.: Wellposedness, regularity, and asymptotic behavior of continuous and discrete solutions of linear fractional integrodifferential equations with timedependent order. Electron. J. Differential Equations 2018(173), 1–27 (2018)
Banjai, L., Lubich, Ch.: An error analysis of RungeKutta convolution quadrature. BIT 51(3), 483–496 (2011). https://doi.org/10.1007/s105430110311y
Banjai, L., Lubich, Ch., Melenk, J.M.: RungeKutta convolution quadrature for operators arising in wave propagation. Numer. Math. 119(1), 1–20 (2011). https://doi.org/10.1007/s002110110378z
Calvo, M.P., Cuesta, E., Palencia, C.: RungeKutta convolution quadrature methods for wellposed equations with memory. Numer. Math. 107, 589–614 (2007). https://doi.org/10.1007/s0021100701079
Cuesta, E., Lubich, Ch., Palencia, C.: Convolution quadrature time discretization of fractional diffusionwave equations. Math. Comput. 75(254), 673–696 (2006). https://doi.org/10.1090/S0025571806017881
Cuesta, E., Palencia, C.: A numerical method for an integrodifferential equation with memory in Banach spaces: Qualitative properties. SIAM J. Numer. Anal. 41(4), 1232–1241 (2003). https://doi.org/10.1137/S0036142902402481
Lubich, Ch.: Fractional linear multistep methods for AbelVolterra integral equations of the second kind. Math. Comput. 45(172), 463–469 (1985). https://doi.org/10.2307/2008136
Lubich, Ch.: Convolution quadrature and discretized operational calculus I. Numer. Math. 52, 129–145 (1988). https://doi.org/10.1007/BF01398686
Lubich, Ch.: On the multistep time discretization of linear initialboundary value problems and their boundary integral equations. Numer. Math. 67, 365–389 (1994). https://doi.org/10.1007/s002110050033
Brunner, H.: Collocation Methods for Volterra Integral and Related Functional Differential Equations. Cambridge Monographs on Applied and Computational Mathematics (15). Cambridge University Press, (2004). https://doi.org/10.1017/CBO9780511543234
Cardone, A., Conte, D., D’Ambrosio, R., Paternoster, B.: Collocation methods for Volterra integral and integro–differential equations: A review. Axioms 7(45), 1–19 (2018)
Cardone, A., Conte, D., Paternoster, B.: Stability of twostep spline collocation methods for initial value problems for fractional differential equations. Commun. Nonlinear Sci. 115, 106726 (2022). https://doi.org/10.1016/j.cnsns.2022.106726
Conte, D., D’Ambrosio, R., Paternoster, B.: Twostep diagonallyimplicit collocation based methods for Volterra integral equations. Appl. Numer. Math. 62(10), 1312–1324 (2012). https://doi.org/10.1016/j.apnum.2012.06.007
Conte, D., Prete, I.D.: Fast collocation methods for Volterra integral equations of convolution type. J. Comput. Appl. Math. 196, 652–663 (2006). https://doi.org/10.1016/j.cam.2005.10.018
LópezFernández, M., Palencia, C.: On the numerical inversion of the Laplace transform of certain holomorphic mappings. Appl. Numer. Math. 51(2–3), 289–303 (2004). https://doi.org/10.1016/j.apnum.2004.06.015
Capobianco, G., Conte, D., del Prete, I., Russo, E.: Stability analysis of fast numerical methods for Volterra integral equations. Electron. Trans. Numer. Anal. 30, 305–322 (2008)
Hairer, E., Lubich, Ch., Schlichte, M.: Fast numerical solution of nonlinear Volterra convolution equations. SIAM J. Numer. Anal. 6(3), 532–541 (1985). https://doi.org/10.1137/0906037
LópezFernández, M., Lubich, Ch., Schädle, A.: Adaptive, fast, and oblivious convolution in evolution equations with memory. SIAM J. Sci. Comp. 30(2), 1015–1037 (2008). https://doi.org/10.1137/060674168
Schädle, A., LópezFernández, M., Lubich, Ch.: Fast and oblivious convolution quadrature. SIAM J. Scient. Comput. 28(2), 621–639 (2006). https://doi.org/10.1137/050623139
Oancea, B., Andrei, T., Dragoescu, R.M.: GPGPU Computing. In: Proceedings of the CKS International Conference, 2012. arXiv:1408.6923 (2014)
Burrage, K.: Parallel and Sequential Methods for Ordinary Differential Equations. Numerical Analysis and Scientific Computation. Oxford University Press, New York (1995)
Burrage, K., Dyke, C., Pohl, B.: On the performance of parallel waveform relaxations for differential systems. Appl. Numer. Math. 20(1–2), 39–55 (1996). https://doi.org/10.1016/0168-9274(95)00116-6
Sand, J., Burrage, K.: A Jacobi waveform relaxation method for ODEs. SIAM J. Sci. Comp. 20(2), 534–552 (1998). https://doi.org/10.1137/S1064827596306562
Courvoisier, Y., Gander, M.J.: Optimization of Schwarz waveform relaxation over short time windows. Numer. Algorithms 64, 221–243 (2013). https://doi.org/10.1007/s110750129662y
Lent, J.V., Vandewalle, S.: Multigrid waveform relaxation for anisotropic partial differential equations. Numer. Algorithms 31(1), 361–380 (2002). https://doi.org/10.1023/A:1021191719400
Zhang, H., Jiang, Y.L.: A note on the \(H^1\)convergence of the overlapping Schwarz waveform relaxation method for the heat equation. Numer. Algorithms 66, 299–307 (2014). https://doi.org/10.1007/s1107501397347
Capobianco, G., Conte, D.: An efficient and fast parallel method for Volterra integral equations of Abel type. J. Comput. Appl. Math. 189(1–2), 481–493 (2006). https://doi.org/10.1016/j.cam.2005.03.056
Capobianco, G., Conte, D., del Prete, I.: High performance parallel numerical methods for Volterra equations with weakly singular kernels. J. Comput. Appl. Math. 228(2), 571–579 (2009). https://doi.org/10.1016/j.cam.2008.03.027
Cardone, A., Messina, E., Russo, E.: A fast iterative method for discretized VolterraFredholm integral equations. J. Comput. Appl. Math. 189(1–2), 568–579 (2006). https://doi.org/10.1016/j.cam.2005.05.018
Conte, D., D’Ambrosio, R., Beatrice, P.: GPUacceleration of waveform relaxation methods for large differential systems. Numer. Algorithms 71(2), 293–310 (2016). https://doi.org/10.1007/s1107501599936
Califano, G., Conte, D.: Optimal Schwarz waveform relaxation for fractional diffusionwave equations. Appl. Numer. Math. 127, 125–141 (2018). https://doi.org/10.1016/j.apnum.2018.01.002
Conte, D., Paternoster, B.: Parallel methods for weakly singular Volterra integral equations on GPUs. Appl. Numer. Math. 114, 30–37 (2017). https://doi.org/10.1016/j.apnum.2016.04.006
Capobianco, G., Crisci, M.R., Russo, E.: Nonstationary waveform relaxation methods for Abel integral equations. J. Integral Equations Appl. 16(1), 53–65 (2004). https://doi.org/10.1216/jiea/1181075258
Mishr, I., Karakaya, Z.: Teaching parallel computing concepts using reallife applications. Int. J. Engineer. Edu. 32(2), 772–781 (2016)
Cuesta, E., Finat, J., Sánchez, J.: Greylevel intensity measurements processing by means of Volterra equations and Least Squares Method for video restoration. Phys. Scr. In press, 1–25 (2023)
Acknowledgements
The author E. Cuesta would like to thank GIR (Research Recognized Group) of the University of Valladolid, and IMUVa (Mathematic Institute of the University of Valladolid).
Funding
The authors D. Conte and C. Valentino are members of the GNCS group. This work has been partially supported by the GNCS-INDAM project and by the Italian Ministry of University and Research (MUR), through the PRIN 2017 project (No. 2017JYCLSF) “Structure preserving approximation of evolutionary problems”. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Author information
Authors and Affiliations
Contributions
D. Conte, C. Valentino, and E. Cuesta shared the writing and preparation of the main manuscript text. C. Valentino mostly took charge of the code performance. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Ethical approval
Not applicable
Competing interests
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Conte, D., Cuesta, E. & Valentino, C. Non-stationary wave relaxation methods for general linear systems of Volterra equations: convergence and parallel GPU implementation. Numer Algor 95, 149–180 (2024). https://doi.org/10.1007/s11075-023-01567-0