## 1 Introduction

Time discretization of convolution equations typically gives rise to discrete convolutions whose practical computation is very costly, in particular when compared with time discretizations of ordinary differential equations. The problem gets even worse when one considers a system of convolution equations, which is why efficient and fast algorithms are critical in this context.

This paper is concerned with parallel implementations of time discretization for $$D\times D$$ linear systems of Volterra equations, $$D\in \mathbb {Z}^+$$, of the form

\begin{aligned} \bar{u}(t) = \bar{u}_0 + \int _0^t \textbf{K}(t-s) \bar{u}(s)\,\text{ d }s + \bar{f}(t),\qquad 0<t\le T, \end{aligned}
(1)

where $$\bar{u}_0\in X^D$$ stands for the initial data, $$\bar{f}:[0,T]\mapsto X^D$$ is a given vector-valued function, X typically stands for $$\mathbb {R}$$, and $$\{\textbf{K}(t)\}_{t\ge 0}$$ represents a time-dependent family of $$D\times D$$ matrix-valued operators.

Besides the computational cost of the time discretization itself, the system (1) is frequently very large, that is $$D\gg 1$$, which makes the practical evaluation even harder. A typical example within this framework is given by the linear systems (1) arising from the spatial discretization of linear Volterra equations

\begin{aligned} u(t,x) = u_0(x) + \int _0^t K(t-s) u(s,x)\,\text{ d }s + f(t,x),\qquad t>0, \end{aligned}
(2)

where $$x\in \mathbb {R}^d$$, $$d=1,2$$, or 3, and $$\{K(t)\}_{t\ge 0}$$ stands for a time-dependent family of linear operators defined in a suitable functional setting. Naturally, the finer the spatial mesh used in the spatial discretization of those equations, the larger the resulting system and hence the higher the computational cost.

As a prototype equation arising in the context of (2), perhaps the simplest one, consider the linear equation whose convolution kernel takes the form

\begin{aligned} K(t)=k(t)A,\qquad t>0, \end{aligned}
(3)

where k(t) stands for a scalar function, and A is a linear operator. Very often in practice the operator A in (3) is the Laplacian $$A=\Delta$$ in $$\mathbb {R}^n$$, $$n=1,2$$ or 3, or a fractional Laplacian $$(-\Delta )^\beta$$, $$0<\beta <1$$ (see [1]), both together with suitable boundary conditions, or merely a matrix $$A\in \mathcal{M}_{D\times D}(\mathbb {R})$$ (see [2] for more examples). Regarding the scalar functions k(t), let us mention those defining time fractional integrals/derivatives. Recall for instance, among many other definitions of fractional integrals/derivatives, the Riemann–Liouville time fractional integral of order $$\alpha >0$$, whose associated kernel reads

\begin{aligned} k(t) := \left\{ \begin{array}{ccl} \displaystyle \frac{t^{\alpha -1}}{\Gamma (\alpha )}, &{}\text{ if }&{} t>0,\\[8pt] 0, &{}\text{ if }&{} t\le 0. \end{array}\right. \end{aligned}
(4)

This kind of equation is nowadays attracting the interest of many researchers (see [2,3,4] and references therein, among many others).

For more details on that definition, its associated fractional derivative, and other definitions of fractional integrals/derivatives, we refer the reader to the vast literature on the subject, e.g., [4,5,6,7] and references therein. In this regard let us also mention the definitions extending fractional integrals of constant order $$\alpha >0$$ to those whose fractional order depends on time, i.e., $$\alpha (t)$$ instead of a constant $$\alpha$$ [8, 9]. These kinds of integrals/derivatives are discussed more precisely in Section 2.

Let us highlight that the prototype equations we are considering exhibit singular behavior as t tends to $$0^+$$, which gives rise to singularities in the solutions of the corresponding equations. Singular Volterra equations often reflect the memory effect in practical applications much more accurately than regular ones. Unfortunately, their solutions require a more careful study, and this fact has to be taken into account in the numerical analysis.

The numerical solution of (1) has been investigated by a large number of authors. Of particular interest are the discretizations based on convolution quadratures, mainly due to their good stability properties [10,11,12,13, 15,16,17]; the collocation methods, which also provide a good balance between stability and accuracy [18,19,20,21,22]; and the methods based on the numerical inversion of the Laplace transform [23], among many others.

However, the computational cost of these implementations when (1) stands for a large system of equations remains a great difficulty in practice, and developing faster and more efficient implementations is still a major challenge. This issue is usually addressed in at least two different ways. On the one hand, faster and more efficient numerical methods have been proposed in the literature, see, e.g., [22, 24,25,26,27] in the framework of convolution quadratures. On the other hand, one may pursue faster and more efficient implementations in the context of parallel computing. In the present paper we focus on the latter.

Parallel computing has emerged in the last decades as a very helpful tool to speed up computational processes, in particular those involved in numerical analysis, once they are adapted to this architecture. Traditional parallel architectures remain very expensive; modern hardware, however, allows parallel computing at a much lower cost by means of Graphics Processing Units (GPUs). GPU-based hardware was initially designed for graphical purposes; however, over the last few years it has become a very useful tool in scientific computing due to its low cost and its computational capabilities [28]. Parallel computing on GPUs has made it possible to speed up algorithms for systems of Ordinary Differential Equations (ODEs) [29,30,31]; Partial Differential Equations (PDEs) [32,33,34]; and the ones we are interested in here, that is systems of Volterra Integral Equations (VIEs) [35,36,37,38,39,40]; among others. At this point we have to highlight that most of the references above related to VIEs are concerned with convolution integrals of fractional type.

The aim of this paper is to extend some of the previous approaches within the parallel computing framework from systems of linear integral equations of fractional type to a more general class of systems of linear convolution equations. In other words, beyond systems of fractional integral equations, in the present paper we consider systems of nonlocal in time equations whose convolution kernels are merely required to admit a Laplace transform in a certain complex half-plane. This includes a large list of kernels and equations, some of which are described more precisely in Section 2. The parallel approach considered in this paper is based on the Non-Stationary Wave Relaxation (NSWR) methods introduced in [41]. Since several truncation errors are involved in the iterative method, the study is accompanied by a careful convergence analysis, which ensures the good behavior of such algorithms in practice. The performance of these methods is illustrated with several numerical experiments applied to some of the systems of equations proposed in Section 2.

The paper is organized as follows. Section 2 states the framework of the present work, from both the continuous and the discrete point of view. In Section 3 we describe precisely the notation and the proposed method, and in Section 4 we present the main contributions of this paper, that is the theoretical analysis of the convergence of the proposed algorithms. Section 5 explains the technical resources involved in the implementation of the algorithms, and in Section 6 several numerical experiments show their good performance. Section 7 includes the most relevant conclusions of the present work and outlines future work.

## 2 Framework statement

### 2.1 Time continuous setting

First of all we fix the notation used throughout the following sections. Let us consider $$X=\mathbb {R}$$, and the system of linear Volterra equations (1) where $$\bar{u},\bar{f}:[0,T]\mapsto \mathbb {R}^D$$, $$D\in \mathbb {Z}^+$$ (which is expected to be large), $$\mathbb {R}^D$$ is normed by $$\Vert \cdot \Vert$$ (we specify below what $$\Vert \cdot \Vert$$ stands for: $$L_2$$-norm, sup norm,...), $$\{\textbf{K}(t)\}_{t\ge 0}\subset \mathcal{M}_{D\times D}(\mathbb {R})$$ is a time-dependent family of matrices, and $$\bar{u}_0\in \mathbb {R}^D$$ represents the initial data. In order to keep the initial data $$\bar{u}_0$$ separate from the source term $$\bar{f}$$, one may assume without loss of generality that $$\bar{f}(0)=\bar{0}$$.

Before going further in the paper we state conditions for the well-posedness of (1). In this regard consider the wide class of time-dependent operators $$\{\textbf{K}(t)\}_{t\ge 0}$$ element-wise of exponential growth, that is those for which there exist $$b,\beta >0$$ such that if $$\textbf{K}(t)=(k_{i,j}(t))_{1\le i,j\le D}$$, then

\begin{aligned} {\vert k_{i,j} (t) \vert } \le b\,\text{ e}^{-\beta t}, \qquad t>0. \end{aligned}
(5)

Notice that, for simplicity of notation and without loss of generality, we assume the same constants b and $$\beta >0$$ for each entry of $$\textbf{K}(t)$$. This means that $$\textbf{K}(t)$$, for $$t>0$$, admits an absolutely convergent element-wise Laplace transform in $$\mathcal{D}_{\textbf{K}}:=\{z\in \mathbb {C}: \text{ Re }(z)\ge \beta \}$$, denoted now and hereafter by $$\widetilde{\textbf{K}}(z) = (\widetilde{k}_{i,j}(z))_{1\le i,j\le D}$$, $$z\in \mathcal{D}_{\textbf{K}}$$, where $$\widetilde{k}_{i,j}(z) = \mathcal{L}(k_{i,j})(z)$$. This covers the time-dependent operators of the form $$\textbf{K}(t)=k(t) A$$ where k(t) is a function of exponential growth, and A stands for a matrix, e.g., the one corresponding to some classical spatial discretization of the Laplacian. If in addition $$\bar{f}:(0,+\infty )\rightarrow \mathbb {R}^D$$ is continuous, then (1) is well-posed [2]. Some assumptions might be relaxed; however, this framework is general enough for our purposes.

Apart from the model commented in Section 1 based on the Riemann–Liouville fractional integral, let us consider other examples fitting this context and whose implementations will serve to illustrate the performance of our methods.

1. Denote by $$\alpha :[0,T]\rightarrow \mathbb {R}^+$$ a positive function, $$\alpha (t)>0$$, whose Laplace transform exists and is denoted by $$\widetilde{\alpha }(z)$$, for $$z\in \mathcal{D}\supseteq \mathcal{D}_{\textbf{K}}$$. Define $$k:[0,T]\mapsto \mathbb {R}$$, the kernel associated with the fractional integral of time-varying order $$\alpha (t)>0$$, according to the following definition

\begin{aligned} \partial ^{-\alpha (t)} f(t) := (k*f)(t) = \int _0^t k(t-s)f(s)\,\text{ d }s,\quad t>0, \end{aligned}
(6)

where

\begin{aligned} k(t):=\mathcal{L}^{-1}(K)(t)\quad \text{ with }\quad K(z):= \frac{1}{z^{z\widetilde{\alpha }(z)}},\qquad z\in \mathcal{D}. \end{aligned}
(7)

Notice that if $$\alpha (t)$$ is constant, then definitions (4) and (6)–(7) completely agree. Consider now the equation (2) where $$K(t)=k(t)\Delta$$, and $$\Delta$$ stands for the one-dimensional Laplacian operator in [0, L], with homogeneous Neumann boundary conditions. Given $$D>0$$, and the mesh grid $$x_0=0<x_1<x_2<\ldots <x_{D}=L$$, $$x_j=jh$$, $$0\le j\le D$$ and $$h=L/D$$, the spatial discretization of (2) by means of a classical second order finite difference scheme gives rise to a finite dimensional operator A which is nothing but a $$(D+1)\times (D+1)$$ tridiagonal matrix. The Volterra equation (1) arises now with $$\textbf{K}(t)=k(t)A$$, and $$\bar{f}$$ standing for the source term f of (2) sampled on the spatial grid. Moreover, $$\bar{u}$$ stands for the approximation to the time continuous solution over the spatial grid, that is $$\bar{u}(t)=(u_j(t))_{0\le j\le D}$$ where $$u_j(t)\approx u(t,x_j)$$, $$0\le j \le D$$. Other definitions of fractional derivatives/integrals with order depending on time have been proposed in the literature. Some of these definitions are discussed in [9], where it is concluded that (6)–(7) might be considered the most convenient in many practical instances. This is why we adopt such a definition in this paper.

2. A kind of generalization of (3)–(4) arises in the framework of image processing [3]. Consider the two-dimensional Laplacian in $$[0,L]\times [0,L]$$, denoted again (if no confusion arises) by $$\Delta$$, with homogeneous Neumann boundary conditions. In [3] the Laplacian acts on $$u(x,y,t)$$ in the variables x and y, and $$u:[0,L]\times [0,L]\times [0,T]\mapsto \mathbb {R}$$ represents an original gray scale image $$u_0(x,y)$$ evolved up to the time level $$t>0$$ to become $$u(x,y,t)$$. Once one discretizes in space by a classical second order finite difference scheme with $$D+1$$ nodes along each coordinate axis, we have the mesh grid $$\{(x_i,y_j)\}_{0\le i,j \le D} = \{(ih,jh)\}_{0\le i,j \le D}$$, $$h=L/D$$, and $$u(x,y,t)$$, $$0\le x,y\le L$$, becomes a $$(D+1)\times (D+1)$$ matrix $$(u_{i,j}(t))_{0\le i,j\le D}$$ whose entries represent the approximations to the exact solution, that is $$u_{i,j}(t)\approx u(x_i,y_j,t)$$, for $$t>0$$, $$0\le i,j \le D$$. Therefore, if one reshapes such a matrix as a vector by stacking columns, then the discrete Laplacian, denoted again by A, reads as a $$(D+1)^2\times (D+1)^2$$ five-diagonal matrix, and $$(u_{i,j}(t))_{0\le i,j\le D}$$ as a $$(D+1)^2\times 1$$ vector. One of the novelties of this model is that it relates image processing and fractional calculus and allows one to apply a different diffusion rate at every single pixel $$(x_i,y_j)$$. Such rates are handled by a particular fractional kernel $$k_{i,j}(t)$$ of type (4) with order $$\alpha _{i,j}>0$$, $$0\le i,j\le D$$. In [3] the authors also consider the case of fractional orders depending on time, that is orders $$\alpha _{i,j}(t)$$, $$0\le i,j\le D$$, $$t>0$$, according to the definition (6)–(7).
In any case, that approach leads to a system of linear equations of type (1) where $$\textbf{K}(t)=I(t)A$$, $$I(t)=$$diag$$(k_j(t))_{1\le j\le (D+1)^2}$$, $$t\ge 0$$, the kernels $$k_j(t)$$ with orders $$\alpha _j(t)$$ are defined according to (6)–(7), A stands for the $$(D+1)^2\times (D+1)^2$$ five-diagonal matrix mentioned above, and $$\bar{u}_0$$ stands for the initial data $$u(x,y,0)$$, that is the original image to be restored, sampled on the spatial mesh.
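The two spatial operators described in the examples above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names are hypothetical, the homogeneous Neumann conditions are assumed to be imposed through ghost nodes (one common second order choice, not necessarily the one used by the authors), and dense matrices are used for readability although a sparse format would be preferable for large D.

```python
import numpy as np

def neumann_laplacian_1d(D, L=1.0):
    """(D+1)x(D+1) second order finite-difference Laplacian on [0, L].

    Assumption: homogeneous Neumann conditions are imposed with ghost
    nodes, u_{-1} = u_1 and u_{D+1} = u_{D-1}.
    """
    h = L / D
    A = np.zeros((D + 1, D + 1))
    idx = np.arange(D + 1)
    A[idx, idx] = -2.0
    A[idx[:-1], idx[:-1] + 1] = 1.0   # superdiagonal
    A[idx[1:], idx[1:] - 1] = 1.0     # subdiagonal
    A[0, 1] = 2.0                     # left ghost node folded in
    A[D, D - 1] = 2.0                 # right ghost node folded in
    return A / h**2

def neumann_laplacian_2d(D, L=1.0):
    """(D+1)^2 x (D+1)^2 Laplacian acting on the column-stacked grid values."""
    A1 = neumann_laplacian_1d(D, L)
    I = np.eye(D + 1)
    # With column stacking, kron(I, A1) differentiates along the first grid
    # index and kron(A1, I) along the second one.
    return np.kron(I, A1) + np.kron(A1, I)

def nonzero_diagonals(A):
    """Sorted offsets (column minus row) of the diagonals holding nonzeros."""
    rows, cols = np.nonzero(A)
    return sorted(set((cols - rows).tolist()))

A1 = neumann_laplacian_1d(4)
A2 = neumann_laplacian_2d(4)
print(nonzero_diagonals(A1), nonzero_diagonals(A2))
```

The helper `nonzero_diagonals` makes the structure visible: three diagonals in one dimension and five in two dimensions, the outer ones at offsets of plus and minus $$D+1$$.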

### 2.2 Time discrete setting

In Section 1 we mentioned that a large variety of numerical schemes for the time discretization of Volterra equations (1) exists in the literature. In this paper, however, we focus on the convolution quadrature based methods introduced by Ch. Lubich in [15].

In particular a first order numerical scheme is considered here, and the reason is twofold. On the one hand, the lack of time regularity observed in the solutions of most singular Volterra equations has proved to be a barrier to higher order time discretizations. In particular, for fractional equations of type (2)–(4), it is well known that second order convergence is attainable if $$1<\alpha <2$$, but this is not obvious at all if the equation involves some non-linear term. The situation is even worse if $$0<\alpha <1$$, for which second order is hardly reached [13]. The same applies to the solutions of the equations with fractional order depending on time (6)–(7), whose regularity is precisely studied in [9]. On the other hand, the choice of a first order numerical scheme does not entail a loss of generality in our study; on the contrary, the study might be straightforwardly extended to higher order numerical schemes if the regularity of the corresponding continuous solutions is high enough.

Therefore we adopt the backward Euler based convolution quadrature method as the time discretization for (1). Since its stability, convergence, and other properties have already been studied [14,15,16,17,18,19,20,21,22,23,24,25,26,27], we merely recall its formulation here. Let $$\tau >0$$ be the time step of the discretization, $$\tau =T/N$$, and $$t_n=n\tau$$, for $$0\le n \le N$$. Moreover, denote by q(z) the quotient of the generating polynomials of the associated linear multistep method, which in the case of the backward Euler method is simply $$q(z)=1-z$$. Accordingly, the time discretization of (1) reads

\begin{aligned} \bar{u}_n = \bar{u}_0 + \sum _{j=0}^{n} \textbf{K}_{n-j} \bar{u}_j + \bar{f}_n,\qquad 1\le n\le N, \end{aligned}
(8)

where $$\bar{u}_n\in \mathbb {R}^D$$ stands for the approximation to $$\bar{u}(t_n)$$, $$\bar{f}_n:=\bar{f}(t_n)\in \mathbb {R}^D$$, $$0\le n \le N$$, and the convolution weights $$\textbf{K}_j\in \mathcal{M}_{D\times D}(\mathbb {R})$$ are element-wise given by the expansion

\begin{aligned} \widetilde{\textbf{K}}\left( \frac{q(z)}{\tau }\right) = \sum _{n=0}^{+\infty } \textbf{K}_n z^n. \end{aligned}
(9)

For example, if $$\textbf{K}(t)=k(t) A$$ with $$k(t)=t^{\alpha -1}/\Gamma (\alpha )$$, $$\alpha >0$$, and $$A\in \mathcal{M}_{D\times D}(\mathbb {R})$$, then the convolution weights take the form

\begin{aligned} \textbf{K}_n = k_nA, \end{aligned}
(10)

where

\begin{aligned} k_n=\tau ^\alpha (-1)^n \binom{-\alpha }{n} = \tau ^\alpha \frac{\alpha (\alpha +1)\cdots (\alpha +n-1)}{n!},\quad n\ge 1\quad (k_0=\tau ^\alpha ). \end{aligned}
(11)

Let us highlight that, although the convolution weights (9) may be computed directly from a closed-form expression such as (10)–(11), this task requires a very high computational cost. In practice that computation is instead performed by means of the Fast Fourier Transform (FFT), not only for kernels of fractional type but also for any kernel whose Laplace transform is available, such as the ones we consider in this paper.
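A minimal sketch of the FFT-based computation of the weights in (9) is given next. It assumes the standard approach of approximating the Cauchy integral for the Taylor coefficients by the trapezoidal rule on a circle of radius $$\rho <1$$; the function names, the choice $$\rho =\varepsilon ^{1/(2L)}$$ with $$\varepsilon$$ the machine precision, and the number of quadrature nodes $$L=N+1$$ are illustrative assumptions rather than the authors' implementation. For the scalar kernel $$k(t)=t^{\alpha -1}/\Gamma (\alpha )$$, whose Laplace transform is $$z^{-\alpha }$$, the result is compared with the coefficients of $$\tau ^\alpha (1-z)^{-\alpha }$$, which satisfy $$k_0=\tau ^\alpha$$ and $$k_n=k_{n-1}(n-1+\alpha )/n$$.

```python
import numpy as np

def cq_weights_fft(K_hat, tau, N, eps=np.finfo(float).eps):
    """Approximate the backward Euler convolution quadrature weights
    k_0, ..., k_N of a scalar kernel with Laplace transform K_hat.

    The Taylor coefficients of K_hat((1 - z)/tau) are recovered from
    samples on the circle |z| = rho by one FFT of length L = N + 1.
    """
    L = N + 1
    rho = eps ** (1.0 / (2 * L))                 # assumed radius choice
    zeta = rho * np.exp(2j * np.pi * np.arange(L) / L)
    samples = K_hat((1.0 - zeta) / tau)          # q(z) = 1 - z
    coeffs = np.fft.fft(samples) / L             # approx. k_n * rho**n
    scale = rho ** (-np.arange(L, dtype=float))  # undo the rho**n factor
    return (coeffs * scale).real

def rl_weights_exact(alpha, tau, N):
    """Coefficients of tau**alpha * (1 - z)**(-alpha) by recurrence."""
    k = np.empty(N + 1)
    k[0] = tau ** alpha
    for n in range(1, N + 1):
        k[n] = k[n - 1] * (n - 1 + alpha) / n
    return k

alpha, tau, N = 0.5, 0.01, 63
w_fft = cq_weights_fft(lambda z: z ** (-alpha), tau, N)
w_ref = rl_weights_exact(alpha, tau, N)
print("max relative difference:", np.max(np.abs(w_fft - w_ref) / w_ref))
```

For a kernel of the form $$\textbf{K}(t)=k(t)A$$ the matrix weights follow as $$\textbf{K}_n=k_nA$$, so only the scalar weights need to be computed. With this choice of $$\rho$$ the attainable accuracy is roughly the square root of the machine precision, which is the usual trade-off between aliasing and round-off amplification.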

For the convenience of the reader we recall a result on the convolution weights (9) which will be very useful in the proofs of our results (see Corollary 3.2 in [16]): there exists $$C>0$$ depending on $$\textbf{K}$$ and T, but independent of $$\tau$$ and n, such that the convolution weights in (9) satisfy the bound

\begin{aligned} \Vert \textbf{K}_n\Vert \le C \tau ,\qquad 1\le n\le N. \end{aligned}
(12)

Finally notice that, without loss of generality, now and hereafter we consider $$\mathcal{M}_{D\times D}(\mathbb {R})$$ normed by the matrix sup norm, denoted, if no confusion arises, merely by $$\Vert \cdot \Vert$$.

## 3 Discrete NSWR methods for linear Volterra equations

The Non-Stationary Wave Relaxation (NSWR) methods for integral equations of type

\begin{aligned} y(t) = y_0 + \int _0^t k(t,s,y(s))\,\text{ d }s + f(t),\qquad t>0, \end{aligned}
(13)

have been introduced in [41] and consist of iterative methods that provide a time-dependent sequence $$\{y^{(\eta )}(t)\}_{\eta \ge 0}$$ through the recurrence

\begin{aligned} y^{(\eta )}(t) = y_0 + \int _0^t \mathcal{F}^{(\eta )} (t,s,y^{(\eta -1)}(s),y^{(\eta )}(s)) \,\text{ d }s + f(t),\qquad t>0,\quad \eta =1,2,\ldots , \end{aligned}
(14)

where the set of functions $$\mathcal{F}^{(\eta )}:\mathbb {R}^+\times \mathbb {R}^+ \times \mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}$$ satisfy the equality

\begin{aligned} \mathcal{F}^{(\eta )} (t,s,y,y) = k(t,s,y),\qquad \eta =1,2,\ldots . \end{aligned}
(15)

The following facts have already been noticed in [36]. In particular, if one applies the iterative process over the whole time interval, then the convergence of the method slows down noticeably as the number of time nodes grows. To overcome this problem, the solution adopted in [36], and the one we adopt here, consists of splitting the whole interval [0, T] into sub-intervals (windows) of length $$b\tau$$, $$b\in \mathbb {Z}^+$$, so that

\begin{aligned}{}[0,T] = \bigcup _{j=0}^{R-1}[t_{jb},t_{(j+1)b}] = \bigcup _{j=0}^{R-1}[jb\tau ,jb\tau +b\tau ], \quad \text{ with }\quad Rb\tau =T. \end{aligned}
(16)

The iterative method (14) reads now, for $$t\in (rb\tau ,rb\tau +b\tau ]$$,

\begin{aligned} y^{(\eta )}(t) = y_0 + \int _0^{rb\tau } k(t,s,y(s)) \,\text{ d }s + \int _{rb\tau }^t \mathcal{F}^{(\eta )} (t,s,y^{(\eta -1)}(s),y^{(\eta )}(s)) \,\text{ d }s + f(t),\qquad \eta =1,2,\ldots \end{aligned}
(17)

In the case of the system (1), the method (17) reads

\begin{aligned} \bar{u}^{(\eta )}(t) = \bar{u}_0 + \int _0^{rb\tau } \textbf{K}(t-s)\bar{u}(s) \,\text{ d }s + \int _{rb\tau }^t \mathcal{F}^{(\eta )} (t-s,\bar{u}^{(\eta -1)}(s),\bar{u}^{(\eta )}(s)) \,\text{ d }s + \bar{f}(t), \end{aligned}
(18)

for $$\eta =1,2,\ldots$$, where, for simplicity of notation, and bearing in mind that the equations considered in the present paper are of convolution type, we have replaced $$\mathcal{F}^{(\eta )} (t,s,\bar{u},\bar{v})$$ by $$\mathcal{F}^{(\eta )} (t-s,\bar{u},\bar{v})$$, that is $$\mathcal{F}^{(\eta )}$$ is now a function $$\mathcal{F}^{(\eta )}:\mathbb {R}^+ \times \mathbb {R}^D \times \mathbb {R}^D \rightarrow \mathbb {R}^D$$. The condition (15) then reads

\begin{aligned} \mathcal{F}^{(\eta )} (t-s,\bar{u},\bar{u}) = \textbf{K}(t-s)\bar{u},\qquad 0<s<t,\quad \eta =1,2,\ldots \end{aligned}
(19)

We now discretize (18) according to the numerical scheme (8)–(9) on the time mesh $$\{t_n\}_{1\le n\le N}$$, giving rise to the discrete NSWR method

\begin{aligned} \bar{u}_n^{(\eta )} = \bar{u}_0 + \sum _{j=0}^{rb} \textbf{K}_{n-j} \bar{u}_j + \sum _{j=rb+1}^{n} \mathcal{F}_{n-j}^{(\eta )}(\bar{u}_j^{(\eta -1)},\bar{u}_j^{(\eta )}) + \bar{f}_n, \qquad \eta \ge 1, \end{aligned}
(20)

for $$t_n\in (rb\tau ,rb\tau +b\tau ]$$. Once again for the sake of the simplicity of the notation we have replaced $$\mathcal{F}^{(\eta )}(t_n-t_j, \bar{u}_j^{(\eta -1)},\bar{u}_j^{(\eta )})$$ by $$\mathcal{F}_{n-j}^{(\eta )}(\bar{u}_j^{(\eta -1)},\bar{u}_j^{(\eta )})$$.

The iterative method (20) provides at every single time step $$t_n=n\tau$$ a sequence of vectors $$\{\bar{u}_n^{(\eta )}\}_{\eta \ge 0}$$, for which the number of iterations required to achieve the prescribed accuracy at time level $$t_n$$ depends solely on the window containing $$t_n$$, not on the time step $$t_n$$ itself. That is, for the r-th window, $$0\le r\le R-1$$, there is a threshold $$m_r>0$$ such that the iterative method reaches the required accuracy within at most $$m_r$$ iterations. Therefore we adopt $$\bar{u}_n^{(m_r)}$$ as the numerical approximation to $$\bar{u}(t_n)$$, that is $$\bar{u}_n^{(m_r)}\approx \bar{u}(t_n)$$, for every $$t_n\in (t_{rb},t_{(r+1)b}]$$, $$1\le n\le N$$, and $$0\le r \le R-1$$.

A convenient choice of the operators $$\mathcal{F}_j^{(\eta )}:\mathbb {R}^D\times \mathbb {R}^D\mapsto \mathbb {R}^D$$, for $$0\le j\le N$$, and $$\eta \ge 0$$, will allow the parallelization of the method by decoupling the system at every single time level $$t_n$$ into two independent systems. These operators are required to satisfy two hypotheses:

• [H1] All operators are Lipschitz in both variables, that is there exists $$L>0$$ such that

\begin{aligned} \Vert \mathcal{F}_j^{(\eta )}(\bar{u},\cdot )-\mathcal{F}_j^{(\eta )}(\bar{v},\cdot )\Vert \le L \Vert \bar{u}-\bar{v}\Vert ,\quad \Vert \mathcal{F}_j^{(\eta )}(\cdot ,\bar{u})-\mathcal{F}_j^{(\eta )}(\cdot ,\bar{v})\Vert \le L \Vert \bar{u}-\bar{v}\Vert ,\qquad \forall \bar{u},\bar{v}\in \mathbb {R}^D, \end{aligned}
(21)

for $$0\le j\le N$$ and $$\eta \ge 0$$. Without loss of generality, we assume a common Lipschitz constant $$L>0$$ for both variables.

• [H2] For $$0\le j\le N$$, and $$\eta \ge 0$$, there holds the equality

\begin{aligned} \mathcal{F}_j^{(\eta )} (\bar{u},\bar{u}) = \textbf{K}_j \bar{u},\qquad \forall \bar{u}\in \mathbb {R}^D,\quad 0\le j\le N, \quad \eta \ge 0, \end{aligned}
(22)

where $$\{\textbf{K}_j\}_{j\ge 0}$$ stand for the quadrature weights defined in (9).

For instance, the family of linear operators $$\{\mathcal{F}_j^{(\eta )}\}_{0\le j \le N, \eta \ge 0}$$, defined by

\begin{aligned} \mathcal{F}_j^{(\eta )}(\bar{u},\bar{v}) = N_j^{(\eta )}\bar{u}+ M_j^{(\eta )}\bar{v}, \end{aligned}
(23)

is probably the simplest one satisfying [H1], and for a convenient choice of $$N_j^{(\eta )}$$ and $$M_j^{(\eta )}$$ also [H2]. In particular, we focus our attention on the discrete NSWR (20) where

\begin{aligned} \mathcal{F}_j^{(\eta )}(\bar{u},\bar{v}) = N_j^{(\eta )}\bar{u}+ M_j^{(\eta )}\bar{v},\qquad \eta \ge 0, \end{aligned}
(24)

with

\begin{aligned} M_j^{(\eta )} = \mu _j^{(\eta )} I \quad \text{ and }\quad N_j^{(\eta )} = \textbf{K}_j - M_j^{(\eta )},\qquad 0\le j\le N,\quad \eta \ge 0, \end{aligned}
(25)

for certain values $$\{\mu ^{(\eta )}\}_{\eta \ge 0}$$ to be determined later, and where I is the $$D\times D$$ identity matrix. In our approach we drop the dependence of $$\mu _j^{(\eta )}$$ on j, so we write $$\mu ^{(\eta )}$$ instead of $$\mu _j^{(\eta )}$$, and $$M^{(\eta )}$$ instead of $$M_j^{(\eta )}$$.

Finally, consider the window length equal to $$\tau$$, that is $$b=1$$. In this case the discrete NSWR method is the so-called discrete time-point (or time step) NSWR method and reads

\begin{aligned} \bar{u}_n^{(\eta )} = \bar{u}_0 + \sum _{j=0}^{n-1} \textbf{K}_{n-j}\bar{u}_j + \mathcal{F}^{(\eta )}(\bar{u}_n^{(\eta -1)},\bar{u}_n^{(\eta )}) + \bar{f}_n, \quad \eta \ge 1. \end{aligned}
(26)

Observe that, since only $$\mathcal{F}_0^{(\eta )}$$ appears in (26), we denote this operator simply by $$\mathcal{F}^{(\eta )}$$. In the same manner, only $$N_0^{(\eta )}$$ and $$M_0^{(\eta )}$$ are involved, and they are denoted by $$N^{(\eta )}$$ and $$M^{(\eta )}$$ respectively. The time step NSWR method we propose reads

\begin{aligned} \bar{u}_n^{(\eta )} = \bar{u}_0 + \sum _{j=0}^{n-1} \textbf{K}_{n-j}\bar{u}_j + (\textbf{K}_0-\mu ^{(\eta )} I)\bar{u}_n^{(\eta -1)} + \mu ^{(\eta )} I \bar{u}_n^{(\eta )} + \bar{f}_n, \quad \eta \ge 1. \end{aligned}
(27)

The class of iterative methods defined by (23)–(26) is known as Richardson discrete time-point NSWR methods, and they are the subject of our study in this paper.

Before ending this section several comments are in order. First, if $$\mu ^{(\eta )}=0$$ for all $$\eta$$, then the iterative method (23)–(26) reduces to the functional iteration method.

Moreover, although we consider here a time-point NSWR method, the results and the ideas of the proofs extend straightforwardly to larger windows, $$b>1$$.

Finally, observe that the choice (24)–(25) allows decoupling the $$D\times D$$ system (26) into D linear equations which can be solved independently; in other words, it allows the parallelization of the algorithm by solving D decoupled equations independently.
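The decoupling can be made concrete with a small serial sketch of the time-point iteration (27). Since $$M^{(\eta )}=\mu ^{(\eta )}I$$, each sweep amounts to $$(1-\mu ^{(\eta )})\bar{u}_n^{(\eta )} = \bar{g}_n + (\textbf{K}_0-\mu ^{(\eta )}I)\bar{u}_n^{(\eta -1)}$$, where $$\bar{g}_n$$ collects the initial data, the history sum and the source term, so every component is updated independently of the others. The code below is a NumPy illustration, not the GPU implementation of the paper; the function name, the constant parameter `mu` (the same value for every $$\eta$$), the initial guess $$\bar{u}_n^{(0)}=\bar{u}_{n-1}$$, and the stopping rule based on the size of the increment are all assumptions made for the example.

```python
import numpy as np

def richardson_nswr(K, u0, f, mu, tol=1e-12, max_iter=10_000):
    """Time-point Richardson NSWR sweep for scheme (8), following (27).

    K  : array of shape (N+1, D, D) with the weights K_0, ..., K_N
    u0 : initial data, shape (D,)
    f  : source samples f_0, ..., f_N, shape (N+1, D)
    mu : relaxation parameter, kept constant over the iterations (assumed)
    """
    N = K.shape[0] - 1
    u = np.zeros((N + 1, u0.size))
    u[0] = u0
    iters = np.zeros(N + 1, dtype=int)
    for n in range(1, N + 1):
        # Known part: initial data, history sum over j < n, source term.
        g = u0 + f[n] + sum(K[n - j] @ u[j] for j in range(n))
        v = u[n - 1].copy()          # assumed initial guess u_n^{(0)}
        for eta in range(1, max_iter + 1):
            # Each entry of v_new depends only on the previous iterate v,
            # so the D scalar updates could run in parallel.
            v_new = (g + K[0] @ v - mu * v) / (1.0 - mu)
            converged = np.linalg.norm(v_new - v, np.inf) <= tol
            v = v_new
            if converged:
                break
        u[n] = v
        iters[n] = eta
    return u, iters
```

When the iteration converges, its limit solves $$(I-\textbf{K}_0)\bar{u}_n=\bar{g}_n$$, which is scheme (8) at level $$t_n$$. Convergence requires the spectral radius of $$(\textbf{K}_0-\mu I)/(1-\mu )$$ to be smaller than one, which is where the choice of the parameters $$\mu ^{(\eta )}$$ enters.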

## 4 Main results

The first issue to be observed is that, as typically occurs with iterative methods, the method (23)–(26) does not reach the exact solution $$\bar{u}_n$$ at any time step $$t_n$$ within a finite number of iterations, even in cases where the exact solution could theoretically be achieved. This is mainly due to round-off errors. But round-off errors are not the only issue: even if the exact solution can theoretically be achieved in a finite number of iterations, such a number may be so large that the computational effort becomes unaffordable; in other words, the required number of iterations might be so large as to make the computational advantages of the parallel approach negligible. Keeping this in mind, assume that, for $$1\le n \le N$$, we merely have an approximation $$\bar{u}_n^{(m_n)}$$ instead of the exact value $$\bar{u}_n$$. The numerical solution $$\bar{u}_n^{(m_n)}$$ will stand for the actual numerical approximation to $$\bar{u}(t_n)$$, $$1\le n\le N$$, under a previously stated tolerance.

Observe that for the Richardson time step NSWR methods, which is our case, the number of iterations in each window obviously coincides with the number of iterations at each single time level.

The first result of this section concerns the form the error takes for the numerical scheme (23)–(26) so firstly denote the error

\begin{aligned} \bar{e}_n^{(\eta )} := \bar{u}_n - \bar{u}_n^{(\eta )},\qquad 1\le n\le N,\quad \eta \ge 0. \end{aligned}
(28)

Denote also

\begin{aligned} P_{\eta }(x) =\prod _{r=1}^\eta (x-\mu ^{(r)}),\qquad \eta \ge 0, \end{aligned}
(29)

where $$\{\mu ^{(r)}\}_{r\ge 0}$$ is the set of values involved in (25). These values are discussed below. We allow (with abuse of notation) $$P_{\eta }(x)$$ to be evaluated at a matrix argument. We then have the following lemma.

### Lemma 1

The error (28) produced by the numerical scheme (23)–(26) can be written as

\begin{aligned} \bar{e}_n^{(\eta )} = A^{(\eta )} \left\{ B^{(\eta )}\sum _{j=1}^{n-1}\textbf{K}_{n-j}\bar{e}_j^{(m_j)} + \bar{e}_n^{(0)} \right\} , \qquad \eta \ge 1, \end{aligned}
(30)

where

\begin{aligned} A^{(\eta )} = \frac{P_{{\eta }}(\textbf{K}_0)}{P_{\eta }(1)}, \qquad B^{(\eta )} = \sum _{j=1}^{\eta } P_{j-1}(1) P_{j}(\textbf{K}_0)^{-1},\quad \eta \ge 1. \end{aligned}
(31)

### Proof

The proof straightforwardly follows by solving some difference equations. We consider first the case $$n=1$$, for $$\eta \ge 0$$. From (28) and (24)–(25) we have

\begin{aligned} \bar{e}_1^{(\eta )}= & {} \big \{\bar{u}_0 + \textbf{K}_1 \bar{u}_0 + \textbf{K}_0\bar{u}_1\big \} -\big \{\bar{u}_0 + \textbf{K}_1\bar{u}_0 + \mathcal{F}^{(\eta )}(\bar{u}_1^{(\eta -1)},\bar{u}_1^{(\eta )})\big \}\\= & {} \textbf{K}_0\bar{u}_1 -\big \{ N^{(\eta )} \bar{u}_1^{(\eta -1)} + M^{(\eta )}\bar{u}_1^{(\eta )}\big \}. \end{aligned}

Definition (25), and adding and subtracting $$\textbf{K}_0 \bar{u}_1^{(\eta )}$$, lead to

\begin{aligned} \bar{e}_1^{(\eta )}= & {} \big \{ \textbf{K}_0\bar{u}_1-\mu ^{(\eta )}\bar{u}_1^{(\eta )} \big \} - (\textbf{K}_0-\mu ^{(\eta )} I) \bar{u}_1^{(\eta -1)}\\= & {} \textbf{K}_0 \big \{\bar{u}_1 - \bar{u}_1^{(\eta )}\big \} + \big \{\textbf{K}_0-\mu ^{(\eta )}I\big \} \bar{u}_1^{(\eta )} - \big \{\textbf{K}_0-\mu ^{(\eta )} I\big \} \bar{u}_1^{(\eta -1)}. \end{aligned}

Finally, adding and subtracting $$N^{(\eta )} \bar{u}_1$$ leads to

\begin{aligned} \bar{e}_1^{(\eta )}= & {} \textbf{K}_0 \bar{e}_1^{(\eta )} - N^{(\eta )} \bar{e}_1^{(\eta )} + N^{(\eta )}\bar{e}_1^{(\eta -1)}\\= & {} M^{(\eta )}\bar{e}_1^{(\eta )} + N^{(\eta )}\bar{e}_1^{(\eta -1)}. \end{aligned}

Therefore, recursively and by definitions (25) it follows that

\begin{aligned} \bar{e}_1^{(\eta )}= & {} \Big (I - M^{(\eta )}\Big )^{-1} N^{(\eta )} \bar{e}_1^{(\eta -1)}\\= & {} \frac{1}{1-\mu ^{(\eta )}} \Big (\textbf{K}_0 - \mu ^{(\eta )}I\Big ) \bar{e}_1^{(\eta -1)}\\&\vdots&\\= & {} \left\{ \prod _{j=1}^{\eta } \frac{1}{1-\mu ^{(j)}} \Big (\textbf{K}_0 - \mu ^{(j)}I\Big ) \right\} \bar{e}_1^{(0)}\\= & {} \frac{P_{\eta }(\textbf{K}_0)}{P_{\eta }(1)} \bar{e}_1^{(0)}. \end{aligned}
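The identity just obtained for $$n=1$$ can be checked numerically. The following sketch runs the iteration (27) at the first time level with a different parameter $$\mu ^{(\eta )}$$ at each sweep and returns the errors, which can then be compared with $$P_{\eta }(\textbf{K}_0)\bar{e}_1^{(0)}/P_{\eta }(1)$$. The function names, the random test matrices and the parameter values are hypothetical and chosen only for illustration; the source term is taken to be zero.

```python
import numpy as np

def P(mus, X):
    """Evaluate P_eta(X) = prod_r (X - mu^{(r)}) for a scalar or a
    square matrix X (the factors commute, so the order is irrelevant)."""
    if np.isscalar(X):
        out = 1.0
        for m in mus:
            out *= (X - m)
        return out
    I = np.eye(X.shape[0])
    out = I.copy()
    for m in mus:
        out = out @ (X - m * I)
    return out

def first_step_errors(K0, K1, u0, u_init, mus):
    """Errors e_1^{(0)}, e_1^{(1)}, ... of iteration (27) at n = 1, f = 0."""
    I = np.eye(u0.size)
    g = u0 + K1 @ u0                          # known part at n = 1
    u_exact = np.linalg.solve(I - K0, g)      # u_1 from scheme (8)
    v = u_init.copy()
    errors = [u_exact - v]
    for mu in mus:
        v = (g + (K0 - mu * I) @ v) / (1.0 - mu)
        errors.append(u_exact - v)
    return errors

rng = np.random.default_rng(0)
K0 = 0.2 * rng.standard_normal((4, 4))
K1 = 0.1 * rng.standard_normal((4, 4))
u0 = rng.standard_normal(4)
mus = [0.1, -0.2, 0.3]
errs = first_step_errors(K0, K1, u0, u0, mus)
for eta in range(1, len(mus) + 1):
    predicted = P(mus[:eta], K0) @ errs[0] / P(mus[:eta], 1.0)
    print(eta, np.max(np.abs(errs[eta] - predicted)))
```

The identity is purely algebraic, so it holds whether or not the iteration converges; convergence depends on how the values $$\mu ^{(\eta )}$$ shrink $$P_{\eta }(\textbf{K}_0)/P_{\eta }(1)$$.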

For $$n>1$$ and $$\eta \ge 0$$, in a similar manner as for $$n=1$$, we have again from (28) and (24)–(25)

\begin{aligned} \bar{e}_n^{(\eta )}= & {} \big \{ \bar{u}_0 + \textbf{K}_n\bar{u}_0 + \textbf{K}_{n-1} \bar{u}_1 +\textbf{K}_{n-2} \bar{u}_2 + \cdots +\textbf{K}_1 \bar{u}_{n-1} + \textbf{K}_0 \bar{u}_n \big \}\nonumber \\{} & {} - \left\{ \bar{u}_0 + \textbf{K}_n\bar{u}_0 + \textbf{K}_{n-1} \bar{u}_1^{(m_1)} + \textbf{K}_{n-2} \bar{u}_2^{(m_2)}\right. \\{} & {} \quad \left. + \cdots +\textbf{K}_{1} \bar{u}_{n-1}^{(m_{n-1})} + \mathcal{F}^{(\eta )}(\bar{u}_n^{(\eta -1)},\bar{u}_n^{(\eta )})\right\} \nonumber \\= & {} \sum _{j=1}^{n-1}\textbf{K}_j (\bar{u}_{n-j}- \bar{u}_{n-j}^{(m_{n-j})}) + \textbf{K}_0 \bar{u}_n - \big \{ N^{(\eta )}\bar{u}_n^{(\eta -1)} + M^{(\eta )} \bar{u}_n^{(\eta )}\big \}\nonumber \\ \end{aligned}

Now, adding and subtracting $$\textbf{K}_0 \bar{u}_n^{(\eta )}$$ first, then adding and subtracting $$N^{(\eta )} \bar{u}_n$$, together with definition (25), leads to

\begin{aligned} \bar{e}_n^{(\eta )}= & {} \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + \textbf{K}_0 (\bar{u}_n - \bar{u}_n^{(\eta )}) + (\textbf{K}_0 -\mu ^{(\eta )}I) \bar{u}_n^{(\eta )} - (\textbf{K}_0-\mu ^{(\eta )} I) \bar{u}_n^{(\eta -1)}\nonumber \\= & {} \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + \textbf{K}_0 \bar{e}_n^{(\eta )} - N^{(\eta )}\bar{e}_n^{(\eta )} + N^{(\eta )} \bar{e}_n^{(\eta -1)}\nonumber \\= & {} \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + M^{(\eta )}\bar{e}_n^{(\eta )} + N^{(\eta )} \bar{e}_n^{(\eta -1)}, \end{aligned}
(32)

where $$\bar{e}_j^{(m_j)}$$ stands for the error of the iterative process at time step $$t_j$$, $$1\le j\le N$$, after the $$m_j$$ iterations needed to reach the threshold. Therefore, from (32) we have

\begin{aligned} \bar{e}_n^{(\eta )}= & {} \left( I-M^{(\eta )}\right) ^{-1}\left\{ \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + N^{(\eta )} \bar{e}_n^{(\eta -1)}\right\} \nonumber \\= & {} \frac{1}{1-\mu ^{(\eta )}} \left\{ \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + \Big (\textbf{K}_0 - \mu ^{(\eta )} I\Big ) \bar{e}_n^{(\eta -1)}\right\} . \end{aligned}
(33)

The solution of the difference equation (33) can be written straightforwardly as

\begin{aligned} \bar{e}_n^{(\eta )}= & {} \left\{ \sum _{j=1}^{\eta } \frac{1}{1-\mu ^{(j)}}\left( \prod _{k=j+1}^{\eta }\frac{1}{1-\mu ^{(k)}}\left( \textbf{K}_0-\mu ^{(k)}I\right) \right) \right\} \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} \\{} & {} + \left( \prod _{j=1}^{\eta }\frac{1}{1-\mu ^{(j)}}\left( \textbf{K}_0-\mu ^{(j)}I\right) \right) \bar{e}_n^{(0)}\\= & {} \left\{ \sum _{j=1}^{\eta } \frac{P_{j-1}(1)}{P_{\eta }(1)} P_{\eta }(\textbf{K}_0) P_{j}(\textbf{K}_0)^{-1}\right\} \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + \frac{P_{{\eta }}(\textbf{K}_0)}{P_{\eta }(1)} \bar{e}_n^{(0)}\\= & {} \frac{P_{{\eta }}(\textbf{K}_0)}{P_{\eta }(1)} \left\{ \left\{ \sum _{j=1}^{\eta } P_{j-1}(1) P_{j}(\textbf{K}_0)^{-1}\right\} \sum _{j=1}^{n-1}\textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})} + \bar{e}_n^{(0)} \right\} . \end{aligned}

Therefore the statement of the lemma follows.
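The agreement between the recursion (33) and its closed-form solution can be checked numerically. The following is a minimal sketch in the scalar case, assuming $$\textbf{K}_0$$ reduces to a number `k0` and the memory term $$\sum_j \textbf{K}_j \bar{e}_{n-j}^{(m_{n-j})}$$ to a number `history`; the function names and the sample values are hypothetical and serve only as illustration.

```python
from math import prod


def error_by_recursion(k0, mus, history, e0):
    """Apply the recursion (33) once per parameter mu^(eta) (scalar case)."""
    e = e0
    for mu in mus:
        e = (history + (k0 - mu) * e) / (1.0 - mu)
    return e


def error_closed_form(k0, mus, history, e0):
    """Evaluate the closed-form solution of (33) (scalar case)."""
    eta = len(mus)

    def factor(k):
        # (k0 - mu^(k)) / (1 - mu^(k)) with the 1-based index k used in the text
        return (k0 - mus[k - 1]) / (1.0 - mus[k - 1])

    weight = sum(
        prod(factor(k) for k in range(j + 1, eta + 1)) / (1.0 - mus[j - 1])
        for j in range(1, eta + 1)
    )
    return weight * history + prod(factor(j) for j in range(1, eta + 1)) * e0


# Illustrative values only (not taken from the experiments).
sample_mus = [-0.1, -0.2, -0.5]
recursive_value = error_by_recursion(-0.3, sample_mus, 0.05, 1.0)
closed_value = error_closed_form(-0.3, sample_mus, 0.05, 1.0)
```

Both values coincide up to round-off, which mirrors the unrolling of the difference equation used in the proof.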

In view of the error expression (30)–(31), the operator $$\textbf{K}_0$$ plays a key role. Note also that $$\textbf{K}_0$$ can be expressed as

\begin{aligned} \textbf{K}_0 = \widetilde{\textbf{K}}\left( \frac{q(0)}{\tau }\right) = \widetilde{\textbf{K}}\left( \frac{1}{\tau }\right) , \end{aligned}
(34)

in particular if $$\{\textbf{K}(t)\}_{t\ge 0}$$ are the operators defined in (10), then $$\textbf{K}_0=\tilde{k}(1/\tau ) A = \tau ^\alpha A$$.

This suggests that the choice of the parameters $$\{\mu ^{(r)}\}_{r\ge 0}$$ in (29) must be related to the eigenvalues of $$\textbf{K}_0$$, as is indeed the case in our approach. In this regard, denote the spectrum of $$\textbf{K}_0$$ by

\begin{aligned} \sigma (\textbf{K}_0):=\{\lambda _r\in \mathbb {C}: \lambda _r \text{ is } \text{ an } \text{ eigenvalue } \text{ of } \, \textbf{K}_0, 1\le r\le D\}\subset \mathbb {C}, \end{aligned}
(35)

and assume that the eigenvalues are sorted in increasing order of absolute value. For simplicity of notation, and without loss of generality, assume also that all of them have multiplicity one, that is $$0 \le \Vert \lambda _1 \Vert< \Vert \lambda _2 \Vert< \Vert \lambda _3 \Vert< \ldots < \Vert \lambda _D \Vert$$. Then let us consider in (29)

\begin{aligned} \mu ^{(r)} = \lambda _r,\qquad \lambda _r\in \sigma (\textbf{K}_0),\quad \text{ for }\quad 1\le r\le D. \end{aligned}
(36)

Notice that in that case the iterative method (26) reaches the exact solution, at least theoretically, in a finite number of iterations since

\begin{aligned} P_j(\textbf{K}_0) = 0\in \mathcal{M}_{D\times D},\quad \text{ if }\quad j\ge D. \end{aligned}
(37)
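The finite termination property (37) can be observed on a small example. The sketch below assumes, purely for illustration, that $$\textbf{K}_0$$ is a scaled tridiagonal matrix $$\text{tridiag}(1,-2,1)$$ of size $$D=6$$, whose eigenvalues are explicitly known; the names `laplacian_matvec`, `richardson_sweep`, and the value of `scale` are hypothetical.

```python
import math


def laplacian_matvec(v):
    """Multiply the tridiagonal matrix tridiag(1, -2, 1) by the vector v."""
    D = len(v)
    return [
        (v[i - 1] if i > 0 else 0.0)
        - 2.0 * v[i]
        + (v[i + 1] if i < D - 1 else 0.0)
        for i in range(D)
    ]


def richardson_sweep(F, u_start, scale, mus):
    """One iteration u = (F + (K0 - mu I) u_old) / (1 - mu) per parameter mu,
    with K0 = scale * tridiag(1, -2, 1)."""
    u = list(u_start)
    for mu in mus:
        K0_u = [scale * x for x in laplacian_matvec(u)]
        u = [(F[i] + K0_u[i] - mu * u[i]) / (1.0 - mu) for i in range(len(u))]
    return u


D, scale = 6, 0.05  # scale is an illustrative stand-in for the factor in K0
# Eigenvalues of K0, sorted by increasing absolute value.
lams = [
    -scale * (2.0 - 2.0 * math.cos(k * math.pi / (D + 1)))
    for k in range(1, D + 1)
]
F = [1.0 + 0.1 * i for i in range(D)]
u = richardson_sweep(F, [0.0] * D, scale, lams)

# Residual of the limit system (I - K0) u = F after exactly D iterations.
K0_u = [scale * x for x in laplacian_matvec(u)]
residual = max(abs(u[i] - K0_u[i] - F[i]) for i in range(D))
```

In exact arithmetic the residual vanishes after $$D$$ iterations; in floating point it is only expected to be of the order of round-off.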

However, as discussed above, although the exact solution may theoretically be achieved after D iterations, D is expected to be quite large, so this property is useless from the practical point of view because of the high computational cost of the large number of iterations required. Even if the number of iterations were computationally acceptable, the exact solution would hardly ever be reached because of cumulative round-off errors.

That is why a compromise is typically adopted: instead of pursuing the exact solution of the iterative method, we opt for a strategy that yields an approximate solution with an acceptable computational effort. In this regard we fix a threshold to be reached at every time step $$t_n$$ within a number of iterations, say $$m_n$$, expected to be much smaller than D, which gives rise to the numerical approximation we are looking for.

Let us highlight that some authors made use of Chebyshev polynomials and the Minimax property they enjoy [35, 36, 40, 41]. This approach consists of taking the set $$\{\mu ^{(r)}\}_{r\ge 0}$$ as the roots of the Chebyshev polynomial re-scaled to the interval bounded by the eigenvalues of $$\textbf{K}_0$$ with minimum and maximum absolute value, say $$\lambda _1$$ and $$\lambda _D$$, that is

\begin{aligned} \mu ^{(r)} := \frac{\lambda _D-\lambda _1}{2}\cos \left( \frac{(2r-1)\pi }{2D}\right) + \frac{\lambda _D + \lambda _1}{2},\qquad 1\le r \le D. \end{aligned}
(38)
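As an illustration of (38), the re-scaled Chebyshev roots can be generated as follows. This is a minimal sketch assuming real eigenvalues $$\lambda _1$$ and $$\lambda _D$$; the function name and the sample values are hypothetical.

```python
import math


def chebyshev_parameters(lam_1, lam_D, D):
    """Return mu^(1), ..., mu^(D) as in (38) for real lam_1 and lam_D."""
    half_width = (lam_D - lam_1) / 2.0
    midpoint = (lam_D + lam_1) / 2.0
    return [
        half_width * math.cos((2 * r - 1) * math.pi / (2 * D)) + midpoint
        for r in range(1, D + 1)
    ]


# Illustrative extreme eigenvalues (smallest and largest absolute value).
cheb_mus = chebyshev_parameters(-0.01, -4.0, 8)
```

All the parameters lie strictly inside the interval bounded by $$\lambda _1$$ and $$\lambda _D$$ and are symmetric with respect to its midpoint.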

Our approach, based on (36), does not allow the use of the Minimax property. Moreover, it has a disadvantage: the choice (36) requires the computation of the whole spectrum of $$\textbf{K}_0$$, which for a large D may be highly costly, although such a computation is done once and for all, and in particular cases such as the one-dimensional Laplacian the spectrum is explicitly known. On the contrary, the choice (38) only requires the eigenvalues $$\lambda _1$$ and $$\lambda _D$$, which is much less expensive.

Since the proof of the main result follows the same ideas for both choices, for simplicity of presentation we opted in this paper for (36). In any case, the presence of a memory term in (30) makes new assumptions necessary in both cases in order to reach the error stability. In fact, if the choice (36) is made, then the new assumption is related to $$\sigma (\textbf{K}_0)$$ and reads

• [H3] The spectrum of $$\textbf{K}_0$$ is located in $$\mathbb {C}^-:=\{z\in \mathbb {C}: \text{ Re }(z)<0\}$$, except possibly for the eigenvalue 0, and there exists $$0<\rho <1$$ such that

\begin{aligned} \frac{\Vert \lambda _j-\lambda _r\Vert }{\Vert 1-\lambda _r\Vert }\le \rho , \qquad \lambda _j,\lambda _r\in \sigma (\textbf{K}_0). \end{aligned}
(39)

Two comments must be made on Hypothesis [H3]. First of all, Hypothesis [H3] is in general not too demanding in view of the form $$\textbf{K}_0$$ takes in many cases, for instance in (10), where $$\textbf{K}_0 = \tau ^\alpha A$$, $$\alpha >0$$. In that case [H3] reduces to taking the time step $$\tau$$ small enough.
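The effect of the time step on the ratio in (39) can be explored numerically. The sketch below assumes, as an illustration only, that $$A$$ is the tridiagonal matrix $$\text{tridiag}(1,-2,1)$$ of size $$D$$, whose eigenvalues are known in closed form, and that $$\textbf{K}_0=\tau ^\alpha A$$; the function name and the sample values of $$\tau$$, $$\alpha$$, and $$D$$ are hypothetical.

```python
import math


def h3_ratio(tau, alpha, D):
    """Largest value of |lam_j - lam_r| / |1 - lam_r| over the spectrum of
    K0 = tau**alpha * tridiag(1, -2, 1)."""
    scale = tau ** alpha
    lams = [
        -scale * (2.0 - 2.0 * math.cos(k * math.pi / (D + 1)))
        for k in range(1, D + 1)
    ]
    return max(abs(lj - lr) / abs(1.0 - lr) for lj in lams for lr in lams)


ratio_small_step = h3_ratio(0.01, 1.5, 16)  # candidate value for rho
ratio_large_step = h3_ratio(1.0, 1.5, 16)   # too large a step for (39)
```

For the small time step the ratio is well below 1, so a valid $$\rho$$ exists, whereas for $$\tau =1$$ the ratio exceeds 1 and (39) fails in this example.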

Moreover, if $$\sigma (\textbf{K}_0)\not \subset \mathbb {C}^-$$, a new family of operators whose spectrum lies in $$\mathbb {C}^-$$ may be obtained by conveniently shifting the original one. Although this hypothesis may look strange, it is not, since the spectrum of $$\textbf{K}_0$$ is closely related to the spectrum of $$\textbf{K}$$ and consequently to the conditions for the well-posedness of (1).

On the other hand if the choice (38) is made, one simply has to re-formulate the Hypothesis [H3] in the following form

• [H3’] The spectrum of $$\textbf{K}_0$$ is located in $$\mathbb {C}^-$$, except possibly for the eigenvalue 0, and there exists $$0<\rho <1$$ such that

\begin{aligned} \frac{\Vert \lambda _j-\mu ^{(r)}\Vert }{\Vert 1-\mu ^{(r)}\Vert }\le \rho , \qquad \lambda _j\in \sigma (\textbf{K}_0),\quad 1\le r\le D, \end{aligned}
(40)

where $$\mu ^{(r)}$$ are defined according to (38).

In that case the proof of the convergence follows the same steps as in Theorem 1 below.

Before stating the following result, recall that $$m_n$$ denotes the maximum number of iterations for the time step $$t_n$$, which according to the above discussion is expected to be much smaller than D.

### Theorem 1

Consider the iterative method (24)–(27), together with (36), under Hypothesis [H3]. Assume also that the bound (12) satisfies

\begin{aligned} C\tau <1/D, \end{aligned}
(41)

for $$\tau$$ small enough.

Then, given a tolerance $$TOL>0$$, there exist $$m_n>0$$, $$1\le n\le N$$, such that

\begin{aligned} \Vert \bar{e}_n^{(m_n)}\Vert \le TOL,\quad \text{ for }\quad 1\le n\le N. \end{aligned}
(42)

Notice that (41) is straightforwardly satisfied merely by taking a small enough time step; however, it can certainly be relaxed if a maximum number of iterations $$m<<D$$ is fixed in advance for each time step $$t_n$$, that is $$m_n\le m$$, for $$1\le n \le N$$. If so, (41) may be replaced by the less demanding condition $$C\tau <1/m$$.

Theorem 1 states sufficient conditions to keep the error of the proposed iterative method under a given tolerance. This result is reflected in the numerical experiments, which show that accurate numerical solutions can be reached with much lower computational effort than with classical/sequential approaches.

### Proof

The proof makes use of the error expression (30), where $$A^{(\eta )}$$ and $$B^{(\eta )}$$ are given by (31). According to this we have to prove that

$$\Vert A^{(\eta )}\Vert \rightarrow 0, \quad \text{ and }\quad \left\| A^{(\eta )} \displaystyle B^{(\eta )}\sum _{j=1}^{n-1}\textbf{K}_{n-j}\bar{e}_j^{(m_j)}\right\| \rightarrow 0,\quad \text{ as }\quad \eta \rightarrow +\infty ,$$

at every time level $$t_n$$.

To prove that $$\Vert A^{(\eta )}\Vert$$ tends to zero as $$\eta \rightarrow +\infty$$, observe that

\begin{aligned} A^{(\eta )} = \frac{P_\eta (\textbf{K}_0)}{P_\eta (1)} = \prod _{r=1}^{\eta } \frac{(\textbf{K}_0-\lambda _rI)}{1-\lambda _r}, \end{aligned}
(43)

whose spectrum consists of

\begin{aligned} \sigma \left( \frac{P_\eta (\textbf{K}_0)}{P_\eta (1)}\right) = \{0\}\cup \left\{ \prod _{r=1}^{\eta }\frac{\lambda _j-\lambda _r}{1-\lambda _r}\right\} _{j=\eta +1}^D, \qquad 1\le \eta \le m. \end{aligned}
(44)

Note that 0 turns out to be an eigenvalue of $$P_\eta (\textbf{K}_0)/P_\eta (1)$$ of multiplicity $$\eta$$, and by Hypothesis [H3], there exists $$0<\rho <1$$ such that

\begin{aligned} \frac{\Vert \lambda _j-\lambda _r\Vert }{\Vert 1-\lambda _r\Vert } \le \rho , \qquad 1\le r \le \eta , \quad 1\le j \le D. \end{aligned}
(45)

Therefore, from (44) we straightforwardly have that

\begin{aligned} \Vert A^{(\eta )}\Vert = \left\| \frac{P_\eta (\textbf{K}_0)}{P_\eta (1)}\right\| \le \rho ^\eta \rightarrow 0, \quad \text{ as }\quad \eta \rightarrow +\infty . \end{aligned}
(46)

In particular, once the maximum number of iterations $$\eta =m_n$$ is reached,

\begin{aligned} \Vert A^{(m_n)}\Vert = \left\| \frac{P_{m_n}(\textbf{K}_0)}{P_{m_n}(1)}\right\| \le \rho ^{m_n},\qquad 1\le n\le N. \end{aligned}
(47)

The first part of the proof then follows.

Regarding the term $$A^{(\eta )} \displaystyle B^{(\eta )}\sum _{j=1}^{n-1}\textbf{K}_{n-j}\bar{e}_j^{(m_j)}$$, let us first show the boundedness of $$A^{(\eta )} B^{(\eta )}$$. To this end we re-write such a term as follows

\begin{aligned} A^{(\eta )} B^{(\eta )} = \frac{P_\eta (\textbf{K}_0)}{P_\eta (1)}\sum _{j=1}^\eta P_{j-1}(1) P_j(\textbf{K}_0)^{-1} = \sum _{j=1}^\eta \left\{ \frac{1}{1-\lambda _j}\prod _{r=j+1}^\eta \frac{(\textbf{K}_0-\lambda _r I)}{(1-\lambda _r)} \right\} , \end{aligned}
(48)

and in the same manner as in (46) we have

\begin{aligned} \left\| \frac{1}{1-\lambda _j}\prod _{r=j+1}^\eta \frac{(\textbf{K}_0-\lambda _r I)}{(1-\lambda _r)}\right\| \le \frac{\rho ^{\eta -j}}{\vert 1-\lambda _j\vert } < \rho ^{\eta -j+1}. \end{aligned}
(49)

Therefore it easily follows that, for $$\eta \ge 1$$,

\begin{aligned} \left\| A^{(\eta )} B^{(\eta )} \right\| \le \sum _{j=1}^\eta \rho ^{\eta -j+1} \le \sum _{j=1}^\eta \rho ^{j} \le \rho \frac{1-\rho ^\eta }{1-\rho }, \end{aligned}
(50)

in fact,

\begin{aligned} \left\| A^{(m_n)} B^{(m_n)} \right\| \le \rho C_n,\quad 1\le n \le N,\quad \text{ where }\quad C_n:=\frac{1-\rho ^{m_n}}{1-\rho }. \end{aligned}
(51)

Recursively, and according to (12), (47), and (51), we easily arrive at

\begin{aligned} \Vert \bar{e}_n^{(m_n)}\Vert= & {} \left\| A^{(m_n)} B^{(m_n)}\sum _{j=1}^{n-1}\textbf{K}_{n-j}\bar{e}_j^{(m_j)} + A^{(m_n)}\bar{e}_n^{(0)} \right\| \\\le & {} \left\| A^{(m_n)} B^{(m_n)}\right\| \sum _{j=1}^{n-1}\Vert \textbf{K}_{n-j}\Vert \,\Vert \bar{e}_j^{(m_j)}\Vert + \Vert A^{(m_n)}\Vert \,\Vert \bar{e}_n^{(0)}\Vert \\\le & {} C\tau C_n\rho \sum _{j=1}^{n-1} \left\{ \rho ^{m_j}\Vert \bar{e}_j^{(0)}\Vert \prod _{r=j+1}^{n-1} (1+C\tau C_r\rho )\right\} \,\,+\,\, \rho ^{m_n} \,\Vert \bar{e}_n^{(0)}\Vert , \end{aligned}

for $$1\le n\le N$$. For simplicity of notation we have set the empty product $$\displaystyle \prod _{r=n}^{n-1} (1+C\tau C_r\rho )=1$$.

Finally, since $$C_n\le m_n$$, for $$1\le n\le N$$, and applying (41), if at every single time step $$t_n$$ the initial error of the iterative stage satisfies

\begin{aligned} \rho ^{m_j} \Vert \bar{e}_j^{(0)}\Vert \le \frac{TOL}{N(1+\rho )^N},\quad \text{ for }\quad 1\le j\le N, \end{aligned}
(52)

for a given tolerance $$TOL>0$$, then

\begin{aligned} \Vert \bar{e}_{n}^{(m_n)}\Vert \le TOL,\quad \text{ for }\quad 1\le n\le N, \end{aligned}
(53)

and the proof ends.
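Condition (52) can be turned into a rough count of the iterations it demands. The sketch below assumes that $$\rho$$, the norm of the initial error, $$TOL$$, and $$N$$ are known; the function name and the sample values are hypothetical, and since (52) is only a sufficient condition the resulting count is expected to be pessimistic.

```python
import math


def iterations_needed(rho, e0_norm, tol, N):
    """Smallest m such that rho**m * e0_norm <= tol / (N * (1 + rho)**N)."""
    target = tol / (N * (1.0 + rho) ** N)
    if e0_norm <= target:
        return 0
    # Both logarithms are negative, so the quotient is positive.
    return math.ceil(math.log(target / e0_norm) / math.log(rho))


# Illustrative values only.
sample_target = 1e-6 / (50 * 1.5 ** 50)
sample_m = iterations_needed(0.5, 1e-2, 1e-6, 50)
```

The count grows linearly with $$N$$ through the factor $$(1+\rho )^N$$, which illustrates why a good starting value $$\bar{u}_n^{(0)}$$ matters in practice.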

It is worth noting the following:

• The condition (52) may be slightly relaxed by replacing $$(1+\rho )^N$$ by $$(1+\rho )^{n-j-2}$$.

• Moreover, by the convergence of the original numerical solution $$\{\bar{u}_n\}_{1\le n\le N}$$, $$\Vert \bar{u}_{n}-\bar{u}_{n-1}\Vert =O(\tau )$$. Hence, in view of the convergence of the iterative method $$\{\bar{u}_n^{(m_n)}\}_{1\le n\le N}$$ and for a small enough time step $$\tau$$, $$\bar{u}_n^{(0)}=\bar{u}_{n-1}^{(m_{n-1})}$$ seems to be a suitable candidate for the starting value of the iterative method at time level $$t_n$$ satisfying (52).

## 5 CUDA implementation

The implementation of the parallel numerical scheme proposed in Section 3 for the problem (1), and theoretically analyzed in Section 4, is discussed more in-depth in the present section. In particular details related to the algorithm and technical resources involved in the practical implementation of the algorithms are shown below.

Instead of a MIMD (Multiple Instruction - Multiple Data) strategy, based on the cooperation of various Central Processing Units (CPUs), the proposed approach exploits a SIMD (Single Instruction - Multiple Data) methodology according to Flynn's taxonomy [42]. This methodology takes advantage of Graphics Processing Units (GPUs) and allows using them for general purposes through CUDA (Compute Unified Device Architecture), introduced by NVIDIA in 2006. In fact, GPUs were originally designed for graphical purposes such as the computation of geometrical image transformations (translations, rotations, scaling, projections), rendering operations for image finishing, or rasterization tasks.

Instead, the GPUs for General Purpose (GPGPU) paradigm aims to exploit the Arithmetic Logic Units (ALUs) of GPUs. In this context, CUDA provides a general purpose architecture based on an instruction set called Parallel Thread eXecution (PTX), and a set of extensions for various programming languages. In addition, in the CUDA architecture, the GPU (device) acts as a co-processor of the CPU (host), which invokes kernels, that is, functions that are executed by the device through threads.

Threads are the fundamental elements of the CUDA parallel design, since a thread executes a set of instructions on a data subset. Moreover, they have no activation/deactivation cost, and a large number of them can be used to achieve full efficiency. Threads are organized in blocks, and blocks are organized in a grid. Grid and blocks can be one-dimensional, two-dimensional or three-dimensional according to the specific problem (see Fig. 1). This structure based on grid, blocks, and threads simplifies the parallel implementation and takes advantage of a coordinate set accessible through the built-in variables gridDim, blockDim, blockIdx, and threadIdx.

The proposed method implementation through CUDA is based on a one-dimensional grid and one-dimensional blocks. Moreover, numerical results are obtained through Google Colab that made available a Tesla T4 GPU as shown in Fig. 2. The Tesla T4 Compute Capability (CC) is 7.5, and the CC category describes the GPU hardware structure and determines limits for the implementation phase.
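The index arithmetic behind a one-dimensional grid of one-dimensional blocks can be illustrated with a CPU emulation. The sketch below is not CUDA code: it only mimics, in plain Python, the usual mapping from `blockIdx.x`, `blockDim.x`, and `threadIdx.x` to a global element index; the names `launch_1d` and `copy_kernel` are hypothetical.

```python
def launch_1d(kernel, n_items, block_dim, *args):
    """Emulate on the CPU a 1D launch that covers n_items elements.

    Returns the number of blocks in the emulated grid.
    """
    grid_dim = (n_items + block_dim - 1) // block_dim  # blocks, rounded up
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n_items:  # guard for the last, partially filled block
                kernel(i, *args)
    return grid_dim


def copy_kernel(i, src, dst):
    """Each emulated thread copies one component of src into dst."""
    dst[i] = src[i]


source = [float(k) for k in range(10)]
destination = [None] * 10
n_blocks = launch_1d(copy_kernel, 10, 4, source, destination)
```

On the device the two nested loops disappear, since every thread evaluates its own index concurrently; the bounds check remains necessary whenever the number of elements is not a multiple of the block size.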

First of all let us recall the time discretization (27) of the problem (1)

\begin{aligned} \bar{u}_n^{(\eta )} = F_n + \Big (\textbf{K}_0-\mu ^{(\eta )} I\Big )\bar{u}_n^{(\eta -1)} + \mu ^{(\eta )} I \bar{u}_n^{(\eta )}, \quad \eta \ge 1, \end{aligned}
(54)

where $$F_n$$ stands for

\begin{aligned} F_n = \bar{u}_0 + \sum _{j=0}^{n-1} \textbf{K}_{n-j}\bar{u}_j + \bar{f}_n, \qquad n\ge 1. \end{aligned}
(55)
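The accumulation of the term (55) can be sketched sequentially as follows. This is a minimal illustration assuming that the weights are available as a list `K` of dense $$D\times D$$ matrices (lists of rows) and the accepted approximations as a list `u` of vectors; the names `matvec` and `history_term` are hypothetical, and the parallel version would distribute the matrix-vector products among threads.

```python
def matvec(M, v):
    """Dense matrix-vector product with M given as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def history_term(K, u, f_n, n):
    """F_n = u_0 + sum_{j=0}^{n-1} K_{n-j} u_j + f_n, following (55).

    K[j] is the j-th weight matrix, u[j] the approximation at t_j.
    """
    F = [u0_i + f_i for u0_i, f_i in zip(u[0], f_n)]
    for j in range(n):
        contribution = matvec(K[n - j], u[j])
        F = [F_i + c_i for F_i, c_i in zip(F, contribution)]
    return F


# Tiny illustrative data: K_j = j * I in dimension 2.
weights = [[[float(j), 0.0], [0.0, float(j)]] for j in range(3)]
history = [[1.0, 2.0], [3.0, 4.0]]
F_2 = history_term(weights, history, [0.5, 0.5], 2)
```

The cost of this term grows with $$n$$, since all previous time levels enter the sum, which is the main source of the computational burden discussed in the Introduction.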

Let $$0=t_0<t_1<\cdots <t_N=T$$ be a time mesh for the time discretization with $$t_n = t_0 + nh$$. Algorithm 1 describes the implementation of the numerical method (27). The prefix ’h’ denotes variables stored on the host, whereas the prefix ’d’ denotes variables stored on the device. Initially, lines 1 and 2 fix the dimensions of the blocks and of the grid, which allow the operations to be distributed among kernels through the provided structure. Then the weights (9) and the parameters (38) are determined according to line 3. In particular, the weights are stored in a multidimensional array d$$\_$$K $$\in \mathbb {R}^{D \times D \times N}$$ that contains N matrices of dimension $$D \times D$$, whereas the parameters are stored in the vector mu $$= \left( \mu ^{\left( 1 \right) }, \dots , \mu ^{\left( D \right) } \right) ^T \in \mathbb {R}^{D}$$.

Line 4 of the pseudocode summarizes the initialization of the starting term $$\bar{u}_0 + \bar{f} \left( t_0 \right)$$, stored in the variable d$$\_$$unew. Then line 5 describes the kernel that copies the vector d$$\_$$unew into the first column of the matrix d$$\_$$un $$= \left( \bar{u}_0, \dots , \bar{u}_N \right) \in \mathbb {R}^{D \times \left( N+1\right) }$$.

The proposed Richardson time-point NSWR method is based on iterations on each subinterval between two consecutive discretization points. Then, the term (55) is computed through the kernel at line 8 and stored in the vector d$$\_ F_n$$.

The iteration on the subinterval continues while the error estimate exceeds the tolerance and the maximum number of iterations has not been reached. Inside the loop, the method is applied according to (54), and afterwards lines 13–15 compute the error estimate at the point $$t_n$$ at iteration $$\eta =$$Niter:

\begin{aligned} \text{ EE}_n^{\left( \eta \right) } = \Vert \bar{u}_n^{\left( \eta \right) } - \bar{u}_n^{\left( \eta -1\right) } \Vert , \qquad 1 \le n \le N, \quad \eta \ge 1. \end{aligned}
(56)

In particular, at line 13 the kernel error computes the differences between the components of $$\bar{u}_n^{\left( \eta \right) }$$ and $$\bar{u}_n^{\left( \eta -1\right) }$$ and stores them in the vector d_err. This vector is then wrapped through thrust::device_ptr, which makes device memory accessible to Thrust algorithms, and the function thrust::reduce returns the error estimation (56).
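The inner loop at a fixed time level can be sketched sequentially as follows. This is a minimal illustration, not the authors' Algorithm 1: it assumes that the loop stops as soon as the estimate (56), measured in the sup norm, falls below the tolerance or the available parameters are exhausted, and that $$\textbf{K}_0$$ is supplied as a function `apply_K0`; all names are hypothetical.

```python
def iterate_time_step(apply_K0, F, u_start, mus, tol, max_iter):
    """Iterate (54) at one time level until the estimate (56) is below tol.

    mus must contain at least one parameter. Returns the last iterate,
    the number of iterations performed, and the last error estimate.
    """
    u_old = list(u_start)
    for n_iter, mu in enumerate(mus[:max_iter], start=1):
        K0_u = apply_K0(u_old)
        u_new = [
            (F_i + Ku_i - mu * uo_i) / (1.0 - mu)
            for F_i, Ku_i, uo_i in zip(F, K0_u, u_old)
        ]
        # Error estimate (56): sup norm of the last correction.
        ee = max(abs(a - b) for a, b in zip(u_new, u_old))
        u_old = u_new
        if ee <= tol:
            break
    return u_old, n_iter, ee


# Illustrative scalar operator K0 = -0.1 * I, whose only eigenvalue is -0.1.
u_end, iterations_done, last_estimate = iterate_time_step(
    lambda v: [-0.1 * x for x in v], [1.0, 2.0], [0.0, 0.0],
    [-0.1, -0.1, -0.1], 1e-10, 3,
)
```

In this toy case the first iteration already yields the solution of $$(I-\textbf{K}_0)\bar{u}=F$$, and the second one is needed only for the estimate (56) to detect it.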

Next, the approximate solution d_unew related to the discretization point $$t_n$$ is copied into d_uold at line 16. Finally, at line 19, the new approximation corresponding to the last iteration $$m_n$$ is stored in the matrix d_un.

Table 1 summarizes the variables related to Algorithm 1.

## 6 Numerical results

As performance metric for the parallel implementations we considered the Speed-Up, defined as the ratio between the runtimes (in milliseconds) of the sequential and parallel implementations,

\begin{aligned} \text{ Speed-Up } = \frac{\text{ CPUtime }}{\text{ GPUtime }} \end{aligned}
(57)

For the Speed-Up assessment, the following subsections present several tests applied to practical problems. The numerical results aim to evaluate the ability of parallel computing to speed up the solution of the problem. Moreover, they also allow for assessing the accuracy of the proposed approach and for empirically evaluating the convergence order of the method.

Results will be presented through tables that describe, in addition to Speed-Up, the dimension D of the system (1), sequential (CPU time) and parallel (GPU time) implementation times measured in milliseconds, and the error E referred to the exact solution $$\bar{u}\left( t \right)$$

\begin{aligned} \text{ E } = \max _{n=0,\dots ,N} \Vert \bar{u}_n^{\left( m_n\right) } - \bar{u} \left( t_n\right) \Vert , \end{aligned}
(58)

where $$m_n$$, defined in (42), represents the number of iterations needed to satisfy the tolerance at the point $$t_n$$. Finally, the tables show the Error Estimation EE, that is, the maximal value of the errors EE$$_n^{\left( m_n\right) }$$ defined in (56)

\begin{aligned} \text{ EE } = \max _{1\le n \le N} \Vert \text{ EE}_n^{\left( m_n\right) } \Vert = \max _{1\le n \le N} \Vert \bar{u}_n^{\left( m_n\right) } - \bar{u}_n^{\left( m_n-1\right) } \Vert . \end{aligned}
(59)

Notice that E stands for the true error yielded by the numerical method, while EE is only an estimate of it, which is useful as a stopping criterion when the analytical solution is not available. In our experiments the analytical solution is actually known; nevertheless, we also report the values of EE.

We will also show the mean number of iterations per time step $$t_n$$, i.e.,

\begin{aligned} m = \frac{m_1 + \cdots + m_N}{N} \end{aligned}
(60)

Moreover, let us notice that throughout the present section, the error will be measured in the sup norm.
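The quantities (57)–(60) reported in the tables can be computed from the stored results as sketched below. The function names and the sample data are hypothetical; numerical and exact solutions are assumed to be given as lists of vectors, one per time level.

```python
def sup_norm(v):
    """Sup norm of a vector."""
    return max(abs(x) for x in v)


def speed_up(cpu_ms, gpu_ms):
    """Speed-Up (57): ratio between sequential and parallel runtimes."""
    return cpu_ms / gpu_ms


def true_error(numerical, exact):
    """E in (58): maximum over the time levels of the sup-norm error."""
    return max(
        sup_norm([a - b for a, b in zip(u_n, u_exact)])
        for u_n, u_exact in zip(numerical, exact)
    )


def mean_iterations(m):
    """Mean number of iterations (60) over the time steps."""
    return sum(m) / len(m)


# Illustrative data only.
sample_error = true_error([[0.0, 0.0], [0.5, 0.4]], [[0.0, 0.0], [0.5, 0.5]])
sample_speed_up = speed_up(200.0, 4.0)
sample_mean = mean_iterations([3, 5, 4])
```

The estimate EE in (59) is obtained in the same way by replacing the exact solution with the previous iterate at each time level.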

### 6.1 Test problem 1

The first test problem consists of the system (1) with $$T=1$$, where

\begin{aligned} \bar{u}_0 = \textbf{0} \in \mathbb {R}^D, \qquad \bar{f} \left( t \right) = \left( t + \frac{4\sqrt{t^3}}{15\Gamma \left( \alpha \right) }, t, \dots , t, t - \frac{4\sqrt{t^3}}{15\Gamma \left( \alpha \right) } \right) ^T \in \mathbb {R}^D , \end{aligned}
(61)
\begin{aligned} \textbf{K} \left( t-s \right) = \frac{\left( t-s \right) ^{\alpha -1}}{\Gamma \left( \alpha \right) } A, \end{aligned}
(62)

with $$1< \alpha < 2$$, and the matrix A is the tridiagonal matrix

\begin{aligned} A = \left( \begin{array}{ccccc} -2 &{} 1 &{} 0 &{} \cdots &{} 0 \\ 1 &{} -2 &{} 1 &{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \ddots &{} \ddots &{} \vdots \\ \vdots &{} &{} 1 &{} -2 &{} 1 \\ 0 &{} \cdots &{} \cdots &{} 1 &{} -2 \\ \end{array} \right) \in \mathcal{M}_{D \times D}(\mathbb {R}), \end{aligned}
(63)

representing the one-dimensional Laplacian discretization by means of the classical second order finite differences scheme (without re-scaling with spatial mesh length).

The exact solution to test problem 1 is

\begin{aligned} \bar{u} \left( t \right) = \left( t, \dots , t \right) ^T \in \mathbb {R}^D . \end{aligned}
(64)

Tables 2 and 3 present the results provided by sequential and parallel implementation of the numerical method for $$N=50$$ and $$N=100$$, respectively. Notice that the values of Speed-Up increase as the system dimension D increases. In particular, the values of Speed-Up provided by the case $$N=100$$ are better than the ones provided by the case $$N=50$$.

### 6.2 Test problem 2

The second test problem is more complex than the previous one: instead of a single value of $$\alpha$$, a different value is associated with each integral equation of the system. It consists of the system (1) in which

\begin{aligned} \bar{u}_0 = \textbf{0} \in \mathbb {R}^D, \end{aligned}
(65)

and the vector $$\bar{f}\left( t \right)$$ has the following elements:

\begin{aligned} f_1 \left( t \right) = t + 2 \frac{t^{\alpha _1 + 1}}{ \Gamma \left( \alpha _1 \right) \alpha _1 \left( \alpha _1 + 1 \right) } - \frac{t^{\alpha _2+1}}{\Gamma \left( \alpha _2 \right) \alpha _2 \left( \alpha _2 + 1 \right) }, \end{aligned}
(66)
\begin{aligned} \begin{aligned}f_i \left( t \right) = t&- \frac{t^{\alpha _{i-1}+1}}{\Gamma \left( \alpha _{i-1} \right) \alpha _{i-1} \left( \alpha _{i-1} + 1 \right) } + 2 \frac{t^{\alpha _i + 1}}{ \Gamma \left( \alpha _i \right) \alpha _i \left( \alpha _i + 1 \right) } \\ {}&- \frac{t^{\alpha _{i+1}+1}}{\Gamma \left( \alpha _{i+1} \right) \alpha _{i+1} \left( \alpha _{i+1} + 1 \right) } \quad i=2,\dots ,D-1,\end{aligned} \end{aligned}
(67)
\begin{aligned} f_D \left( t \right) = t - \frac{t^{\alpha _{D-1}+1}}{\Gamma \left( \alpha _{D-1} \right) \alpha _{D-1} \left( \alpha _{D-1} + 1 \right) } + 2 \frac{t^{\alpha _D + 1}}{ \Gamma \left( \alpha _D \right) \alpha _D \left( \alpha _D + 1 \right) }. \end{aligned}
(68)

Then the convolution operator has the form

\begin{aligned} \textbf{K} \left( t-s \right) = A I\left( t-s \right) , \end{aligned}
(69)

where the matrix A is defined in (63), and the matrix $$I(t-s)$$ represents a diagonal matrix in which the diagonal elements are

\begin{aligned} I_{ii} \left( t-s \right) = \frac{\left( t-s\right) ^{\alpha _i -1}}{\Gamma \left( \alpha _i \right) } \qquad i=1,\dots ,D. \end{aligned}
(70)

With these data the analytical solution is the same as in test problem 1, that is (64).
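The source term (66)–(68) follows from inserting $$\bar{u}(t)=(t,\dots ,t)^T$$ into (1) with the kernel (69)–(70), since $$\int _0^t \frac{(t-s)^{\alpha -1}}{\Gamma (\alpha )}\,s\,\text{d}s = \frac{t^{\alpha +1}}{\Gamma (\alpha )\alpha (\alpha +1)}$$. The sketch below checks this identity by a simple midpoint quadrature and rebuilds the source term for the tridiagonal matrix (63); the function names are hypothetical, and the sample exponents are illustrative values in $$(1,2)$$, which is an assumption since the range used in the experiments is not restated here.

```python
import math


def moment(alpha, t):
    """Closed form of int_0^t (t-s)^(alpha-1)/Gamma(alpha) * s ds."""
    return t ** (alpha + 1) / (math.gamma(alpha) * alpha * (alpha + 1))


def moment_by_quadrature(alpha, t, n=20000):
    """Midpoint-rule approximation of the same integral (for alpha > 1)."""
    h = t / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        total += (t - s) ** (alpha - 1) * s
    return total * h / math.gamma(alpha)


def source_term(alphas, t):
    """Components (66)-(68) for A = tridiag(1, -2, 1) and u(t) = (t, ..., t)."""
    D = len(alphas)
    f = []
    for i in range(D):
        value = t + 2.0 * moment(alphas[i], t)
        if i > 0:
            value -= moment(alphas[i - 1], t)
        if i < D - 1:
            value -= moment(alphas[i + 1], t)
        f.append(value)
    return f


sample_alphas = [1.2, 1.5, 1.8, 1.35]  # illustrative exponents
sample_f = source_term(sample_alphas, 0.7)
```

With this source term, each row of the system returns $$t$$ when the convolution term is added back, which is the componentwise statement that (64) solves the problem.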

Tables 4 and 5 summarize the obtained results for $$N=50$$ and $$N=100$$, respectively. We also report in these tables the mean number of iterations (60) for each system dimension, because the selection of $$\alpha _1,\dots ,\alpha _D$$ through pseudo-random numbers influences the total number of iterations.

Tables 4 and 5 show the increase of the Speed-Up as the system dimension D grows. In particular, in the case N$$=100$$ and D$$=4096$$ we obtain a Speed-Up equal to 148.7127. We also notice that the obtained Speed-Ups are better than the ones of the previous numerical experiment.

### 6.3 Test problem 3

The last problem considered to test the proposed approach has the following specifications

\begin{aligned} \bar{u}_0 = \textbf{0} \in \mathbb {R}^D, \end{aligned}
(71)
\begin{aligned} \textbf{K} \left( t-s \right) = A I\left( t-s \right) , \end{aligned}
(72)

where $$I(t-s)$$ is the diagonal matrix defined in (70), and the matrix A now represents the approximation of the two-dimensional Laplacian operator with Neumann boundary conditions obtained by a second order finite difference scheme. In particular, by considering $$D_x$$ discretization points in the x-axis direction and $$D_y$$ discretization points in the y-axis direction, the matrix A belongs to $$\mathcal{M}_{D \times D}(\mathbb {R})$$ with $$D = D_x D_y$$ and is defined by

\begin{aligned} \left( \Delta _{\tau } \bar{u} \right) _{i,j} = \frac{\bar{u}_{i+1,j} + \bar{u}_{i-1,j} -4 \bar{u}_{i,j} + \bar{u}_{i,j+1} + \bar{u}_{i,j-1}}{\tau ^2}, \end{aligned}
(73)

for $$i=1,\dots , D_x, \quad j=1,\dots ,D_y$$. Once again the spatial mesh length does not play any role in this approach; this is why it is taken as 1, and the numbers of discretization points $$D_x$$ and $$D_y$$ are chosen as in Table 6.

In this problem, in order to derive the same analytical solution of the previous test problems

\begin{aligned} \bar{u} \left( t \right) = \left( t, \dots , t \right) ^T \in \mathbb {R}^D, \end{aligned}
(74)

the source term $$\bar{f}(t) = (f_1(t), f_2(t),\ldots ,f_D(t))^T$$ is defined by

\begin{aligned} f_i \left( t \right) = t - \sum _{j=1}^D a_{ij} \frac{t^{\alpha _j + 1}}{\Gamma \left( \alpha _j \right) \alpha _j \left( \alpha _j +1 \right) } \qquad i=1,\dots ,D, \end{aligned}
(75)

where $$a_{ij}$$ stand for the entries of the matrix A, and as done in test problem 2, the values $$\alpha _1,\dots ,\alpha _D$$ are generated through pseudo-random numbers and are related to each integral equation of the system.

Tables 7 and 8 present the results related to test problem 3 with N$$=50$$ and N$$=100$$, respectively.

Also in this case, the increase of the Speed-Up as the system dimension grows confirms the benefits of the parallelization. In fact, for D$$=4096$$, the sequential runs are more than a hundred times slower than the corresponding parallel ones for both N$$=50$$ and N$$=100$$.

### 6.4 Numerical results assessments

As mentioned above, another aim of the numerical tests is the experimental assessment of the convergence order. The analysis of the results arising from the three test problems is shown in Table 9, which presents the results obtained for a system of dimension D$$=32$$ as the step size of the time discretization is successively halved. Increasing the number of discretization points N decreases the error. The error in logarithmic scale, shown in Fig. 3, follows a slope of order 1, with the exception of the case N$$=3200$$ where there is a small deviation.

Thus the convergence theoretically analyzed in Section 4 is experimentally confirmed: the errors reported in Table 9 and represented in logarithmic scale in Fig. 3 follow a slope of order 1. The only exception is the second test problem for $$N=3200$$.
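The empirical order associated with a sequence of errors obtained by successively halving the step size can be estimated as $$\log _2(E_N/E_{2N})$$. The sketch below uses synthetic first-order errors, not the values of Table 9; the function name is hypothetical.

```python
import math


def empirical_orders(errors):
    """Observed orders log2(E_N / E_2N) for errors listed by increasing N,
    where N doubles from one entry to the next."""
    return [
        math.log2(e_coarse / e_fine)
        for e_coarse, e_fine in zip(errors, errors[1:])
    ]


# Synthetic errors behaving like C / N (illustration only).
synthetic_errors = [0.1 / N for N in (100, 200, 400, 800)]
observed_orders = empirical_orders(synthetic_errors)
```

Values close to 1 correspond to the slope of order 1 in a logarithmic plot of the error against N.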

According to the results presented in previous subsections and summarized in Fig. 4, the three test problems provide a considerable improvement in performance from sequential to parallel executions. The runtime saved allows the application of the proposed method to fields in which the response time represents a key point.

In fact, image and video processing is nowadays a widespread application field of the analyzed problem, since systems of type (1) appear rather frequently in related tasks (image filtering, clustering, video restoration). In this framework one of the main drawbacks is the size of the images and/or frames to be processed, which gives rise to computations with very large systems of equations (linear or even non-linear). Moreover, in many instances results are required in real time, for example in the restoration of damaged patches or in dis-occluding hidden parts of frames in video. In this regard see for instance the recent paper [43], where linear Volterra equations of type (1) are proposed for video restoration. In this context the response time is crucial, so for the related algorithms reducing the run time represents a great challenge and any advantage is welcome. Moreover, the convergence analysis supports the soundness of the proposed approach.

## 7 Conclusions and future works

NSWR methods and their parallel implementation by means of GPUs have shown very good performance, particularly when compared to the sequential implementation. The good performance extends beyond classical fractional integral equations to more general linear systems of Volterra equations such as the ones shown in Section 6.

Moreover, the numerical results obtained in Section 6 are in line with the theoretical convergence results in Section 4.

The statement of Theorem 1 may be extended without significant differences to a wider class of equations of type (1). In fact, one may assume that $$\textbf{K}(t)$$ admits an element-wise Laplace transform $$\widetilde{\textbf{K}}(z) =(\tilde{k}_{i,j}(z))_{1\le i,j\le D}$$ whose entries turn out to be analytic in the complex sector $$S_\varphi :=\{z\in \mathbb {C}: \Vert \arg (-z)\Vert < \varphi \}$$, $$0<\varphi <\pi /2$$, and for which there exist $$M,\nu >0$$ such that

$$\left\| \tilde{k}_{i,j}(z)\right\| \le \frac{M}{\Vert z\Vert ^\nu },\qquad 1\le i,j \le D,\quad z \notin S_\varphi .$$

In that case Corollary 3.2 in [16] extends (12) and says that there exists $$C>0$$ such that

$$\Vert \textbf{K}_n\Vert \le C t_n^{\nu -1} \tau , \quad 1\le n \le N,\quad \text{ and }\quad \Vert \textbf{K}_0\Vert \le C \tau ^{\nu -1}.$$

Under this hypothesis Theorem 1 follows in similar fashion.

We finally highlight that this paper focuses on the accuracy and Speed-Up analysis of the proposed parallelized approach for general purposes, that is, for linear systems of general Volterra equations rather than for particular applications. The application of this approach in the field of image processing represents a future development of the present work.