1 Introduction

In this paper we describe a specialised cyclic reduction (CR) algorithm for numerically solving linear algebraic equation systems:

$$\begin{aligned} \mathbf {A}\mathbf {x}=\mathbf {r}, \end{aligned}$$
(1)

in which \(\mathbf {x}=\left[ x_{1},\ldots ,x_{n}\right] \) is an n-dimensional vector of unknowns, \(\mathbf {r}=\left[ r_{1},\ldots , r_{n}\right] \) is an n-dimensional known vector of real numbers, and \(\mathbf {A}\) is a real \(n\times n\) known square matrix having the following quasi-tridiagonal structure:

$$\begin{aligned} \mathbf {A}=\left[ \begin{array}{ccccccc} b_{1} & c_{1} & d_{1} & e_{1} & & \cdots & 0\\ a_{2} & b_{2} & c_{2} & & & & \vdots \\ & a_{3} & b_{3} & c_{3} & & & \\ & & \ddots & \ddots & \ddots & & \\ & & & a_{n-2} & b_{n-2} & c_{n-2} & \\ \vdots & & & & a_{n-1} & b_{n-1} & c_{n-1}\\ 0 & \cdots & & f_{n} & g_{n} & a_{n} & b_{n} \end{array}\right] . \end{aligned}$$
(2)

Matrix \(\mathbf {A}\) differs from a purely tridiagonal matrix by additional, possibly non-zero, elements \(d_{1}\), \(e_{1}\), \(f_{n}\) and \(g_{n}\), present in the first and last rows of the matrix. A typical situation requiring the solution of linear systems with matrix (2) arises from finite difference discretisations of two-point boundary value problems for second order ordinary differential equations (ODEs), or from analogous discretisations of initial-boundary value problems for (typically parabolic) one-dimensional partial differential equations (PDEs) (for an overview of finite difference ODE/PDE solving, see, for example, Jain [1], Smith [2], and Ascher et al. [3]). In particular, the present work is motivated by finite-difference simulations occurring in electroanalytical chemistry [4, 5]. In such applications, coefficients \(a_{i}\), \(b_{i}\) and \(c_{i}\) for \(i=2,\ldots ,n-1\) result from the replacement of spatial (second and first) solution derivatives in the ODEs or PDEs by standard or compact three-point central approximations. Coefficients in the first and last rows of \(\mathbf {A}\) result, in turn, from one-sided three- or four-point, standard or compact approximations to the first spatial derivatives occurring in boundary conditions. The latter discretisations are perhaps not very popular, since many authors use just two-point (one-sided or central) approximations to the boundary derivatives, which lead to purely tridiagonal matrices. However, there is evidence that the use of the multipoint one-sided finite difference approximations improves the accuracy of the solutions (see, in particular, the literature pertinent to electrochemical digital simulations: [4, 6,7,8,9,10]). Numerical algorithms for solving Eq. (1) should therefore be of interest.

Three-point one-sided approximations to boundary derivatives usually have a theoretical accuracy order consistent with that of the three-point central derivatives in the ODEs or PDEs. In such a case coefficients \(e_{1}\) and \(f_{n}\) are simply zeroes. However, the use of more points for approximating derivatives at the boundaries may give still better results [4, 6,7,8,9,10]. Furthermore, we shall see that it is relatively easy to incorporate the non-zero \(e_{1}\) and \(f_{n}\) into the algorithm described below, whereas the consideration of further non-zero coefficients in the first and last rows of \(\mathbf {A}\) would be more complicated. For these reasons we admit the possibility of \(e_{1}\ne 0\) and \(f_{n}\ne 0\), in addition to \(d_{1}\ne 0\) and \(g_{n}\ne 0\).

In former works [4, 6,7,8,9,10] Eq. (1) was solved on serial computers by modifications of the classical Thomas algorithm [11], that is, by one or another version of the serial LU factorization approach. For an overview of the literature related to sequential algorithms for solving equations similar to Eq. (1), the Reader is referred to Bieniasz [12, 13]. In contrast to those serial algorithms, the numerical algorithm to be described here is an adaptation of the CR method, first described by Hockney [14] for tridiagonal matrices and by Buzbee et al. [15] for block-tridiagonal matrices, and attributed to Golub and Hockney. This choice is dictated by the growing importance of parallel and vector computers for scientific and technical computing. Fine-grained parallelism is inherent in CR, which has prompted many authors to implement this method on a number of parallel and vector computers or computer architectures (reviews of the method are available in [16,17,18,19]; a few example implementations of CR for scalar tridiagonal and block-tridiagonal systems are described in Refs. [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]). We shall focus on the most frequently considered, “stride of 2” version of CR (using the terminology of Evans [39]).

Apart from being used for scalar- and block-tridiagonal matrices, CR has been extended to periodic (cyclic) tridiagonal matrices [14, 40,41,42,43], which present a different class of quasi-tridiagonal matrices, compared to matrix (2).

Of course, matrix \(\mathbf {A}\) given by Eq. (2) can be viewed as a special case of banded matrices, for which a variety of parallel algorithms is available (see, for example, Refs. [44, 45]), possibly utilising the concept of CR in some way. In particular, extensions of CR to pentadiagonal and general banded matrices are known [46, 47]. However, since many of the elements on the external diagonals in \(\mathbf {A}\) are zeroes, the algorithms for general banded matrices are likely to be unnecessarily complex, and therefore computationally more expensive than the specially dedicated algorithm described here (the dependence of the arithmetic complexity of banded solvers on the matrix bandwidth is documented [48]). A similar unwanted overhead can be expected from algorithms that might be developed for other generalisations of matrix \(\mathbf {A}\) (for example for bordered matrices).

The CR algorithms and computer codes will be developed assuming that (as is often the case in practice) we need to solve a sequence of Eqs. (1) sharing the matrix \(\mathbf {A}\), but differing in the vectors \(\mathbf {r}\). We shall also assume that n can be an arbitrary positive integer, in contrast to the frequently adopted simplification that n or \(n-1\) is a positive integer power of 2. Of course, for small \(n=1,\,2,\,3\) some of the extra coefficients \(d_{1}\), \(e_{1}\), \(f_{n}\) and \(g_{n}\) must be zeroes, and/or matrix \(\mathbf {A}\) is no longer (quasi-)tridiagonal.

2 The CR algorithms

We shall begin the presentation of the algorithms with a brief reminder (in Sect. 2.1) of the “stride of 2” CR (hereafter called just CR, for brevity) for a purely tridiagonal matrix. Then, in Sect. 2.2, we shall provide details of the extensions of the method to the quasi-tridiagonal matrix \(\mathbf {A}\).

2.1 The tridiagonal system

The basic idea of CR is to recursively reduce a tridiagonal system to a smaller system possessing an analogous matrix structure, by eliminating every second equation and every second variable. A complete set of operations for performing such a reduction, for a particular system of size n, will be called a “reduction step”. The above idea is easiest to explain when n is sufficiently large, and the equation index i is sufficiently larger than 1 and smaller than n, so that the following subset of three adjacent equations is contained in the original tridiagonal system:

$$\begin{aligned} \left. \begin{array}{c} a_{i-1}x_{i-2}+b_{i-1}x_{i-1}+c_{i-1}x_{i}=r_{i-1}\\ a_{i}x_{i-1}+b_{i}x_{i}+c_{i}x_{i+1}=r_{i}\\ a_{i+1}x_{i}+b_{i+1}x_{i+1}+c_{i+1}x_{i+2}=r_{i+1} \end{array}\right\} . \end{aligned}$$
(3)

In order to eliminate the \(i-1\)st and \(i+1\)st equations, and retain the ith equation, the following operations are performed. The \(i-1\)st equation is multiplied by \(p_{i}=a_{i}/b_{i-1}\), the \(i+1\)st equation is multiplied by \(q_{i}=c_{i}/b_{i+1}\), and the resulting equations are subtracted from the ith equation, which then becomes:

$$\begin{aligned} a'_{i}x_{i-2}+b'_{i}x_{i}+c'_{i}x_{i+2}=r'_{i}. \end{aligned}$$
(4)

In Eq. (4) primes mark coefficients of the new, reduced system, so that \(a'_{i}=-p_{i}a_{i-1}\), \(b'_{i}=b_{i}-p_{i}c_{i-1}-q_{i}a_{i+1}\), and \(c'_{i}=-q_{i}c_{i+1}\) are new tridiagonal matrix coefficients, and \(r'_{i}=r_{i}-p_{i}r_{i-1}-q_{i}r_{i+1}\) is the element of the new right-hand side vector. Note that not only the \(i-1\)st and \(i+1\)st equations are eliminated, but also the unknowns \(x_{i-1}\) and \(x_{i+1}\).
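
As an illustration of this arithmetic, a single interior reduction might be coded as in the following minimal C++ sketch; the 1-based array names a, b, c, r are assumptions for illustration, not the actual data layout of the implementation described in Sect. 3.

```cpp
// One interior reduction (first reduction step): eliminate equations
// i-1 and i+1, and transform equation i into Eq. (4) using the
// multipliers p_i and q_i. Names and 1-based indexing are illustrative.
void reduce_interior(double a[], double b[], double c[], double r[], int i)
{
    const double p = a[i] / b[i - 1];                      // p_i
    const double q = c[i] / b[i + 1];                      // q_i
    const double an = -p * a[i - 1];                       // a'_i
    const double bn = b[i] - p * c[i - 1] - q * a[i + 1];  // b'_i
    const double cn = -q * c[i + 1];                       // c'_i
    r[i] = r[i] - p * r[i - 1] - q * r[i + 1];             // r'_i
    a[i] = an;  b[i] = bn;  c[i] = cn;
}
```

Note that only the retained row i is overwritten; the eliminated rows keep their coefficients, which is what later makes back-substitution possible.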

The above basic idea needs to be completed with additional refinements (a)–(f) listed below.

(a) First of all one has to decide which of the equations are to be eliminated, and which retained, in a given reduction step. This gives rise to (at least) two variants of the reduction step. In the “odd–even” variant odd equations are eliminated, and even equations are retained. Conversely, in the “even–odd” variant, even equations are eliminated and odd equations are retained. An issue usually overlooked is that the identification of the odd/even equations may depend on the direction of counting them. Although forward counting (from 1 to n) is common, backward counting (from n to 1) may give a different selection of the equations, since n can be either odd or even. A systematic application of forward or backward counting in successive reduction steps may also result in a somewhat different evolution of the related and subsequent calculations, and in slightly different machine error accumulation. The sequential numbering of the equations does not, however, imply that they must be transformed sequentially in some order. In fact, the operations performed on every second equation are entirely independent and can be performed in parallel.

(b) Initial and final equations in a sufficiently large system have to be handled somewhat differently from Eq. (3), in a given reduction step. Their treatment is also different in the odd–even and even–odd reduction step variants. Consider, in particular, the first two equations (the handling of the last two equations is symmetrical). If the first equation is to be eliminated, we subtract it from the second equation, after multiplying it by \(p_{2}\), just as in the case of the \(i-1\)st equation in the subsystem (3). The only difference is that we formally take \(a_{1}=0\). If the first equation is to be retained, we subtract from it the second equation multiplied by \(q_{1}\). Coefficient \(p_{1}\) is not needed in this case.

(c) Modifications of the procedure described for Eq. (3) are needed also in the case of small systems (\(n=1,2,3,4\)). We omit these details here, but the Reader will find all the relevant formulae in Tables 1 and 2 (see Sect. 2.2).

(d) Depending on how the reduction steps are applied to the equation systems, we can also distinguish “ordinary CR” and “parallel CR” (the name “parallel CR” is predominantly used, see Hockney and Jesshope [17], but an alternative name, “cyclic elimination”, is also encountered; see, for example, Gopalan and Murthy [30]).

In ordinary CR, after obtaining a first reduced system (having about n / 2 unknowns), an analogous reduction step variant is applied to it, giving a second reduced system (having about n / 4 unknowns), to which an analogous reduction step variant is again applied, etc. The reduction steps are continued until one finally obtains a single equation with a single unknown. A sequence of so-called “back-substitution” steps is then performed recursively. In the first back-substitution step the last reduced system (with the single unknown) is solved, and the result is substituted into the equations that were eliminated while constructing the last reduced system. In each subsequent back-substitution step the next portion of the unknowns is determined (this refers to the unknowns that were formerly eliminated from a given reduced system), based on the already determined unknowns, and substituted into the equations that were eliminated while constructing the given reduced system. These back-substitution steps are continued until all unknowns are determined. In particular, if the unknowns \(x_{i-2}\), \(x_{i}\) and \(x_{i+2}\) (and the remaining unknowns of the first reduced system containing Eq. (4)) are already determined, then back-substitution applied to Eq. (3) gives:

$$\begin{aligned}&x_{i-1}=(r_{i-1}-a_{i-1}x_{i-2}-c_{i-1}x_{i})/b_{i-1}, \end{aligned}$$
(5)
$$\begin{aligned}&x_{i+1}=(r_{i+1}-a_{i+1}x_{i}-c_{i+1}x_{i+2})/b_{i+1}, \end{aligned}$$
(6)

and so on, for other unknowns that were eliminated from the first reduced system. As the determinations of the individual unknowns in a given back-substitution step are all independent of each other, they can be performed in parallel.
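
A minimal sketch of such a step follows (odd–even variant, forward counting; the illustrative 1-based arrays a, b, c, r, x are the same assumptions as above). It covers the final back-substitution step, which recovers the unknowns eliminated in the first reduction step:

```cpp
// Final back-substitution step: each unknown eliminated in the first
// reduction step is recovered via Eqs. (5)-(6) from its already
// determined even-indexed neighbours. The iterations are mutually
// independent, so the loop parallelises trivially. x[1] (and x[n]
// for odd n) require the boundary variants of Table 3 instead.
for (int i = 3; i <= n - 1; i += 2)
    x[i] = (r[i] - a[i] * x[i - 1] - c[i] * x[i + 1]) / b[i];
```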

Table 1 Formulae pertinent to the reduction steps of matrix \(\mathbf {A}\)
Table 2 Formulae pertinent to the reduction steps of vector \(\mathbf {r}\)

Parallel CR consists in applying the odd–even and even–odd reduction steps simultaneously. In this way, after the first reduction step one obtains two reduced systems (instead of only one in ordinary CR), having altogether n unknowns. After the second step one obtains four systems, still having jointly n unknowns. After a sufficient number of steps one obtains n independent equations with one unknown each. The equations are then solved, which ends the calculations. Thus, parallel CR amounts to a diagonalization of the matrix \(\mathbf {A}\). Back-substitution steps are not needed in parallel CR. Note that the word “parallel” in the name of this method does not necessarily imply parallel execution; all calculations can be done either sequentially or in parallel.

(e) From the programming point of view, single instances of the data structures containing matrix \(\mathbf {A}\) and vectors \(\mathbf {r}\) and \(\mathbf {x}\) are sufficient in ordinary CR, since all reduction and back-substitution steps can be realised by gradually transforming the initial \(\mathbf {A}\) and \(\mathbf {r}\). It is convenient to introduce the stride s and half-stride \(h=s/2\). The first reduction step operates on (physical locations of) matrix rows and vector elements separated by \(s=2\), and s and h are doubled with every next reduction step. Later, they are halved with every back-substitution step. Hence, if Eqs. (3)–(6) are to be interpreted as formulae involving physical locations of the original matrix/vector elements, then indices \(i-1\), \(i+1\), \(i-2\), and \(i+2\) have to be replaced by \(i-h\), \(i+h\), \(i-s\), and \(i+s\), respectively. The primed quantities \(\mathbf {A}'\) and \(\mathbf {r}'\) then refer to the same data structures that contain \(\mathbf {A}\) and \(\mathbf {r}\). In the case of parallel CR, two instances of the data structures containing \(\mathbf {A}\) and \(\mathbf {r}\) appear necessary. If \(\mathbf {A}\) (or \(\mathbf {r}\)) is contained in one instance during a particular reduction step, then \(\mathbf {A}'\) (or \(\mathbf {r}'\)) can be placed in the second instance. In the next reduction step \(\mathbf {A}'\) (or \(\mathbf {r}'\)) plays the role of \(\mathbf {A}\) (or \(\mathbf {r}\)), and vice versa.
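
Under these conventions, the in-place reduction sweeps of ordinary CR can be sketched as follows. A purely tridiagonal system with \(a_{1}\) and \(c_{n}\) stored as zeroes is assumed (odd–even variant, forward counting); the small-n and boundary cases of Tables 1 and 2 are not reproduced here, and the array names are again illustrative:

```cpp
// In-place reduction sweeps over physical locations 1..n: the stride
// s doubles at each step, and each retained equation i absorbs its
// neighbours at i - h and i + h. For brevity the right-hand side is
// reduced together with the matrix; in the two-procedure organisation
// of point (f) below, the r-updates are factored out and the
// multipliers p, q are stored instead.
for (int s = 2; s / 2 < n; s *= 2) {
    const int h = s / 2;
    for (int i = s; i <= n; i += s) {      // iterations are independent
        const double p = a[i] / b[i - h];
        double an = -p * a[i - h];
        double bn = b[i] - p * c[i - h];
        double cn = 0.0;
        double rn = r[i] - p * r[i - h];
        if (i + h <= n) {                  // right neighbour exists
            const double q = c[i] / b[i + h];
            bn -= q * a[i + h];
            cn = -q * c[i + h];
            rn -= q * r[i + h];
        }
        a[i] = an;  b[i] = bn;  c[i] = cn;  r[i] = rn;
    }
}
```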

(f) In order to allow for multiple right-hand sides \(\mathbf {r}\) (assumed in Sect. 1) it is desirable to implement all reduction steps for matrix \(\mathbf {A}\) in one separate procedure. All reduction steps for vector \(\mathbf {r}\), and the back-substitution steps or determinations of \(\mathbf {x}\), should then be contained in a second separate procedure. Coefficients \(p_{i}\) and \(q_{i}\) (for all reduction steps) are computed by the first procedure, and must be stored and supplied to the second procedure. The total number of these coefficients is relatively small in the case of ordinary CR, but can be quite large for parallel CR. A possible shape of this decomposition is sketched below.
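
The following C++ sketch shows one possible interface for such a decomposition; all names are hypothetical, and only the storage of \(\mathbf {A}\) mirrors the class layout described in Sect. 3:

```cpp
#include <vector>

// Illustrative data layout and interface (all names hypothetical),
// separating the one-off matrix reduction from the per-r work, so
// that systems sharing A but differing in r are solved cheaply.
struct QuasiTridiagonal {
    std::vector<double> a, b, c;   // the three diagonals (length n)
    double d1, e1, fn, gn;         // extra first/last-row coefficients
};

struct CRFactors {
    // p_i and q_i for all reduction steps: O(n) numbers in total for
    // ordinary CR, but O(n log n) for parallel CR.
    std::vector<double> p, q;
};

// Procedure 1: reduce A in place and record every p_i and q_i.
CRFactors cr_reduce_matrix(QuasiTridiagonal& A);

// Procedure 2: reduce r, then back-substitute (ordinary CR) or solve
// the diagonalised equations (parallel CR); reusable for many r.
std::vector<double> cr_solve(const QuasiTridiagonal& A,
                             const CRFactors& f,
                             std::vector<double> r);
```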

2.2 The quasi-tridiagonal system

The main complication arising from nonzero coefficients \(d_{1}\), \(e_{1}\), \(f_{n}\) and \(g_{n}\) is that they cause the first and last equations to involve four unknowns, instead of only two in the purely tridiagonal case. Therefore, if the first equation is to be retained in a given reduction step, it is not sufficient to subtract from it the second equation multiplied by a relevant coefficient \(q_{1}\), as was done for the purely tridiagonal system (see point (b) in Sect. 2.1). It is also necessary to subtract the fourth equation multiplied by another coefficient (we shall use for this purpose the coefficient \(p_{1}\), unused in the purely tridiagonal case), in order to eliminate one more unknown from the retained equation (n is assumed sufficiently large, the cases of small n require a separate treatment). The coefficients \(p_{1}\) and \(q_{1}\) are thus calculated and used differently in the quasi-tridiagonal system case. After one reduction step, coefficient \(e'_{1}\) in the retained first equation will vanish. After two reduction steps coefficient \(d'_{1}\) will also vanish, and the first equation will become purely tridiagonal. A symmetrical situation arises when retaining the nth equation: an additional subtraction of the \(n-3\)rd equation multiplied by an extra coefficient \(q_{n}\) is necessary.
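
To make this concrete, the update of the retained first equation in the first reduction step can be written out explicitly. The fragment below is a sketch only, derived from the elimination just described: the complete case analysis is given in Tables 1 and 2, and the variable names are illustrative.

```cpp
// First reduction step, first equation retained (odd-even variant,
// n sufficiently large). Row 1 reads
//   b1*x1 + c1*x2 + d1*x3 + e1*x4 = r1.
// Subtracting q1 times row 2 eliminates x2, and subtracting p1 times
// row 4 eliminates x4, so the retained row couples x1, x3 and x5
// only; in the reduced system this means e'_1 = 0 already after this
// step (and d'_1 vanishes one step later).
const double q1 = c[1] / b[2];
const double p1 = e1 / b[4];
const double b1_new = b[1] - q1 * a[2];            // coefficient of x1
const double c1_new = d1 - q1 * c[2] - p1 * a[4];  // coefficient of x3
const double d1_new = -p1 * c[4];                  // coefficient of x5
const double r1_new = r[1] - q1 * r[2] - p1 * r[4];
```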

Table 3 Formulae pertinent to the calculations of vector \(\mathbf {x}\) in back-substitution steps (for ordinary CR only)

The above complication seems relatively easy to handle, but in fact it generates a number of new cases and sub-cases in the formulae for the reduction and back-substitution steps that were not needed for the purely tridiagonal system. Tables 1, 2 and 3 provide a summary of all the necessary cases and sub-cases, and the related formulae, pertinent to the reduction steps of matrix \(\mathbf {A}\) (Table 1), the reduction steps of vector \(\mathbf {r}\) (Table 2), and the back-substitution steps for ordinary CR (Table 3). The formulae from Tables 1, 2 and 3 apply to physical locations of the original elements of \(\mathbf {A}\), \(\mathbf {r}\), and \(\mathbf {x}\), with index i numbering all the matrix rows and vector elements (from 1 to n). Stride s and half-stride h depend on the reduction or back-substitution step, as was already noted (see point (e) in Sect. 2.1). The various cases and sub-cases correspond mostly to the different total numbers of equations contained in a particular reduced system, and to the different positions of the ith equation of the initial system (1) within this reduced system.

3 Numerical experiments

In order to test the CR algorithms described in Sect. 2, a large number of example systems (1) were solved. In the majority of these examples the elements \(a_{i}\), \(b_{i}\), \(c_{i}\), \(d_{1}\), \(e_{1}\), \(f_{n}\) and \(g_{n}\) were initially filled with nonzero pseudo-random real numbers having a uniform distribution over a certain interval \((u,v)\subset \mathbb {R}\). The sums of the absolute values of the off-diagonal elements were subsequently added to, or subtracted from, the diagonal elements \(b_{i}\), depending on whether \(b_{i}\) was positive or negative, respectively. In this way diagonally dominant matrices \(\mathbf {A}\) were obtained. Diagonal dominance is known to be important for the numerical stability of the CR method for purely tridiagonal scalar matrices [49]. In order to have exactly known solutions, vectors \(\mathbf {x}_{\text {exact}}\) of n elements were prepared by filling them with pseudo-random numbers. Vectors \(\mathbf {r}\) were then computed as

$$\begin{aligned} \mathbf {r}=\mathbf {A}\,\mathbf {x}_{\text {exact}}. \end{aligned}$$
(7)

After solving Eq. (1), the errors \(\mathbf {e}_{x}\) of the solutions were calculated as \(\mathbf {e}_{x}=\mathbf {x}-\mathbf {x}_{\text {exact}}\).

The above tests using pseudo-random matrices \(\mathbf {A}\) allow one to more comprehensively verify the correctness of the CR algorithms than any tests involving matrices resulting from finite-difference ODE or PDE discretizations. The latter matrices often show regular patterns of coefficients (many coefficients may even be identical), so that the effects of individual coefficients may not be noticed in the tests.
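
A minimal sketch of such a test-system generator is given below; the function and parameter names are assumptions, 0-based C++ vectors are used, and \(n\ge 4\) is assumed so that all four extra coefficients are meaningful.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Illustrative generator of one diagonally dominant pseudo-random
// test system (1)-(2), together with an exactly known solution and
// the right-hand side of Eq. (7). Assumes n >= 4.
void make_test_system(int n, double u, double v,
                      std::vector<double>& a, std::vector<double>& b,
                      std::vector<double>& c, double& d1, double& e1,
                      double& fn, double& gn,
                      std::vector<double>& x_exact, std::vector<double>& r)
{
    std::mt19937 gen(12345);                  // fixed seed, for brevity
    std::uniform_real_distribution<double> U(u, v);

    a.assign(n, 0.0); b.assign(n, 0.0); c.assign(n, 0.0);
    for (int i = 0; i < n; ++i)     b[i] = U(gen);
    for (int i = 1; i < n; ++i)     a[i] = U(gen);
    for (int i = 0; i + 1 < n; ++i) c[i] = U(gen);
    d1 = U(gen); e1 = U(gen); fn = U(gen); gn = U(gen);

    // Enforce diagonal dominance: push each b_i away from zero by the
    // sum of the absolute off-diagonal elements of row i.
    for (int i = 0; i < n; ++i) {
        double s = std::fabs(a[i]) + std::fabs(c[i]);
        if (i == 0)     s += std::fabs(d1) + std::fabs(e1);
        if (i == n - 1) s += std::fabs(fn) + std::fabs(gn);
        b[i] += (b[i] >= 0.0 ? s : -s);
    }

    // Exact solution and the corresponding right-hand side, Eq. (7).
    x_exact.resize(n);
    for (int i = 0; i < n; ++i) x_exact[i] = U(gen);
    r.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        if (i > 0)     r[i] += a[i] * x_exact[i - 1];
        r[i] += b[i] * x_exact[i];
        if (i + 1 < n) r[i] += c[i] * x_exact[i + 1];
    }
    r[0]     += d1 * x_exact[2] + e1 * x_exact[3];
    r[n - 1] += fn * x_exact[n - 4] + gn * x_exact[n - 3];
}
```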

All the previously mentioned variants of CR were tested: ordinary CR with odd–even and even–odd reduction, and parallel CR, assuming either forward or backward equation counting. Both purely sequential and parallel execution of the algorithms was examined (see the implementation details below). In addition to using CR, the example systems were also solved, for comparison, by several variants of the sequential LU decomposition method (or the Thomas algorithm [11]). These were: the Doolittle and the Crout methods (see, for example, Kincaid and Cheney [50]), with matrix factorization proceeding in either a forward or a backward sweep. The relevant algorithms, applicable to Eq. (1), were obtained by modifying the procedures described in Refs. [8, 12, 13]. Hybrid algorithms combining incomplete CR with LU decompositions were also tried, but we do not elaborate on them here, since they were not found to be more efficient under the conditions of the present implementation.

All computer programs were written in C++ using double precision (C++ double variables having 64 bits and about 16 decimal digits of precision, compliant with the IEEE 754 standard [51]), and compiled as 32-bit console applications under the Bloodshed/Orwell Dev-C++ 5.7.1 environment [52, 53], using the 32-bit release of the TDM-GCC 4.8.1 compiler. Matrices \(\mathbf {A}\) were implemented as class objects containing three vectors (of length n each) with coefficients \(a_{i}\), \(b_{i}\), \(c_{i}\), and four real variables \(d_{1}\), \(e_{1}\), \(f_{n}\) and \(g_{n}\). The parallelism of the programs was achieved by multithreading, using OpenMP directives [54] to decompose the loops over index i in the matrix/vector reduction steps and in the back-substitution steps into portions performed by separate threads. Multithreading is generally not expected to provide the most efficient parallel CR performance for scalar systems (1), because the cost of maintaining and communicating the threads is considerable in comparison with the cost of the CR calculations themselves, unless n is very large. OpenMP was previously used by Hirshman et al. [37] and Lecas et al. [38] in their CR codes for block-tridiagonal systems, in which case the cost of the CR calculations was larger relative to the parallel overheads. The most efficient CR performance can probably be achieved at present by using modern GPU programming [35, 36], but this requires suitable hardware, which is still not widely available. In contrast, OpenMP coding is relatively straightforward and applicable to the majority of currently available processors, and it is entirely sufficient for our main purpose of checking the correctness of the CR algorithms for Eq. (1) under conditions of parallel execution.
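
In essence, the loop decomposition amounts to constructs of the following kind (an illustrative fragment only, matching the reduction sweep sketched in Sect. 2.1):

```cpp
// Illustrative OpenMP decomposition of one reduction-step loop over
// the retained interior equations (stride s, half-stride h = s/2).
// The iterations are mutually independent, so the directive simply
// splits them among the available threads.
#pragma omp parallel for
for (int i = s; i <= n - h; i += s) {
    const double p = a[i] / b[i - h];
    const double q = c[i] / b[i + h];
    // ... update a[i], b[i], c[i] (and r[i]) with the reduction
    //     formulae; boundary rows are handled separately.
}
```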

Calculations were run mostly on a multicore computer with an Intel Core i7-4960X CPU, operating at 3.6 GHz, under the Windows 7 x64 Ultimate operating system. The computer allowed for a maximum of 12 simultaneous threads running on 6 independent cores.

4 Results and conclusions

There are two essential aspects of the numerical algorithms for solving Eq. (1) that should be of interest in practice: errors and computational times (including possible parallel speedups). These aspects are explored below.

In the case of small n one may expect that the relative errors \(\left\| \mathbf {e}_{x}\right\| _{\infty }/\left\| \mathbf {x}\right\| _{\infty }\) of the solutions result mostly from the relative perturbations \(\left\| \mathbf {e}_{r}\right\| _{\infty }/\left\| \mathbf {r}\right\| _{\infty }\) of vectors \(\mathbf {r}\), according to the well-known estimate of the maximum error [50]:

$$\begin{aligned} \frac{\left\| \mathbf {e}_{x}\right\| _{\infty }}{\left\| \mathbf {x}\right\| _{\infty }}\approx \text {cond}(\mathbf {A})\frac{\left\| \mathbf {e}_{r}\right\| _{\infty }}{\left\| \mathbf {r}\right\| _{\infty }}, \end{aligned}$$
(8)

where \(\text {cond}(\mathbf {A})\) is the condition number of matrix \(\mathbf {A}\). This is because matrix \(\mathbf {A}\) was known precisely in our experiments, but vectors \(\mathbf {r}\) differed slightly from their exact values, since they were calculated numerically from Eq. (7). Assuming that \(\left\| \mathbf {e}_{r}\right\| _{\infty }/\left\| \mathbf {r}\right\| _{\infty }\) is roughly at the level of the machine precision \(\nu \) (\(\nu \approx 1.11\times 10^{-16}\) in the case of double precision variables [51]), Eq. (8) predicts:

$$\begin{aligned} \log \frac{\left\| \mathbf {e}_{x}\right\| _{\infty }}{\left\| \mathbf {x}\right\| _{\infty }}\approx \log \nu +\log [\text {cond}(\mathbf {A})]. \end{aligned}$$
(9)
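
For example, for a matrix with \(\text {cond}(\mathbf {A})=10^{4}\), Eq. (9) predicts relative solution errors of roughly \(10^{-16}\times 10^{4}=10^{-12}\).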

As the matrices used in our experiments were random, their condition numbers also took random values.

With increasing n, the errors \(\left\| \mathbf {e}_{r}\right\| _{\infty }/\left\| \mathbf {r}\right\| _{\infty }\) are expected to exceed \(\nu \), owing to the machine errors generated in the arithmetic operations involved in formula (7). Similarly, the machine errors arising in the CR or LU procedures (omitted in Eqs. (8) and (9)) contribute increasingly to the solution errors \(\left\| \mathbf {e}_{x}\right\| _{\infty }/\left\| \mathbf {x}\right\| _{\infty }\), raising them above the maximum level indicated by these equations.

Fig. 1 Relative solution errors \(\left\| \mathbf {e}_{x}\right\| _{\infty }/\left\| \mathbf {x}\right\| _{\infty }\) obtained in experiments with pseudo-random coefficients of Eq. (1), by using ordinary CR with odd–even reduction, forward counting and sequential execution (a), and LU decomposition by the Doolittle method with matrix factorization in a forward sweep (b). The errors are plotted as functions of the condition number \(\text {cond}(\mathbf {A})\) of matrix \(\mathbf {A}\), for various values of the system dimension n from the interval [2, 2000], indicated by various degrees of shading. The pseudo-random coefficient values were chosen from the interval \((u,v)=(-10^{2},10^{2})\). Solid lines represent plots of Eq. (9)

Figure 1 demonstrates that the solution errors measured in the numerical experiments are consistent with the above expectations. Figure 1 shows typical error distributions for ordinary CR with odd–even reduction and forward counting, and for LU decomposition by the Doolittle method with matrix factorization in a forward sweep. The distributions look very similar, and they were alike for all the other methods tested. As can be seen, for \(n\le 2000\) the solution errors vary approximately in the interval \(10^{-16}\le \left\| \mathbf {e}_{x}\right\| _{\infty }/\left\| \mathbf {x}\right\| _{\infty }\le 10^{-11}\), so that, on average, a thousandfold increase of n corresponds to an increase of the errors by a factor of about \(10^{5}\). The generally small magnitudes of the errors confirm the overall correctness of the algorithms developed in this study.

The effect of the interval \((u,v)\) of pseudo-random coefficient values on the errors was investigated assuming \((u,v)=(-10^{2},10^{2})\), \((-10^{5},10^{5})\), \((-10^{10},10^{10})\), \((-10^{20},10^{20})\) and \((-10^{100},10^{100})\), but no significant changes of the error distributions were observed. For still larger intervals \((u,v)\), expected overflow errors occurred, preventing correct solutions from being obtained.

Computational times of the matrix \(\mathbf {A}\) reduction phase, or of the LU decomposition, varied between a minimum of about \(10^{-6}\) s for \(n=2\) and a maximum of about 1 s for \(n=10^{6}\) in the case of sequential execution, and between \(10^{-4}\) and 1 s in the case of parallel execution of the algorithms. This refers to all the algorithms examined, although differences in timings of up to two orders of magnitude were observed between the various algorithms, depending also on the number of threads. Parallel CR was systematically much slower than ordinary CR.

Comparable ranges of computational times were observed in the vector \(\mathbf {r}\) reduction and solution phases.

Fig. 2 Parallel speedups obtained for ordinary odd–even CR, by using sequential execution (white circles), two threads (black circles), four threads (black squares), and six threads (black triangles). Dotted lines indicate the speedup = 1. Subfigure a refers to the speedups of the matrix \(\mathbf {A}\) reduction phase, relative to the timings of the matrix LU decomposition phase by the Doolittle method with matrix factorization in a forward sweep. Subfigure b refers to the speedups of the vector \(\mathbf {r}\) reduction and back-substitution phases, relative to the timings of the solution phase in the LU decomposition method. All computational times were obtained as averages from 2000 program runs

At low n a performance loss of the parallel calculations, relative to the sequential calculations, was noticeable, due to the parallel overheads associated with maintaining and communicating the threads (see Sect. 3). Acceleration of the calculations resulting from the parallel execution was observed only at very large n (greater than about \(10^{4}\)). Parallel speedups (relative to the fastest sequential algorithm, which is LU decomposition) exceeding unity were recorded only in the case of ordinary odd–even or even–odd CR, at \(n\gtrsim 10^{5}\). The speedups corresponding to this particular situation are plotted in Fig. 2, as functions of n. As can be seen, with six threads the speedups reach at most about 1.4. Using more threads resulted in a decrease of the speedup. This was probably caused by the way the threads were distributed among the six available cores, although other factors, especially cache memory access and the role played by the operating system, might also be important; these questions were not studied further. It can also be seen that ordinary odd–even CR is about 2–2.5 times slower than LU decomposition in the case of sequential execution. This result is consistent with the arithmetic cost evaluations for purely tridiagonal systems [18].

The speedups obtained confirm that the particular software and hardware configuration used in the tests (multithreading using OpenMP on the multicore Intel Core i7-4960X CPU, operating at 3.6 GHz, under the Windows 7 x64 Ultimate) cannot be recommended for the most efficient implementations of cyclic reduction. Alternative configurations (in particular those using GPUs) are likely to be more attractive.