1 Introduction

The family of Boys functions [2] plays a central role in the calculation of molecular integrals over Gaussian-type basis sets. Namely, they appear in the one-electron nuclear attraction and two-electron Coulomb polycenter integrals [6]. The sheer number of integrals of the latter type required in typical quantum-chemical calculations calls for very efficient techniques for evaluating the underlying Boys functions.

Unfortunately, the n-th order Boys function

$$\begin{aligned} F_n(x) = \int _{0}^{1}t^{2n}e^{-xt^2}dt \end{aligned}$$
(1)

cannot be expressed in closed analytical form. A substantial number of numerical methods for its evaluation exist in the literature [4, 5, 7, 9–11]. The most efficient in practical applications are those which rely on pretabulating function values on a dense grid of points and applying relatively short Taylor-type expansions [4, 7, 12]. It should be noted that such methods require, apart from the necessary arithmetic operations, rather irregular memory accesses, which, depending on the particular computer architecture, may result in significant efficiency degradation. In the following sections, we analyze several possible ways of approximating the Boys function, focusing on the aspects which influence their efficient implementation on General Purpose Graphical Processing Units (GPGPUs).

Unfortunately, there is no easy way to estimate how the (numerical) accuracy of the results of quantum-chemical calculations depends on the accuracy of the underlying Boys function approximation. Nevertheless, the general consensus is that an approximation error of the order of \(10^{-12}{-}10^{-14}\) is fully acceptable in standard double-precision calculations. Hence, we aim at an accuracy of the order of \(10^{-13}\) when considering the numerical properties of a given approximation.

2 Boys function

2.1 Taylor expansion

Given that

$$\begin{aligned} \frac{\mathrm{d}F_n(x)}{\mathrm{d}x} = -F_{n+1}(x) \end{aligned}$$
(2)

and

$$\begin{aligned} F_n(0) = \frac{1}{2n+1} \end{aligned}$$
(3)

we easily obtain the explicit form of the Taylor expansion

$$\begin{aligned} F_n(x)= & {} \sum _{k=0}^\infty \frac{x^k}{k!}\frac{\mathrm{d}^kF_n}{\mathrm{d}x^k}(0)=\sum _{k=0}^\infty (-1)^k\frac{x^k}{k!(2k+2n+1)} \nonumber \\= & {} \sum _{k=0}^{k_{max}}(-1)^k\frac{x^k}{k!(2k+2n+1)} + R_{k_{max}+1} \end{aligned}$$
(4)

where

$$\begin{aligned} |R_{k_{max}+1}| \le \frac{x^{k_{max}+1}}{(k_{max}+1)!(2k_{max}+2n+3)}\underset{k_{max}\rightarrow \infty }{\rightarrow }0 \end{aligned}$$
(5)

To determine the applicability range of the Taylor expansion, we first estimate the largest argument for which the last retained term drops below the requested accuracy \(\varepsilon \). Requiring \(x^{k_{max}}/k_{max}! \approx \varepsilon \), applying Stirling's approximation \(k! \approx (k/e)^k\) and neglecting the slowly varying factor \(1/(2k_{max}+2n+1)\) yields

$$\begin{aligned} \tilde{x}_{max} = \exp \left( \frac{\log (\varepsilon )}{k_{max}}+\log (k_{max})-1\right) \end{aligned}$$
(6)

where \(\tilde{x}_{max}\) is a lower bound on the largest argument for which the expansion retains the requested accuracy \(\varepsilon \). While this estimate holds for any order n of the Boys function, it does not account for numerical inaccuracies. In order to investigate this issue, we have employed standard double-precision arithmetic to check the actual range in which the accuracy requirement holds for several expansion orders and compared it to the values obtained from Eq. (6). Looking at the results collected in Table 1, we see that the estimated \(\tilde{x}_{max}\) is smaller than, but fairly close to, the actual \(x_{max}\) for expansion orders up to \(k_{max}=50\). Beyond this threshold, numerical errors build up and degrade the accuracy. Hence, the Taylor series cannot be used for arguments exceeding 11, and even for arguments below this value it requires a fairly long expansion.

Table 1 Applicability of the Taylor expansion restricted to the first \(k_{max}\) terms
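
This behaviour is easy to reproduce numerically. The following minimal sketch (an illustration of the test, not the code used to generate Table 1) scans for the largest argument at which the truncated sum of Eq. (4) for \(n=0\) still agrees to \(10^{-13}\) with a quasi-exact reference, and compares it with the estimate of Eq. (6); the reference is a long extended-precision expansion, itself reliable only for the moderate arguments probed here:

```cpp
// Illustration: actual applicability range of the truncated Taylor series
// (Eq. 4) versus the estimate of Eq. (6), for n = 0 and eps = 1e-13.
#include <cmath>
#include <cstdio>

// Truncated Taylor series of F_n(x), Eq. (4), in double precision.
double boys_taylor(int n, double x, int kmax) {
    double sum = 0.0, term = 1.0;                    // term = x^k / k!
    for (int k = 0; k <= kmax; ++k) {
        sum += (k % 2 ? -term : term) / (2.0 * k + 2.0 * n + 1.0);
        term *= x / (k + 1);
    }
    return sum;
}

// Quasi-exact reference: the same series, long double, 200 terms.
long double boys_ref(int n, long double x) {
    long double sum = 0.0L, term = 1.0L;
    for (int k = 0; k <= 200; ++k) {
        sum += (k % 2 ? -term : term) / (2.0L * k + 2.0L * n + 1.0L);
        term *= x / (k + 1);
    }
    return sum;
}

int main() {
    const double eps = 1e-13;
    const int ks[] = {20, 30, 40, 50};
    for (int kmax : ks) {
        double x = 0.0;                              // scan for the breakdown point
        while (std::fabs(boys_taylor(0, x, kmax)
                         - (double)boys_ref(0, x)) < eps)
            x += 0.01;
        double est = std::exp(std::log(eps) / kmax
                              + std::log((double)kmax) - 1.0);   // Eq. (6)
        std::printf("kmax=%2d  actual x_max ~ %5.2f  Eq.(6) estimate ~ %5.2f\n",
                    kmax, x, est);
    }
    return 0;
}
```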

As a result, the straightforward Taylor expansion cannot be considered an effective method of approximating the Boys function over any reasonable domain.

2.2 Recurrence relations

After integrating Eq. (1) by parts, we obtain the downward recurrence relation

$$\begin{aligned} F_n(x) = \frac{2xF_{n+1}(x) + e^{-x}}{2n+1} \end{aligned}$$
(7)

which can be recast into the upward recurrence relation

$$\begin{aligned} F_{n+1}(x) = \frac{(2n+1)F_n(x) - e^{-x}}{2x}. \end{aligned}$$
(8)

Additionally

$$\begin{aligned} F_0(x) = \int \limits _{0}^{1} e^{-xt^2} dt = \frac{1}{\sqrt{x}}\int \limits _{0}^{\sqrt{x}} e^{-u^2} du \end{aligned}$$
(9)

where \(u = \sqrt{x}t\). The resulting form can be easily expressed in terms of the error function

$$\begin{aligned} {{\mathrm{erf}}}(x) = \frac{2}{\sqrt{\pi }} \int \limits _{0}^{x} e^{-t^2} dt \end{aligned}$$
(10)

as

$$\begin{aligned} F_0(x) = \frac{\sqrt{\pi }}{2\sqrt{x}}{{\mathrm{erf}}}(\sqrt{x}). \end{aligned}$$
(11)

2.2.1 Upward recursion

Together, Eqs. (8) and (11) may seem to form an effective recursive scheme, given that \({{\mathrm{erf}}}(x)\) can be calculated efficiently. Unfortunately, the formula given by Eq. (8) cannot be applied universally in finite-precision arithmetic, as it is numerically unstable for small arguments. This stems from the fact that for small values of the argument

$$\begin{aligned} F_n(x) = \frac{2xF_{n+1}(x) + e^{-x}}{2n+1} \approx \frac{e^{-x}}{2n+1}. \end{aligned}$$
(12)

Hence, the numerator of Eq. (8) is a difference of two almost equal numbers, which leads to catastrophic cancellation. For the orders of the Boys function considered here, numerical experiments show that the required double-precision accuracy is obtained only for \(x \gtrsim 6\).
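
The cancellation is easy to demonstrate. The sketch below (an illustration; the probed arguments and the reference construction are ours) seeds the upward recursion of Eq. (8) with Eq. (11) and reports the error of \(F_8\) against a long extended-precision Taylor reference; for small arguments most of the significant digits are lost:

```cpp
// Illustration: instability of the upward recursion (Eq. 8) for small x.
#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// Reference F_n(x) from a long Taylor expansion (Eq. 4) in long double.
long double boys_ref(int n, long double x) {
    long double sum = 0.0L, term = 1.0L;
    for (int k = 0; k <= 200; ++k) {
        sum += (k % 2 ? -term : term) / (2.0L * k + 2.0L * n + 1.0L);
        term *= x / (k + 1);
    }
    return sum;
}

int main() {
    const int N = 8;
    const double xs[] = {0.5, 2.0, 6.0, 10.0};
    for (double x : xs) {
        double F = 0.5 * std::sqrt(PI / x) * std::erf(std::sqrt(x)); // Eq. (11)
        const double ex = std::exp(-x);
        for (int n = 0; n < N; ++n)
            F = ((2.0 * n + 1.0) * F - ex) / (2.0 * x);              // Eq. (8)
        std::printf("x = %4.1f   |F_8 - ref| = %.1e\n",
                    x, std::fabs(F - (double)boys_ref(N, x)));
    }
    return 0;
}
```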

2.2.2 Downward recursion

Let us recall Eq. (7), which gives the formula for the downward recursion. It has to be noted that the downward recursion, despite being obtained from the same relationship as the upward one, exhibits strikingly different numerical properties. To demonstrate this, let us assume that an approximation \(\tilde{F}_{n+1}(x)\) of the Boys function \(F_{n+1}(x)\) differs from the exact value by \(\Delta _{n+1}\). Then, it follows from Eq. (7) that

$$\begin{aligned} \Delta _n= & {} \frac{e^{-x} + 2x(F_{n+1}(x) + \Delta _{n+1})}{2n+1} - F_n(x)\nonumber \\= & {} \frac{e^{-x} + 2xF_{n+1}(x)}{2n+1} + \frac{2x\Delta _{n+1}}{2n+1} - F_n(x)\nonumber \\= & {} F_n(x) + \frac{2x\Delta _{n+1}}{2n+1} - F_n(x)\nonumber \\= & {} \frac{2x\Delta _{n+1}}{2n+1} \end{aligned}$$
(13)

which, after k-fold repetition, yields

$$\begin{aligned} \Delta _n = \frac{(2x)^k\Delta _{n+k}}{(2n+1)(2n+3)\ldots (2n + 2k - 1)}. \end{aligned}$$
(14)

Given that

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{(2x)^k}{(2n+1)(2n+3)\ldots (2n + 2k - 1)} = 0, \end{aligned}$$
(15)

we can always find a value of k large enough that \(\Delta _n\) calculated from Eq. (14) is smaller than the required precision. Obviously, the convergence becomes slower for larger values of x. It should be noted that this approach is well known as Miller's algorithm [1] and has already been applied to the evaluation of several special functions.

In order to use the method described above, we first have to decide how to approximate \(F_{n+k}\), and then estimate how large k must be for a given n. We selected the following crude approximation

$$\begin{aligned} F_{n+k}(x) = \sqrt{\frac{\pi }{4x}}, \end{aligned}$$
(16)

trading initial accuracy for low computational cost.

Rewriting Eq. (14) as

$$\begin{aligned} \Delta _n = \frac{(2x)^k\Delta _{n+k}}{2^k(n+1/2)(n+3/2)\ldots (n + k - 1/2)} \end{aligned}$$
(17)

and estimating the right-hand side from above,

$$\begin{aligned} \Delta _n \lesssim \frac{x^k\Delta _{n+k}(n-1)!}{(n + k - 1)!} \end{aligned}$$
(18)

we obtain, upon taking logarithms,

$$\begin{aligned} \ln (\Delta _n) \lesssim k\ln (x) + \ln (\Delta _{n+k}) + \ln ((n-1)!) - \ln ((n+k-1)!). \end{aligned}$$
(19)

Applying crude upper and lower bounds for the factorial

$$\begin{aligned} 1 + n (\ln n -1) \le \ln (n!) \le 1 + (n+1)(\ln (n+1) -1), \end{aligned}$$
(20)

and taking advantage of the fact that \(\Delta _{n+k} < 1\), we may simplify the estimate of the right-hand side to

$$\begin{aligned} \ln (\Delta _n) \lesssim k\ln {x} + n\ln {n} - (n + k)\ln (n + k - 1) + k, \end{aligned}$$
(21)

which for values of k large with respect to n (corresponding to large values of x) yields

$$\begin{aligned} \ln (\Delta _n) - n\ln (n) \lesssim k(\ln {x} - \ln {k} + 1), \end{aligned}$$
(22)

and finally

$$\begin{aligned} \exp \left( \frac{\ln (\Delta _n) - n\ln (n)}{k}\right) \lesssim e\frac{x}{k}. \end{aligned}$$
(23)

The argument of the exponential function is negative; hence, the left-hand side approaches 1 from below for large values of k. We may therefore expect the smallest admissible k to grow linearly with x. Numerical experiments support this conclusion.

A linear fit of the form

$$\begin{aligned} n_t = a_n x + b_n \end{aligned}$$
(24)

where \(n_t\) denotes \(n+k\), yields

$$\begin{aligned} a_n= & {} -0.10453 \cdot n + 3.68823 \end{aligned}$$
(25)
$$\begin{aligned} b_n= & {} 0.76625 \cdot n + 16.00000 \end{aligned}$$
(26)

where the constant term was artificially increased to ensure that the fit is an upper bound. The explicit form of the estimated starting index for the downward recursion thus reads

$$\begin{aligned} n_t = (-0.10453 \cdot n + 3.68823)x + (0.76625 \cdot n + 16.00000). \end{aligned}$$
(27)

We note in passing that our results do not agree with those of Ref. [5], even though detailed numerical tests confirm the validity of our approximation.
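
A minimal sketch of the resulting scheme is given below (our illustration, assuming \(x > 0\) and taking n in Eq. (27) to be the highest required order; the function name boys_downward is ours):

```cpp
// Illustration: downward recursion (Eq. 7) seeded at the starting index n_t
// of Eq. (27) with the crude value of Eq. (16); valid for x > 0.
#include <cmath>
#include <cstdio>
#include <vector>

const double PI = 3.14159265358979323846;

std::vector<double> boys_downward(int N, double x) {
    // Starting index from Eq. (27), with n taken as the highest required
    // order N; the constant term of the fit was padded to ensure a bound.
    int nt = (int)std::ceil((-0.10453 * N + 3.68823) * x
                            + (0.76625 * N + 16.0));
    std::vector<double> F(nt + 1);
    F[nt] = std::sqrt(PI / (4.0 * x));                       // seed, Eq. (16)
    const double ex = std::exp(-x);
    for (int n = nt - 1; n >= 0; --n)
        F[n] = (2.0 * x * F[n + 1] + ex) / (2.0 * n + 1.0);  // Eq. (7)
    F.resize(N + 1);                                         // keep orders 0..N
    return F;
}

int main() {
    std::vector<double> F = boys_downward(4, 3.0);
    for (int n = 0; n < (int)F.size(); ++n)
        std::printf("F_%d(3.0) = %.14e\n", n, F[n]);
    return 0;
}
```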

2.3 Asymptotics

For large values of the argument, we can approximate the Boys function by extending the upper integration limit to infinity,

$$\begin{aligned} \tilde{F}_n(x) = \int \limits _{0}^{\infty }t^{2n}e^{-xt^2}\,\mathrm {d}t = \frac{(2n-1)!!}{2^{n+1}}\sqrt{\frac{\pi }{x^{2n+1}}}, \end{aligned}$$
(28)

where the approximation error can be estimated from

$$\begin{aligned} \tilde{F}_n(x) - F_n(x) = \int \limits _{1}^{\infty }t^{2n}e^{-xt^2}\,\mathrm {d}t = \frac{\varGamma (n+1/2,\, x)}{2x^{n+1/2}}. \end{aligned}$$
(29)

Specifically for \(n = 0\) we get

$$\begin{aligned} \tilde{F}_0(x) - F_0(x) = \frac{\sqrt{\pi }{{\mathrm{erfc}}}{(\sqrt{x})}}{2\sqrt{x}}. \end{aligned}$$
(30)

For \(x\gg 1\)

$$\begin{aligned} \frac{\sqrt{\pi }{{\mathrm{erfc}}}{(\sqrt{x})}}{2\sqrt{x}} \le \frac{e^{-x}}{2x} \end{aligned}$$
(31)

which shows that the estimated error decreases quickly as x grows. Actual numerical tests show that accuracy better than \(10^{-13}\) may be expected from the asymptotic formula for \(F_0(x)\) when x is larger than 26.

Higher-order Boys functions may be obtained from \(F_0\) using the upward recursion in its asymptotic form

$$\begin{aligned} \tilde{F}_{n+1}(x) = \frac{2n+1}{2x}\tilde{F}_{n}(x). \end{aligned}$$
(32)
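
For completeness, a minimal sketch of the asymptotic branch follows (our illustration; the function name boys_asymptotic is ours):

```cpp
// Illustration: asymptotic branch, Eq. (28) for F_0 and Eq. (32) upward.
#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

void boys_asymptotic(int N, double x, double* F) {
    F[0] = 0.5 * std::sqrt(PI / x);                     // Eq. (28), n = 0
    for (int n = 0; n < N; ++n)
        F[n + 1] = (2.0 * n + 1.0) / (2.0 * x) * F[n];  // Eq. (32)
}

int main() {
    double F[9];
    boys_asymptotic(8, 30.0, F);                        // x > 26, cf. above
    for (int n = 0; n <= 8; ++n)
        std::printf("F_%d(30.0) = %.14e\n", n, F[n]);
    return 0;
}
```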

2.4 Minimax approximation

To obtain a useful approximation in the non-asymptotic region, the Remez algorithm [3] was used. This method finds the polynomial or rational function that minimizes the maximum error over a given interval. This balanced distribution of the error is what distinguishes the method from Taylor- or Padé-type approximations, which are centred on a single expansion point.

The concept of the Remez algorithm can be summarized in a few steps. Let us consider a function F(x) to be approximated by a rational function R(x), with numerator and denominator orders n and m, respectively, on the interval [a, b]. The following steps are performed (steps 1 and 2 are illustrated in the sketch after the list):

1. Choose \(n+m+2\) points at which R(x) is required to coincide with F(x)

2. Find the coefficients of the rational function by solving the system of equations defined by these points

3. Calculate the error extrema. There are \(n+m+3\) of them: \(n+m+1\) between the chosen points and 2 additional ones located between the outermost points and the endpoints of the interval

4. Choose \(n+m+2\) of them

5. For each such x, set a new target value so that the errors become equal in magnitude (differing only in sign), obtaining a new set of points

6. Construct a new rational function which passes through all of them

7. Repeat the procedure from step 3 until satisfactory precision is achieved
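
In the standard formulation, steps 1 and 2 reduce to a linear system once the levelled error E is included as an additional unknown. The sketch below illustrates this for the purely polynomial case (\(m=0\)) with a toy target function \(e^x\) on [0, 1]; it is an illustration under these assumptions, not the production fitting code (see the Boost minimax program below):

```cpp
// Illustration of steps 1-2: at n+2 reference points x_i solve
//     sum_j c_j x_i^j + (-1)^i E = F(x_i)
// for the n+1 polynomial coefficients c_j and the levelled error E
// (purely polynomial case, toy target F = exp on [0, 1]).
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

const double PI = 3.14159265358979323846;

// Naive Gaussian elimination with partial pivoting.
std::vector<double> solve(std::vector<std::vector<double>> A,
                          std::vector<double> b) {
    const int n = (int)b.size();
    for (int i = 0; i < n; ++i) {
        int p = i;
        for (int r = i + 1; r < n; ++r)
            if (std::fabs(A[r][i]) > std::fabs(A[p][i])) p = r;
        std::swap(A[i], A[p]);
        std::swap(b[i], b[p]);
        for (int r = i + 1; r < n; ++r) {
            double f = A[r][i] / A[i][i];
            for (int c = i; c < n; ++c) A[r][c] -= f * A[i][c];
            b[r] -= f * b[i];
        }
    }
    std::vector<double> x(n);
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}

int main() {
    const int n = 3;                        // polynomial degree
    const int pts = n + 2;                  // n+m+2 reference points, m = 0
    std::vector<std::vector<double>> A(pts, std::vector<double>(pts));
    std::vector<double> b(pts);
    for (int i = 0; i < pts; ++i) {
        // Chebyshev-Lobatto points mapped to [0, 1] as the initial reference.
        double x = 0.5 - 0.5 * std::cos(PI * i / (pts - 1));
        double xp = 1.0;
        for (int j = 0; j <= n; ++j) { A[i][j] = xp; xp *= x; }
        A[i][n + 1] = (i % 2 == 0) ? 1.0 : -1.0;  // equioscillation column
        b[i] = std::exp(x);                       // toy target function
    }
    std::vector<double> c = solve(A, b);
    for (int j = 0; j <= n; ++j) std::printf("c_%d = % .12e\n", j, c[j]);
    std::printf("levelled error E = % .3e\n", c[n + 1]);
    return 0;
}
```

A full Remez run would then relocate the reference points to the error extrema (steps 3 to 6) and iterate until the extrema equioscillate.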

For our purposes we used the efficient implementation of the Remez algorithm in the minimax program provided with the Boost C++ libraries.

To improve the convergence of the procedure, we fitted Boys functions multiplied by a damping factor of the form \(e^{ax}\). In this respect our approach differs from that of Ref. [8], where, instead of applying a damping function, a specialized form of the Remez algorithm that takes the asymptotic behaviour of the Boys function into account was applied.

3 Implementation

3.1 GPGPU-specific aspects

General Purpose Graphical Processing Units are becoming ubiquitous in computational centres on the promise of significant calculation speedups at relatively low cost. Unfortunately, this promise is not an easy one to fulfill: in practice, substantial rework is often required owing to significant hardware dissimilarities. This is also the case for an efficient GPGPU implementation of any special function.

Table 2 Set of approximations used to evaluate the Boys functions using double-precision arithmetic for various highest orders required (N)
Table 3 Set of approximations used to evaluate Boys functions using single-precision arithmetic for various highest orders required (N)

First, it must be remembered that although the throughput of GPU memory is much higher than that of host RAM, the access latencies are also high. Such latencies may be well hidden behind the arithmetic workload, provided that the memory access pattern is very regular, optimally with consecutive threads accessing consecutive memory addresses. This is not the case for Boys function evaluation algorithms that rely on pretabulation and short Taylor expansions, where the accesses, owing to the distribution of the function arguments, are rather random. As a result, an arithmetic-heavy, memory-light implementation may be preferred.

Table 4 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for rational approximations used to evaluate \(F_0\) to \(F_2\) using double-precision arithmetic
Table 5 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for rational approximations used to evaluate \(F_3\) to \(F_6\) using double-precision arithmetic
Table 6 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for rational approximations used to evaluate \(F_7\) to \(F_8\) using double-precision arithmetic
Table 7 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for polynomial approximations used to evaluate \(F_0\) to \(F_2\) using double-precision arithmetic
Table 8 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for polynomial approximations used to evaluate \(F_3\) to \(F_6\) using double-precision arithmetic
Table 9 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for polynomial approximations used to evaluate \(F_7\) and \(F_8\) using double-precision arithmetic
Table 10 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for rational approximations used to evaluate \(F_0\) to \(F_3\) using single-precision arithmetic
Table 11 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for rational approximations used to evaluate \(F_4\) to \(F_8\) using single-precision arithmetic
Table 12 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for polynomial approximations used to evaluate \(F_0\) to \(F_6\) using single-precision arithmetic
Table 13 Polynomial coefficients (\(a_\mathrm {m}x^\mathrm {m}\)) for polynomial approximations used to evaluate \(F_7\) and \(F_8\) using single-precision arithmetic

Second, the performance of GPUs degrades considerably if so-called warp divergence occurs. In modern GPU architectures, threads are grouped into warps of 16 or 32, which execute in parallel on the same multiprocessor. If all threads in a warp take the same branch of the code, full speed can be expected. Conversely, if every branch is taken by some thread, the efficiency is as if every thread executed all code branches. If this kind of divergence is common, the expected cost of executing the algorithm should be evaluated as the sum of the operation counts over all branches, rather than as the average over single branches, as it would be in a CPU-based implementation.

Third, contrary to the case of modern CPUs, on GPUs there is a significant difference between the throughput of arithmetic operations executed on single- and double-precision floating-point numbers. This suggests that a single-precision implementation should be provided alongside the standard double-precision one whenever many function evaluations can be performed with lower accuracy.

3.2 Implementation details

For the double- and single-precision implementations, we aimed at maximum absolute errors of \(10^{-13}\) and \(3 \times 10^{-7}\), respectively. Several different approximations were combined to achieve the assumed accuracy with minimal computational effort. The choices made for each considered case are summarized in Tables 2 and 3. The orders of the employed polynomial or rational approximations are presented together with the form of the exponential damping factors and the type of recursion used to generate all but the highest-order (downward recursion) or all but the zeroth-order (upward recursion) Boys functions. The damping factors \(d_k(x)\), the actual functions \(f_k(x)\) fitted by the minimax procedure and the required Boys functions \(F_k(x)\) are connected according to the following formula:

$$\begin{aligned} F_k(x) = d_k(x)f_k(x), \end{aligned}$$
(33)

where k is 0 or N, depending on the type of recursion used in a given range of x.
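
The scheme can be summarized by the following sketch; the coefficients and the damping factor \(e^{-x/2}\) below are placeholders chosen for illustration only, not the values of Tables 2–13:

```cpp
// Illustration of Eq. (33): f_N(x) by the Horner scheme, F_N = d_N * f_N,
// lower orders by the downward recursion of Eq. (7). The coefficients and
// the damping d_N(x) = exp(-x/2) are placeholders, not the tabulated ones.
#include <cmath>
#include <cstdio>

const int N = 2;                           // highest order required
const double a[7] = {                      // placeholder coefficients
    1.23e-6, -4.56e-5, 7.89e-4, -1.01e-2, 9.87e-2, -4.32e-1, 8.76e-1};

void boys_minimax(double x, double* F) {
    double f = a[0];                       // Horner scheme for f_N(x)
    for (int i = 1; i < 7; ++i) f = f * x + a[i];
    F[N] = std::exp(-0.5 * x) * f;         // Eq. (33), placeholder d_N
    const double ex = std::exp(-x);
    for (int n = N - 1; n >= 0; --n)       // downward recursion, Eq. (7)
        F[n] = (2.0 * x * F[n + 1] + ex) / (2.0 * n + 1.0);
}

int main() {
    double F[N + 1];
    boys_minimax(1.5, F);
    for (int n = 0; n <= N; ++n)
        std::printf("F_%d(1.5) = %.14e (placeholder fit)\n", n, F[n]);
    return 0;
}
```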

The polynomial coefficients obtained by the minimax procedure are collected in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13. Both rational and polynomial approximations are provided, suitable for double- as well as single-precision computations.

3.3 Microoptimizations

Several code microoptimizations are applied in the actual implementation. The Horner scheme is used to evaluate polynomial values. Fused multiply-add (fma) operations are employed to make better use of the arithmetic units. Divisions, which are very costly on GPUs, are replaced by a sequence of inverse square root (rsqrt) and multiplication operations. The exponential-function arguments in the damping factors were chosen so as to minimize the number of required exponential evaluations. Since two-electron integral calculations usually require not only \(F_n(x)\) but also all lower-order Boys function values \(F_0(x), \ldots , F_{n-1}(x)\), all the required values are computed in a single call. For illustrative purposes, we present in Fig. 1 an exemplary code performing the \(F_0\) and \(F_1\) computation using double-precision arithmetic.

Fig. 1

Implementation of the \(F_0(x)\) and \(F_1(x)\) computation using double-precision arithmetic. See text for details
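
Since the figure itself is not reproduced here, the following sketch illustrates the listed microoptimizations in the same spirit; the coefficients and the damping factor are placeholders, std::fma maps to an fma instruction in device code, and the reciprocal square root would use the GPU rsqrt() intrinsic (here emulated portably). Note that the upward recursion step is only safe for sufficiently large x (cf. Sect. 2.2.1):

```cpp
// Illustration of the listed microoptimizations (placeholder coefficients
// and damping; in CUDA device code std::fma compiles to an fma instruction
// and 1/sqrt would be the rsqrt() intrinsic).
#include <cmath>
#include <cstdio>

void boys01(double x, double* F0, double* F1) {
    // Placeholder cubic for f_0(x); the real fits are of higher order.
    const double c3 = -2.0e-4, c2 = 3.1e-3, c1 = -2.4e-2, c0 = 1.0;
    double f = c3;
    f = std::fma(f, x, c2);                // Horner step as fused multiply-add
    f = std::fma(f, x, c1);
    f = std::fma(f, x, c0);
    *F0 = std::exp(-0.5 * x) * f;          // Eq. (33), placeholder d_0
    // 1/(2x) computed as (rsqrt(2x))^2 to avoid a costly division.
    double r = 1.0 / std::sqrt(2.0 * x);   // rsqrt() intrinsic on the GPU
    *F1 = (*F0 - std::exp(-x)) * (r * r);  // upward recursion, Eq. (8)
}

int main() {
    double F0, F1;
    boys01(8.0, &F0, &F1);                 // upward step safe for larger x
    std::printf("F0 = %.14e  F1 = %.14e (placeholder fit)\n", F0, F1);
    return 0;
}
```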

4 Summary

We have developed a computational scheme for the evaluation of the Boys function that is suitable for execution on General Purpose Graphical Processing Units (GPGPUs). The scheme combines polynomial and rational approximations, downward and upward recursions, and asymptotic approximations. This formulation differs substantially from CPU-specific ones, which stems from the fundamental architectural differences between typical GPGPUs and standard processors.

The proposed formulation allows the exploitation of the computational power offered by modern graphics processors, which is an important ingredient of an efficient implementation of two-electron integral calculations on GPGPUs.