Boys function evaluation on graphical processing units

We propose a computational scheme for evaluating the Boys function on General Purpose Graphical Processing Units (GPGPUs). The scheme combines polynomial and rational approximations, downward and upward recursions, and asymptotic approximations in a manner that facilitates efficient usage of hardware resources. Explicit formulas and implementation details are presented for two standard levels of accuracy.


Introduction
The family of Boys functions [2] plays a role of great importance in calculations of molecular integrals over Gaussian-type basis sets. Namely, they appear in the one-electron nuclear attraction and two-electron Coulomb interaction polycenter integrals [6]. The sheer number of integrals of the latter type required in typical quantum-chemical calculations calls for very efficient techniques of evaluating the underlying Boys functions. The Boys function of order n,

F_n(x) = \int_0^1 t^{2n} e^{-x t^2} \, dt,    (1)

cannot be expressed in closed analytical form. A solid body of numerical methods for its evaluation exists in the literature [4,5,7,9-11]. The most efficient in practical applications are those which rely on pretabulating function values on a dense grid of points and applying relatively short Taylor-type expansions [4,7,12]. It should be noted that such methods require, apart from the necessary arithmetic operations, some rather irregular memory accesses. Depending on the particular computer architecture, this may result in significant efficiency degradation.

In the next sections, we analyze several possible ways of approximating the Boys function, focusing particularly on the aspects which may influence their efficient implementation on General Purpose Graphical Processing Units (GPGPUs). Unfortunately, there is no easy way to estimate how the (numerical) accuracy of the results of quantum-chemical calculations depends on the accuracy of the underlying Boys function approximation. Nevertheless, the general consensus is that an approximation error of the order of 10^-12 to 10^-14 is fully acceptable in standard double-precision calculations. Hence, we aim at an accuracy of the order of 10^-13 when considering the numerical properties of a given approximation.
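As a point of reference for the approximations discussed below, Eq. (1) can be evaluated directly by numerical quadrature. The following sketch (our illustration, not part of the proposed scheme; the function name and interval count are our own choices) does so with composite Simpson's rule:

```cpp
#include <cassert>
#include <cmath>

// Direct evaluation of Eq. (1), F_n(x) = \int_0^1 t^{2n} e^{-x t^2} dt,
// by composite Simpson quadrature. Far too slow for production use,
// but handy as a trusted reference when validating fast approximations.
double boys_reference(int n, double x, int intervals = 2000) {
    auto f = [&](double t) { return std::pow(t, 2 * n) * std::exp(-x * t * t); };
    const double h = 1.0 / intervals;
    double sum = 0.0;
    for (int i = 0; i < intervals; ++i) {
        double t0 = i * h;
        // Simpson's rule on [t0, t0 + h]
        sum += (h / 6.0) * (f(t0) + 4.0 * f(t0 + 0.5 * h) + f(t0 + h));
    }
    return sum;
}
```

For a smooth integrand such as this one, the quadrature error of Simpson's rule with 2000 panels is far below the 10^-13 target, so the routine can serve as the "exact" value in accuracy experiments.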

Taylor expansion
Given that

\frac{dF_n(x)}{dx} = -F_{n+1}(x)    (2)

and

F_n(0) = \frac{1}{2n+1},    (3)

we easily obtain the explicit form of the Taylor expansion,

F_n(x) = \sum_{k=0}^{k_max} \frac{(-x)^k}{k! \, (2n+2k+1)} + R_{k_max+1},    (4)

where the remainder is bounded by the first omitted term,

|R_{k_max+1}| \le \frac{x^{k_max+1}}{(k_max+1)! \, (2k_max+2n+3)}.    (5)

To determine the applicability range of the Taylor expansion, we first estimate the dependence of the summand on the expansion order k_max. After applying some crude estimates, one arrives at the bound of Eq. (6), where \hat{x}_max is the lower bound of the maximum range in which the expansion holds the requested accuracy ε. While this estimation holds for any degree n of the Boys function, it does not account for numerical inaccuracies. In order to investigate this issue, we have employed standard double-precision arithmetic to check the actual range in which the accuracy estimation holds for several expansion orders and compared it to the values obtained from Eq. (6). Looking at the results collected in Table 1, we see that the estimated \hat{x}_max is smaller than, but fairly close to, the actual x_max for expansion orders up to k_max = 50. Beyond this threshold, the numerical errors build up and degrade the accuracy. Hence, the Taylor series cannot be used at arguments exceeding 11, and it still requires a fairly long expansion at arguments less than or equal to 11. As a result, the straightforward Taylor expansion cannot be considered an effective method of calculating approximations to the Boys function on any reasonable domain.
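The truncated sum of Eq. (4) maps onto a simple loop with a running term; the sketch below (names ours) follows it directly:

```cpp
#include <cmath>

// Truncated Taylor expansion of the Boys function about x = 0:
// F_n(x) ~ \sum_{k=0}^{kmax} (-x)^k / (k! (2n + 2k + 1)).
// The variable `term` carries (-x)^k / k! between iterations. As
// discussed in the text, double-precision round-off limits the usable
// range to arguments of roughly x <= 11 even for kmax = 50.
double boys_taylor(int n, double x, int kmax = 50) {
    double term = 1.0;                 // (-x)^k / k! at k = 0
    double sum = 1.0 / (2 * n + 1);    // k = 0 contribution, Eq. (3)
    for (int k = 1; k <= kmax; ++k) {
        term *= -x / k;
        sum += term / (2 * n + 2 * k + 1);
    }
    return sum;
}
```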

Recurrence relations
After integrating Eq. (1) by parts, we obtain the downward recurrence relation

F_n(x) = \frac{2x F_{n+1}(x) + e^{-x}}{2n+1},    (7)

which can be recast into the upward recurrence relation

F_{n+1}(x) = \frac{(2n+1) F_n(x) - e^{-x}}{2x}.    (8)

Additionally, substituting u = \sqrt{x} t in Eq. (1) for n = 0 yields

F_0(x) = \int_0^1 e^{-x t^2} \, dt    (9)
       = \frac{1}{\sqrt{x}} \int_0^{\sqrt{x}} e^{-u^2} \, du.    (10)

The resulting form can easily be expressed in terms of the error function as

F_0(x) = \frac{1}{2} \sqrt{\frac{\pi}{x}} \, erf(\sqrt{x}).    (11)
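The closed form of Eq. (11) translates into a one-liner; in the sketch below (naming ours) the small-argument guard is our addition, avoiding the indeterminate 0/0 form at x = 0:

```cpp
#include <cmath>

// Closed form for the lowest-order Boys function, Eq. (11):
// F_0(x) = (1/2) sqrt(pi/x) erf(sqrt(x)).
// For very small x we return the leading Taylor behaviour instead,
// F_0(x) = 1 - x/3 + O(x^2), since Eq. (11) is 0/0 at x = 0.
double boys_f0(double x) {
    const double pi = 3.14159265358979323846;
    if (x < 1e-15) return 1.0 - x / 3.0;
    return 0.5 * std::sqrt(pi / x) * std::erf(std::sqrt(x));
}
```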

Upward recursion
Together, Eqs. (8) and (11) allow the evaluation of Boys functions of arbitrary order by upward recursion starting from F_0(x). Note, however, that for small x both (2n+1)F_n(x) and e^{-x} approach unity. Hence, in the numerator of Eq. (8) we subtract two almost equal numbers, which may lead to numerical instability. For the orders of the Boys function considered, the required accuracy in double precision is obtained only for x ≳ 6, as verified by numerical experiments.
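The instability is easy to reproduce in host-side double precision. The following sketch (function names and the Taylor reference are our own scaffolding, not the production code) implements the upward recursion of Eq. (8) seeded with Eq. (11):

```cpp
#include <cmath>

// Upward recursion, Eq. (8), seeded with the closed form (11) for F_0.
// For small x the subtraction (2n+1) F_n(x) - e^{-x} cancels almost
// completely, and the division by 2x amplifies the resulting rounding
// error by (2m+1)/(2x) at every step.
double boys_upward(int n, double x) {
    const double pi = 3.14159265358979323846;
    double f = 0.5 * std::sqrt(pi / x) * std::erf(std::sqrt(x));
    const double ex = std::exp(-x);
    for (int m = 0; m < n; ++m)
        f = ((2 * m + 1) * f - ex) / (2.0 * x);
    return f;
}

// Trusted small-x reference: the Taylor series of Eq. (4).
double boys_series(int n, double x) {
    double term = 1.0, sum = 1.0 / (2 * n + 1);
    for (int k = 1; k <= 60; ++k) {
        term *= -x / k;
        sum += term / (2 * n + 2 * k + 1);
    }
    return sum;
}
```

At x = 0.1 the recursion has lost essentially all significant digits by n = 10, whereas at x = 10 it stays far inside the accuracy target, in line with the x ≳ 6 threshold quoted above.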

Downward recursion
Let us recall Eq. (7), which gives the formula for downward recursion. It has to be noted that the downward recursion, despite being obtained from the same relationship as the upward one, exhibits strikingly different numerical properties. To demonstrate this, let us assume that an approximation \tilde{F}_{n+1}(x) of the Boys function F_{n+1}(x) differs from the exact value by ε_{n+1}. Then, it follows from Eq. (7) that

ε_n = \frac{2x}{2n+1} ε_{n+1},    (13)

which, after k-fold repetition, yields

ε_n = (2x)^k \frac{(2n-1)!!}{(2n+2k-1)!!} ε_{n+k}.    (14)

Given that the double factorial in the denominator eventually grows faster than (2x)^k, we are able to find values of k large enough to make ε_n (calculated from Eq. (14)) smaller than the required precision. Obviously, convergence becomes slower for larger values of x. Such an approach is well known as Miller's algorithm [1] and has already been applied to the evaluation of several special functions.

In order to use the method described above, we have to first decide how to approximate F_{n+k}, and then estimate how large k must be for a given n. We selected a crude approximation for F_{n+k} (Eq. (15)); estimating the right-hand side of Eq. (14) from above, applying crude upper and lower bounds for the factorial,

1 + n(\ln n - 1) \le \ln(n!) \le 1 + (n+1)(\ln(n+1) - 1),

and taking advantage of the fact that ε_{n+k} < 1, we may simplify the estimate of the right-hand side to

\ln(ε_n) ≲ k \ln x + n \ln n - (n+k) \ln(n+k-1) + k,
which for values of k large with respect to n (corresponding to large values of x) yields

\ln(ε_n) - n \ln n ≲ k (\ln x - \ln k + 1),

and finally the exponential form of Eq. (24). The argument of the exponential function is negative, hence the left-hand side approaches 1 from above for large values of k. We may therefore expect that the upper estimate for the starting index depends linearly on x. Numerical experiments support this conclusion. A linear fit n_t = a_n x + b_n, where n_t denotes n + k, yields

a_n = -0.10453 \cdot n + 3.68823,    (25)
b_n = 0.76625 \cdot n + 16.00000,    (26)

where the last coefficient was artificially increased to ensure that an upper bound is obtained. This way, the explicit form of the estimated starting index for the downward recursion reads

n_t = (-0.10453 \cdot n + 3.68823) x + (0.76625 \cdot n + 16.00000).
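The resulting procedure can be sketched as follows. The zero seed for F_{n_t} is our illustrative choice, not necessarily the crude approximation used in the text; by Eq. (14) any O(1) error in the seed is damped away during the descent:

```cpp
#include <cmath>

// Miller-type downward recursion based on Eq. (7), started at the
// index n_t given by the linear fit above. The seed F_{n_t} = 0 is a
// deliberately crude guess: its error is at most 1/(2 n_t + 1), and
// the recursion multiplies it by 2x/(2m+1), which is well below one
// for most of the descent, so it decays far below the target accuracy
// before m reaches n.
double boys_downward(int n, double x) {
    int nt = static_cast<int>(
        std::ceil((-0.10453 * n + 3.68823) * x + (0.76625 * n + 16.0)));
    const double ex = std::exp(-x);
    double f = 0.0;                          // crude seed for F_{n_t}
    for (int m = nt - 1; m >= n; --m)
        f = (2.0 * x * f + ex) / (2 * m + 1); // Eq. (7)
    return f;
}
```

A single pass produces F_n; in practice all the intermediate values F_n, ..., F_{n_t-1} fall out of the same loop and can be stored.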
We note in passing that our results do not agree with those of Ref. [5]. This is so despite the fact that detailed numerical tests seem to confirm the validity of our approximation.

Asymptotics
For large values of the argument we can approximate the Boys function with

F_n(x) ≈ \frac{(2n-1)!!}{2^{n+1}} \sqrt{\frac{\pi}{x^{2n+1}}},

where the approximation error is the neglected tail of the integral,

ε_n(x) = \int_1^\infty t^{2n} e^{-x t^2} \, dt.

Specifically, for n = 0 we get

ε_0(x) \le \int_1^\infty t \, e^{-x t^2} \, dt = \frac{e^{-x}}{2x},

which shows that the estimated error decreases quickly as x grows. Actual numerical tests show that an accuracy better than 10^-13 may be expected from the asymptotic formula for F_0(x) when x is larger than 26. Higher-order Boys functions may be obtained from F_0 using the upward recursion in its asymptotic form,

F_{n+1}(x) ≈ \frac{2n+1}{2x} F_n(x),

obtained by neglecting the e^{-x} term in Eq. (8).
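The asymptotic evaluation path combines into a short routine (a sketch; naming ours):

```cpp
#include <cmath>

// Large-x evaluation: F_0 from the asymptotic formula with n = 0,
// F_0(x) ~ (1/2) sqrt(pi/x), then higher orders from the asymptotic
// upward recursion, i.e. Eq. (8) with the e^{-x} term dropped (it is
// negligible in this regime).
double boys_asymptotic(int n, double x) {
    const double pi = 3.14159265358979323846;
    double f = 0.5 * std::sqrt(pi / x);
    for (int m = 0; m < n; ++m)
        f *= (2 * m + 1) / (2.0 * x);   // F_{m+1} ~ (2m+1) F_m / (2x)
    return f;
}
```

Note that, unlike the exact upward recursion, this variant involves no subtraction and is numerically benign.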

Minimax approximation
To get a useful approximation for the non-asymptotic region, the Remez algorithm [3] was used. This method finds the polynomial or rational function that minimizes the maximum error over a given interval. This balanced distribution of the error is what distinguishes the method from Taylor-series or Padé-type approximations, which are centred on a single expansion point. The concept of the Remez algorithm can be summarized as follows. Let us consider a function F(x) that is to be approximated by a rational function R(x), with numerator/denominator orders n/m, on the range [a, b]. The following steps are performed:

1. Choose n + m + 2 points at which R(x) will pass through F(x).
2. Find the coefficients of the rational function by solving the system of equations defined by these points.
3. Calculate the error extrema. There will be n + m + 3 of them: n + m + 1 between the chosen points and 2 additional ones between the outermost points and the endpoints of the range.
4. Choose n + m + 2 of them.
5. For each such x, set the new value so as to make the errors equal (differing only in sign), obtaining a new set of points.
6. Construct a new rational function which passes through all of them.
7. Repeat the procedure from step 3 until satisfactory precision is achieved.

For our purposes we used an efficient implementation of the Remez algorithm in the minimax program, which is provided with the boost C++ libraries.
To improve the convergence of the procedure, we fitted Boys functions multiplied by a damping factor of the form e^{ax}. In this respect our approach differs from that of Ref. [8], where, instead of applying a damping function, a specific form of the Remez algorithm which takes the asymptotic behaviour of the Boys function into account was applied.

GPGPU-specific aspects
General Purpose Graphical Processing Units are becoming ubiquitous in computational centres on the promise of significant calculation speedup at relatively low cost. Unfortunately, the promise is not an easy one to fulfill. In practice, substantial rework is often required due to significant hardware dissimilarity. This is also the case for an efficient GPGPU implementation of any special function. First, it must be remembered that although the throughput of GPU memory is much higher than that of the host RAM, the access latencies are also high. Such latencies may be well hidden by the arithmetic burden, provided that the memory access pattern is very regular, optimally with consecutive threads accessing consecutive memory addresses. This is not the case for Boys function evaluation algorithms that rely on pretabulation and short Taylor expansions, where the access pattern, due to the distribution of function arguments, is rather random. As a result, an arithmetic-heavy and memory-light implementation may be preferred. Second, the performance of GPU units degrades considerably if so-called warp divergence occurs. In modern GPU architectures the threads are grouped into warps of 16 or 32, which are executed in parallel on the same multiprocessor. If all threads in a warp enter the same branch of the code, then full speed can be expected.
On the contrary, if every branch is visited by some thread, the efficiency is as if every thread executed all code branches. If this kind of divergence is common, the expected cost of the algorithm execution should be evaluated as the sum of the operation counts of all branches, rather than as an average over the single branches, as it would be in a CPU-based implementation. Third, contrary to the modern CPU case, on GPUs there is a significant difference between the efficiencies of arithmetic operations executed on single- and double-precision floating-point numbers. This suggests that a single-precision implementation should be provided together with the standard double-precision one if many function evaluations are expected to be performed with lower accuracy.

Table 4: Polynomial coefficients (a_m x^m) for the rational approximations used to evaluate F_0 to F_2 using double-precision arithmetic (see text for details).
Table 7: Polynomial coefficients (a_m x^m) for the polynomial approximations used to evaluate F_0 to F_2 using double-precision arithmetic (see text for details).
Table 9: Polynomial coefficients (a_m x^m) for the polynomial approximations used to evaluate F_7 and F_8 using double-precision arithmetic (see text for details).

Implementation details
For the double- and single-precision implementations, we aimed at maximum absolute errors of 10^-13 and 3 × 10^-7, respectively. A few different approximations were combined to achieve the assumed accuracy with minimal computational effort. The employed combinations are summarized in Tables 2 and 3. The orders of the employed polynomial or rational approximations are presented together with the form of the exponential damping factors and the type of recursion used to generate all but the highest-order (downward recursion) or all but the 0-th order (upward recursion) Boys functions. The damping factors d_k(x), the actual functions f_k(x) fitted by the minimax procedure, and the required Boys functions F_k(x) are connected according to the following formula:

f_k(x) = F_k(x) \, d_k(x),

where k is 0 or N depending on the type of recursion used in a given range of x. The polynomial coefficients obtained by the minimax procedure are collected in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13. Both rational and polynomial approximations are provided, suitable for double- and single-precision computations.

Table 10: Polynomial coefficients (a_m x^m) for the rational approximations used to evaluate F_0 to F_3 using single-precision arithmetic (see text for details).
Table 12: Polynomial coefficients (a_m x^m) for the polynomial approximations used to evaluate F_0 to F_6 using single-precision arithmetic (see text for details).

Microoptimizations
Several code microoptimizations are applied in the actual implementation. The Horner scheme is used to evaluate polynomial values. Fused multiply-add (fma) operations are employed to make better use of the arithmetic units. Divisions, which are very costly on GPUs, are replaced by sequences of inverse square root (rsqrt) and multiplication operations. The arguments of the exponential functions in the damping factors were chosen so that the number of required function evaluations is minimized. As two-electron integral calculations usually require not only the value of F_n(x) but also all lower-order Boys function values F_0(x), ..., F_{n-1}(x), all the required values are computed in a single call. For illustrative purposes, we present in Fig. 1 an exemplary code performing the F_0 and F_1 computation using double-precision arithmetic.
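The first three microoptimizations can be illustrated with a host-side sketch (ours, not the production kernel): std::fma stands in for the device fma instruction, and the reciprocal-square-root line for the rsqrt intrinsic.

```cpp
#include <cmath>

// Horner-scheme polynomial evaluation with fused multiply-add.
// Coefficients are ordered from the highest power down to the constant
// term; each step is a single fma, r = r * x + a[i].
double horner_fma(const double* a, int degree, double x) {
    double r = a[0];
    for (int i = 1; i <= degree; ++i)
        r = std::fma(r, x, a[i]);
    return r;
}

// Division-free pattern for 1/sqrt(x) and 1/x: on the device the first
// line maps to the fast rsqrt intrinsic (here emulated on the host),
// and squaring its result yields the reciprocal without a division in
// device code.
void recip_pair(double x, double* inv_sqrt_x, double* inv_x) {
    double r = 1.0 / std::sqrt(x);   // rsqrt(x) on the device
    *inv_sqrt_x = r;
    *inv_x = r * r;                  // 1/x
}
```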

Summary
We have developed a computational scheme for the evaluation of the Boys function which is suitable for execution on General Purpose Graphical Processing Units (GPGPUs). The scheme combines polynomial and rational approximations, downward and upward recursions, and asymptotic approximations. This formulation differs greatly from CPU-specific ones, a difference that stems from the fundamental architectural dissimilarity between typical GPGPUs and standard processors.
The proposed formulation allows the exploitation of the computational power offered by modern graphical processors, which is an important ingredient of an efficient implementation of two-electron integral calculations on GPGPUs.