Sound Approximation of Programs with Elementary Functions

Elementary function calls are a common feature in numerical programs. While their implementations in library functions are highly optimized, their computation is nonetheless very expensive compared to plain arithmetic. Full accuracy is, however, not always needed. Unlike arithmetic, where the performance difference between, for example, single and double precision floating-point arithmetic is relatively small, elementary function calls provide a much richer tradeoff space between accuracy and efficiency. Navigating this space is challenging. First, generating approximations of elementary function calls which are guaranteed to satisfy accuracy error bounds is highly nontrivial. Second, the performance of such approximations generally depends on several parameters which are unintuitive to choose manually, especially for non-experts. We present a fully automated approach and tool which approximates elementary function calls inside small programs while guaranteeing overall user-provided error bounds. Our tool leverages existing techniques for roundoff error computation and for the approximation of individual elementary function calls, and provides automated selection of many parameters. Our experiments show that significant efficiency improvements are possible in exchange for reduced, but guaranteed, accuracy.


Introduction
Numerical programs face an inherent tradeoff between accuracy and efficiency. For example, choosing a larger finite precision provides higher accuracy, but is generally more costly in terms of memory and running time. Not all applications, however, need a very high accuracy to work correctly. We would thus like to compute the results with only as much accuracy as is needed, in order to save resources.
Navigating this tradeoff between accuracy and efficiency is challenging. First, estimating the accuracy, i.e. bounding errors, is non-trivial due to the complex and discrete nature of finite-precision arithmetic which inevitably occurs in numerical programs. Second, the space of possible implementations with different performance characteristics is usually prohibitively large and thus cannot be explored manually.
The need for automated tool support has been recognized previously. Today, users can choose between different tools for analyzing the accuracy of floating-point programs [8,15,35,11,28,26], as well as for choosing between different precisions [5,10]. The latter tools perform mixed-precision tuning, i.e. they assign different floating-point precisions to different operations, and can thus improve the performance of a program w.r.t. a uniform-precision implementation. The success of such an optimization is usually limited to scenarios where one uniform precision is just barely not enough to satisfy a given accuracy specification.
Another possible target for performance optimizations are elementary functions, such as sine and exponential. Users by default choose library function implementations which are correctly rounded for single or double precision. Such implementations are, however, expensive. When such high accuracy is not needed, we can save significant resources by replacing elementary function calls by coarser approximations. Unfortunately, existing automated approaches [34,1] do not provide sound accuracy guarantees.
On the other hand, tools like Metalibm [3] provide a procedure for approximating individual elementary functions by polynomials with rigorous accuracy guarantees. They do not, however, consider entire programs, and they leave the selection of their parameters to the user, which limits their usability mostly to experts.
We present an approach and a tool which leverages an existing whole-program error analysis and Metalibm's elementary function approximator to provide both sound whole-program guarantees as well as efficient implementations for programs with elementary function calls. Given a user-provided target error specification, our tool fully automatically distributes the error budget among the floating-point implementation of arithmetic operations and the elementary functions, and selects a suitable polynomial degree for Metalibm to use.
We have implemented our approach and evaluate it on several benchmarks from the literature, comparing the performance of the generated programs against programs using library implementations. Naturally, we cannot compete with the highly optimized (and often hand-tuned) library implementations for target error specifications close to correct rounding. When such high precision is not required, however, our tool allows users to trade performance for larger, but guaranteed, error bounds. Our tool improves performance by 14% on average and up to 27% when approximating individual elementary function calls, and by 17% on average and up to 34% when approximating several function calls at once. These performance improvements incur overall whole-program errors which are only two to three orders of magnitude larger than those of double-precision implementations using libm functions, and which are well below the errors of single-precision implementations.
Contributions In summary, in this paper we present the first approximation technique for elementary functions with sound whole-program error guarantees, an efficient heuristic to select a suitable degree for the polynomial approximation, an extensive experimental evaluation on benchmarks from the literature, and an implementation, which will be released as open source.
Assuming library implementations for sine, our static analysis determines the worst-case absolute roundoff error of the result to be 3.44e-15. If such high accuracy is not needed, the user can specify a larger error in the postcondition: } ensuring(res => res +/- 1e-13) Our tool computes how much of the total error budget can be used to approximate each elementary function call and calls Metalibm to generate a (different) approximation for each elementary function call, in less than 5 minutes. In this case it determines that polynomial approximations with degrees 12 and 16, respectively, are suitable. This new implementation is approximately 2.9% faster than the implementation with libm functions. Our tool guarantees that the specified error bound is satisfied, but in fact it often provides tighter bounds; here, for instance, the actual error bound of the implementation with approximations is 2.09e-14, i.e. roughly an order of magnitude larger than the libm error.
This performance improvement is not very significant yet, but if we increase the error bound further, our tool generates programs which are 13.4% and 17.6% faster than a libm-based implementation (choosing degrees 24 and 20, and degrees 24 and 16, respectively). The actual errors of the generated implementation are 2.21e-13 and 1.56e-12, respectively. That is, we can save roughly half of the library function overhead, in exchange for somewhat larger errors. These errors are still much smaller than if we had used a single precision implementation, which incurs a total error of 1.85e-06.

Background
Before explaining our approach in detail, we present an overview of the necessary background. We provide a substantial amount of detail on the algorithm employed by Metalibm, as it is important for our implementation and experiments.

Floating-point Arithmetic
Efficiently representing real numbers on electronic computers is not a straightforward task. In general, each real number can only be represented in the machine with some finite precision. One of the most popular finite-precision representations is floating-point (FP) arithmetic. Extensive hardware and software library support make it convenient and efficient to use.
The IEEE 754 [18] standard defines several precisions and rounding modes, of which the most commonly used ones are single and double precision with arithmetic operations in rounding-to-nearest mode.
Any operation in finite precision is susceptible to rounding errors. When propagated through programs, these errors can amplify and lead to (catastrophic) degradation of output accuracy. A widely used error model [17] of an FP operation over two FP numbers x and y is (in the absence of overflows and with round-to-nearest rounding mode): x ∘_fl y = (x ∘ y)(1 + e) + d, with |e| ≤ ε_m and |d| ≤ δ_m, where ∘ ∈ {+, −, *, /} and ∘_fl denotes the respective floating-point version. Square root follows similarly, and unary minus does not introduce roundoff errors. The machine epsilon ε_m bounds the maximum relative error for so-called normal values. Roundoff errors of subnormal values, which provide gradual underflow, are expressed as an absolute error, bounded by δ_m. For single and double precision, respectively, ε_m = 2^−24, δ_m = 2^−150 and ε_m = 2^−53, δ_m = 2^−1075. This model provides a convenient way to analyze small numerical expressions and yields tight error bounds. However, manual analysis of whole programs quickly becomes tedious, and automated tool support is essential.
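This model can be checked empirically. The following sketch (ours, written for exposition) uses exact rational arithmetic to verify that double-precision multiplication satisfies the combined bound ε_m·|x ∘ y| + δ_m:

```python
import random
from fractions import Fraction

# IEEE 754 double-precision constants of the model above:
# fl(x * y) = (x * y)(1 + e) + d, with |e| <= EPS_M and |d| <= DELTA_M.
EPS_M = Fraction(1, 2**53)
DELTA_M = Fraction(1, 2**1075)

def check_mult_model(x: float, y: float) -> bool:
    """Check |fl(x*y) - x*y| <= EPS_M*|x*y| + DELTA_M using exact rationals."""
    exact = Fraction(x) * Fraction(y)   # real-valued product, no rounding
    rounded = Fraction(x * y)           # double-precision product
    return abs(rounded - exact) <= EPS_M * abs(exact) + DELTA_M

random.seed(0)
assert all(check_mult_model(random.uniform(-1e6, 1e6),
                            random.uniform(-1e6, 1e6))
           for _ in range(10_000))
```

The δ_m term matters only for subnormal results; for the sampled normal values, the relative term ε_m·|x·y| already dominates.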

Roundoff Error Analysis in Daisy
Daisy [8] is a static analysis-based tool for soundly bounding worst-case absolute finite-precision roundoff errors: max_{x ∈ [a,b]} |f(x) − f̃(x̃)|, where f and x are a mathematical real-valued arithmetic expression and variable, respectively, and f̃ and x̃ are their finite-precision counterparts. This definition extends to multivariate f component-wise.
Worst-case roundoff errors depend substantially on the ranges of the inputs, which have to be provided by the user ([a, b] is the range for x, given in a precondition). Roundoff errors due to arithmetic operations equally depend on the ranges of all intermediate expressions, but these cannot be easily provided by the user.
Given an arithmetic expression, together with an input domain, Daisy performs a two-step dataflow static analysis. In the first step, it uses interval [27] or affine arithmetic [14] to compute the ranges of all intermediate expressions, and in the second step it uses this information to compute and propagate roundoff errors. The propagation is performed with affine arithmetic [14].
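The two-step analysis can be illustrated with the following simplified sketch (ours, for exposition only): each value carries an interval range and a worst-case absolute error bound; Daisy itself uses affine arithmetic for tighter bounds and sound outward rounding of interval endpoints, which this sketch omits.

```python
EPS = 2.0 ** -53  # double-precision machine epsilon of the model above

class Val:
    """An interval [lo, hi] paired with a worst-case absolute error bound."""
    def __init__(self, lo, hi, err=0.0):
        self.lo, self.hi, self.err = lo, hi, err

    def _mag(self):
        return max(abs(self.lo), abs(self.hi))

    def __add__(self, o):
        # propagated errors add up, plus a fresh roundoff of the addition
        r = Val(self.lo + o.lo, self.hi + o.hi, self.err + o.err)
        r.err += EPS * r._mag()
        return r

    def __mul__(self, o):
        ps = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        # first-order propagation: |x|*err_y + |y|*err_x
        prop = self._mag() * o.err + o._mag() * self.err
        r = Val(min(ps), max(ps), prop)
        r.err += EPS * r._mag()
        return r

x = Val(1.0, 2.0)                 # input assumed exact for simplicity
res = x * x + Val(0.5, 0.5)       # range step gives [1.5, 4.5]
assert res.lo == 1.5 and res.hi == 4.5
assert 0 < res.err < 1e-14
```

Note that plain interval arithmetic ignores correlations between subexpressions (x * x is treated as a product of independent intervals), which is one reason Daisy prefers affine arithmetic.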
This analysis is efficient and computes tight error bounds for straight-line arithmetic expressions over +, −, * , /, √ as well as the most commonly used elementary function calls (sin, cos, exp, log, tan, assuming a slightly larger error bound). Daisy currently does not support loops or conditional branches; for a discussion on the challenges involved, we refer the reader to [9,16].

Standard libm and its Limitations
An inherent part of scientific and financial computations are mathematical functions. The standard mathematical library containing elementary functions (exp, log, sin, cos and their inverses, etc.) and special functions (the power x^y, erf, Γ, etc.) is libm. This library is fully specified in the C language standard (ISO/IEC 9899:2011), though the 2008 revision of the IEEE 754 floating-point standard has also attempted a standardization. There are various implementations of libm that depend on the operating system and programming language; here, when referring to libm, we mean the GNU libc implementation. GNU libm provides a reference implementation of basic mathematical functions targeting double- and single-precision inputs. In scientific computing the performance of libm is of crucial importance. For example, large-scale simulation codes in high-energy physics run at CERN [19] rely on algorithms for the reconstruction of particle collision coordinates that spend more than half of their time computing the function exp. Similarly, the SPICE simulator [21] of analog circuits is based on solutions of non-linear differential equations and spends up to three quarters of its time computing elementary functions. Finally, Figure 1 illustrates a rough estimate of the time spent on calls to libm in our set of benchmarks.
Using the standard mathematical library in practice has several limitations. Standard libm functions were implemented to work with arguments on the whole range of representable numbers of the target format, and to produce results which are as accurate as the target format allows. In practice, this is often more than necessary. For example, the calls to cosine in some CERN simulations [20] always remain within one period and, due to the noise in measurements, the result only needs to be correct to a few digits. There is thus a clear need for more flavors of elementary functions: ones specialized to non-standard domains and target accuracies.

Elementary Function Approximation in Metalibm
Adapting elementary function code to different domains, accuracies, and performance metrics (e.g. throughput vs. latency) is highly non-trivial and error-prone. The state-of-the-art solution for automating libm development is the Metalibm tool [3]. It is able to quickly generate code for elementary functions and to guarantee that the final accuracy is bounded by a user-given value. The produced code obviously cannot compete with the hand-written, manually optimized code of a standard libm (for target errors close to correct rounding). However, the approximations are available quickly and automatically.
Metalibm is a push-button tool that tries to generate code, on a given domain and for a given accuracy, for the evaluation of an arbitrary univariate function with continuous derivatives up to some order. There is no dictionary with a fixed set of functions; instead, the target function is specified as a mathematical expression.
Metalibm currently handles the exponential, logarithm, cosine, sine and tangent, the hyperbolic cosine, sine and tangent, as well as their inverses. There has been some work towards supporting special functions, such as Bessel functions [24]. Unfortunately, bivariate functions are out of reach at the moment.
As backend, Metalibm uses the Sollya scripting language/tool [4]. Among other things, Sollya provides state-of-the-art polynomial approximations [2]. To guarantee the bounds on evaluation errors, Metalibm uses the Gappa proof assistant [12], which uses an error analysis very similar to Daisy's.
Parameter space The goal of Metalibm is to provide more choices in implementations of mathematical functions. The most important parameters include: the required accuracy of the final code, the maximum degree d_max of the approximation polynomial, and the size t (in bits) of the table index for table-driven methods. However, Metalibm does not help to navigate this parameter space and offers no guidelines to the user. This limits the usage of the tool mostly to experts in libm development.
High-level Algorithm Metalibm generates approximations in the following steps:
- Argument reduction: First, Metalibm tries to detect algebraic properties of the function, such as parity or periodicity, in order to exploit them for an argument reduction.
- Domain splitting: It may happen that, even on the reduced argument, approximation with one polynomial of maximum degree d_max is impossible. In that case, Metalibm splits the domain and performs a piece-wise polynomial approximation.
- Polynomial approximation: Finally, Metalibm computes the polynomial approximation(s) for the reduced domain (or the subdomains after splitting) and generates Gappa proofs to account for the approximation errors and the evaluation errors due to floating-point arithmetic. Polynomials are evaluated using Horner's scheme.
At the end, Metalibm produces efficient C code implementing the target function.
Argument reduction and properties detection Exploiting algebraic properties helps to reduce the approximation domain, which in turn decreases the minimum degree of the approximation polynomial. For simple functions it is often possible to reduce the domain such that a single polynomial suffices for the approximation. Only if an implementation with one polynomial is impossible does the tool move on to domain splitting. In Metalibm, domain reduction is always prioritized over domain splitting: in case of splitting, the splitting points and the coefficients of the approximation polynomials must be stored for each subdomain. In addition, some splitting techniques yield if-else statements, which decreases the throughput of vectorized implementations. The tricky part is that Metalibm does not know analytically which function it implements, so text-book argument reduction schemes, e.g. for the exponential, cannot be applied directly. The idea is instead to verify properties numerically, up to some accuracy. Thus, the domain should be large enough to enable detection. Currently, several such properties, including parity and periodicity, are supported. Obviously, some functions, for example some compound functions, simply do not admit efficient argument reduction schemes. In this case, a piece-wise polynomial approximation is performed.
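Numerical property detection can be sketched as follows; this is our simplified sampling-based illustration, whereas Metalibm's actual procedure verifies such properties up to a rigorous accuracy:

```python
import math

def is_even(f, domain=(-4.0, 4.0), n=1000, tol=1e-12):
    """Numerically test f(-x) == f(x) on sampled points, up to tol."""
    a, b = domain
    for i in range(n):
        x = a + (b - a) * i / (n - 1)
        if abs(f(x) - f(-x)) > tol * max(1.0, abs(f(x))):
            return False
    return True

def has_period(f, p, domain=(0.0, 4.0), n=1000, tol=1e-9):
    """Numerically test f(x + p) == f(x) on sampled points, up to tol."""
    a, b = domain
    for i in range(n):
        x = a + (b - a) * i / (n - 1)
        if abs(f(x + p) - f(x)) > tol * max(1.0, abs(f(x))):
            return False
    return True

assert is_even(math.cos) and not is_even(math.sin)
assert has_period(math.sin, 2 * math.pi) and not has_period(math.exp, 1.0)
```

As the text notes, the sampled domain must be large enough: testing sin only on a tiny interval around zero, for instance, would not distinguish it from a linear function.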
Domain splitting Given a function f, a domain [a; b], and a maximum polynomial approximation degree d_max, Metalibm tries to split the domain into non-overlapping intervals such that on each of them an approximation of low degree d ≤ d_max is possible. There exist different schemes for domain splitting: uniform, logarithmic, or arbitrary. Metalibm uses a non-uniform splitting [22]. The idea is to split only when approximation with the maximum degree is impossible. The algorithm heuristically searches for the best splitting and uses the de la Vallée Poussin theorem to compute the minimal approximation degree. This method results in (almost) uniform degree usage across subdomains, which yields uniform memory usage for polynomial coefficients and stable performance. However, it also means that Metalibm uses polynomials of the maximum degree almost all the time, and rarely smaller ones. This leaves it to the user to select a suitable degree, mostly by trial and error. In addition, even slight changes to the implementation domain can lead to a completely different domain splitting and different approximation degrees.
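The split-only-when-necessary idea can be sketched as follows, with Metalibm's machinery replaced by simple stand-ins of our own: Chebyshev-node interpolation instead of Remez approximation, a sampled maximum error instead of the de la Vallée Poussin degree bound, and plain bisection instead of Metalibm's heuristic search:

```python
import math

def cheb_interp(f, a, b, d):
    """Interpolate f at d+1 Chebyshev nodes on [a, b]; return an evaluator."""
    xs = [0.5 * (a + b) + 0.5 * (b - a) * math.cos(math.pi * (2*k + 1) / (2*(d + 1)))
          for k in range(d + 1)]
    coef = [f(x) for x in xs]              # Newton divided differences
    for j in range(1, d + 1):
        for i in range(d, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    def p(x):
        acc = coef[d]
        for i in range(d - 1, -1, -1):     # Horner-like Newton-form evaluation
            acc = acc * (x - xs[i]) + coef[i]
        return acc
    return p

def max_err(f, p, a, b, n=200):
    """Sampled (not rigorous) maximum approximation error on [a, b]."""
    return max(abs(f(a + (b - a) * i / (n - 1)) - p(a + (b - a) * i / (n - 1)))
               for i in range(n))

def split(f, a, b, dmax, tol):
    """Bisect [a, b] until a degree-<=dmax interpolant meets tol on each part."""
    p = cheb_interp(f, a, b, dmax)
    if max_err(f, p, a, b) <= tol:
        return [(a, b)]
    m = 0.5 * (a + b)
    return split(f, a, m, dmax, tol) + split(f, m, b, dmax, tol)

parts = split(math.sin, 0.0, 20.0, 8, 1e-10)
assert len(parts) > 1 and parts[0][0] == 0.0 and parts[-1][1] == 20.0
```

Even this toy version exhibits the sensitivity described above: slightly changing the domain [0, 20] or the degree 8 can change the number and placement of the subdomains.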
Polynomial approximation When the domains have been reduced, Metalibm uses Sollya-generated Remez-like approximation polynomials, which are guaranteed to have the best floating-point coefficients for the given precision [2]. Polynomial evaluation is implemented using Horner's scheme, and a Gappa proof [12] is automatically generated to bound the approximation error. It should be noted that Metalibm is quite conservative in its implementations and usually leaves a margin between the error of the actual implementation and the user-given error bound. Sometimes, to improve the efficiency of the approximation, table-driven methods [29] are used: the values of the function at some points of the domain are stored in a table, and the function's algebraic properties are used to reduce the degree of the polynomial approximation. For each function, the decisions on the use of tables and their size depend strongly on the properties of the underlying hardware (size of caches, speed of arithmetic vs. memory accesses, etc.).
Reconstruction The goal of reconstruction is to map the value of the function f on the reduced argument, or on a subdomain, back to its value on the initial larger domain. When an argument was reduced using algebraic properties, the reconstruction is just an application of the inverse transformation. In case of domain splitting, reconstruction is more complicated. The common way to handle arbitrary splittings is a series of if-else statements. However, this approach prevents automatic vectorization of the generated code, so Metalibm offers a more versatile solution: it computes a special polynomial which, when evaluated on the function's argument, yields the index of the subdomain to be used. While elegant, for scalar implementations this approach requires an additional polynomial evaluation. This makes the dependency between the performance of the code, the function domain, and the approximation degree highly nonlinear.
Consider, for example, a case where on a domain [a; b] with degree d_max no domain splitting was needed. It may happen that slightly narrowing this domain requires a domain splitting for the same d_max; the implementation cost then increases by at least the cost of the polynomial evaluation for the reconstruction. On the other hand, increasing d_max by one could have avoided domain splitting altogether, at the cost of just one additional multiplication and addition.

Whole Program Approximation
In this section we describe our approach in detail. Our tool supports the approximation of straight-line expressions featuring the standard arithmetic operators (+, −, *, /) as well as a set of the most commonly used elementary functions (sin, cos, tan, log, exp, √), and attempts to approximate the latter. To avoid confusion, we will use the term 'program' for the entire expression in which the user would like to approximate elementary function calls, and 'function' for individual elementary functions. In addition to specifying the expression itself, the user of our tool also specifies the domains of all its inputs, together with a target overall absolute error which must be satisfied.

Overall Structure
We implement our approach inside the Daisy framework [8]. Daisy is built up of phases which provide different functionalities. We re-use the phases for frontend and backend handling and for roundoff error analysis, and add two new phases to support automated elementary function approximation. Figure 1 shows an overview of the overall structure. After parsing the input file, Daisy decomposes the abstract syntax tree (AST) of the program we want to approximate such that each elementary function call is assigned to a fresh local variable. This transformation eases the later replacement of the elementary functions by approximations.
As the next step, Daisy runs a roundoff error analysis on the entire program, assuming a libm implementation of the elementary functions. This analysis computes a real-valued range and a worst-case absolute roundoff error bound for each subexpression in the AST. In particular, it computes the ranges and errors of the arguments of elementary function calls, which we will use for the approximation in the next step. Now we can perform the approximation of elementary functions. For this, Daisy calls Metalibm for each elementary function which was assigned to a local variable. If Metalibm successfully computes an approximation, it generates a C implementation. Daisy extracts relevant information about the generated approximation (name of the file, name of the function, etc.) and stores it in the AST. We discuss the details of calling Metalibm in the following section.
Next, Daisy performs another roundoff error analysis, this time taking into account the new approximation's precise roundoff error bound reported by Metalibm. Finally, Daisy generates C code for the program itself, as well as all necessary headers to link with the approximation generated by Metalibm.

Approximation Phase
Several Metalibm parameters determine the accuracy and efficiency of the generated elementary function approximations. We next discuss how Daisy handles the selection of the degree of the polynomial, and the target error of each elementary function call, which are two of the most important parameters.
Error Distribution The user of our tool specifies only the error bound on the final result of each program, i.e. he or she does not have to specify the target error of each individual elementary function call separately. This is important for usability, as it is usually nontrivial even for experts to determine which error an individual elementary function should satisfy. This is especially true if there are several elementary function calls inside one program, or if a function call is not the last statement of a program. Furthermore, Daisy (like all other similar tools) measures roundoff in terms of absolute errors, whereas Metalibm takes relative errors as its input specification, so a translation is needed.
Thus, Daisy needs to distribute the total absolute error specified by the user among the potentially several elementary function calls and the remaining arithmetic operations. Let us denote the total absolute error by τ = |f(x) − f̂(x)|, where f denotes the real-valued specification of the program and f̂ the final finite-precision implementation with elementary function approximations. We denote by f̃ the real-valued version of the program in which the elementary function calls are replaced by their polynomial approximations.
Daisy conceptually decomposes the total error budget into an approximation part and a roundoff part: τ = |f(x) − f̂(x)| ≤ |f(x) − f̃(x)| + |f̃(x) − f̂(x)| = τ_approx + τ_fl. The first part (τ_approx = |f(x) − f̃(x)|) captures the error due to the elementary function approximations, assuming the arithmetic part of the program is still real-valued, i.e. without roundoff errors. The second part (τ_fl = |f̃(x) − f̂(x)|) captures the roundoff error due to the finite-precision arithmetic. This separation lets Daisy determine how much of the error budget can be spent on the approximations.
Note, however, that at this point Daisy cannot compute |f̃(x) − f̂(x)| exactly, as the approximations are not available yet. Instead, it estimates the roundoff error by performing the analysis assuming libm library function implementations. This estimate will, in general, be slightly smaller than the final roundoff error. On the other hand, Metalibm usually generates approximations whose errors are smaller than what was asked for. For this reason, Daisy performs a second round of error analysis after the approximations have been determined; in this second round, all errors (smaller or larger) are correctly taken into account.
If the program under consideration contains several elementary function calls, the approximation error budget needs to be distributed among them. For this, we consider two distribution strategies:
- With the equal distribution strategy, Daisy distributes the error equally among the elementary function calls. That is, for n function calls, each call receives an error budget of τ_i = τ_approx / n.
- With the derivative-based strategy, we take into account that the errors introduced by elementary function approximations may be propagated differently through the remaining program: some errors may be magnified more than others. It thus seems reasonable to allocate more of the error budget to those function calls whose errors are potentially magnified more. We estimate the error propagation by symbolically computing the derivative w.r.t. the elementary function call, and by bounding it w.r.t. the variable ranges specified. The derivative bounds for the different function calls are normalized to sum to one, giving a weight w_i for each call. The error budget for each call is then computed as τ_i = τ_approx * w_i.
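The two strategies can be sketched as follows (the function names are ours, not Daisy's API); the derivative bounds would come from the symbolic derivative computation described above:

```python
def equal_distribution(tau_approx, n):
    """Split the approximation budget equally among n elementary calls."""
    return [tau_approx / n] * n

def derivative_distribution(tau_approx, deriv_bounds):
    """Weight each call's budget by its normalized derivative bound w_i."""
    total = sum(deriv_bounds)
    return [tau_approx * d / total for d in deriv_bounds]

# Example: two calls, where the second call's error is magnified 3x more,
# so it receives 3/4 of the budget.
budgets = derivative_distribution(1e-13, [1.0, 3.0])
assert abs(sum(budgets) - 1e-13) < 1e-27
assert abs(budgets[1] - 3 * budgets[0]) < 1e-27
```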

Total Error vs. Local Error Consider a program with two elementary function calls. Expanding the approximation error τ_approx, we obtain |f(x) − f̃_2(x)| ≤ |f(x) − f̃_1(x)| + |f̃_1(x) − f̃_2(x)|, where f̃_1 denotes the program with only the first elementary function call replaced by an approximation, and f̃_2 the program with both calls replaced. The error distribution assigns the error budgets τ_1 and τ_2 to the first and second term, i.e. to the first and second elementary function call, respectively. These budgets represent the total error due to each elementary function call at the end of the program. For calling Metalibm, however, we need the local error at the function call site. Due to error propagation, these two errors can differ significantly, and ignoring this difference may lead to overall errors which exceed the error bound specified by the user.
We solve this issue by computing an approximation of the local error from the global error budget τ_i. The local error gets propagated through the remaining program and can thus be magnified or diminished. We compute a linear approximation m_i of this propagation factor and then obtain the local error ε_i as ε_i = τ_i / m_i, where m_i bounds |∂f/∂x_i| over the specified input domain and x_i is the local variable to which the elementary function call is assigned in the decomposition phase. Daisy computes the partial derivatives symbolically and maximizes them over the specified input domain. We proceed similarly for all elementary functions in the program, processing them in the order in which they appear. This order is important, as we want to compute the partial derivatives on real-valued programs (it is not clear how to compute derivatives of approximations), and we compute the derivative of the part of the program which follows the function call. Thus, when processing the first call, all following calls are still real-valued and the derivative computation is possible. For each subsequent call, the previously introduced approximations precede it, so the derivative computation can still be performed over the reals.
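The computation of ε_i can be sketched as follows; for illustration we bound the derivative by (unsound) numerical sampling, whereas Daisy maximizes symbolic partial derivatives rigorously over the input domain:

```python
import math

def deriv_bound(g, lo, hi, n=1000, h=1e-6):
    """Sample an (illustrative, not rigorous) bound m_i on |dg/dx| over [lo, hi]."""
    return max(abs((g(x + h) - g(x - h)) / (2 * h))
               for x in (lo + (hi - lo) * i / (n - 1) for i in range(n)))

def local_error(tau_i, g, lo, hi):
    """eps_i = tau_i / m_i: the error budget Metalibm is actually called with."""
    return tau_i / deriv_bound(g, lo, hi)

# Program res = 3 * t + 1 with t = sin(x): the error of the sin call is
# magnified by |d(res)/dt| = 3, so the local budget shrinks by a factor 3.
eps = local_error(1e-13, lambda t: 3 * t + 1, -1.0, 1.0)
assert abs(eps - 1e-13 / 3) < 1e-18
```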

Degree Selection
The target error is one important parameter which significantly influences the efficiency of the approximations generated by Metalibm. Another is the degree of the approximation polynomial. Unfortunately, it is not easy to predict which degree is optimal, as the efficiency also depends nonlinearly on the domain of the function call, on the function to be approximated itself, on the internal logic of Metalibm, etc.
Daisy thus performs a linear search, starting from a small degree. The search stops either when the (coarsely) estimated running time reported by Metalibm is significantly higher than the current best, or when Metalibm times out. The approximation with the smallest estimated running time is returned.
We have empirically observed that for functions which are monotone (e.g. exp, log, √, tan), small degrees often result in efficient approximations, whereas for non-monotone functions (e.g. sin, cos), higher degrees are necessary. For the former class, we thus search the degrees 4, 8, 12, and 16; if the function to be approximated contains sin or cos, we search the degrees 12, 16, 20, and 24. We have observed that steps of four are a good compromise between the efficiency of the implementations and the efficiency of the analysis.
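The degree search can be sketched as follows, with Metalibm mocked by a hypothetical cost function metalibm_cost (our stand-in, not a real API) that returns the estimated running time, or None to model a timeout:

```python
def select_degree(metalibm_cost, degrees, blowup=1.5):
    """Linear search over degrees; stop on timeout or a clear slowdown."""
    best_deg, best_cost = None, float("inf")
    for d in degrees:
        cost = metalibm_cost(d)          # None models a Metalibm timeout
        if cost is None:
            break
        if cost < best_cost:
            best_deg, best_cost = d, cost
        elif cost > blowup * best_cost:  # significantly worse: stop early
            break
    return best_deg

# Monotone functions would search 4, 8, 12, 16; sin/cos: 12, 16, 20, 24.
costs = {4: 90, 8: 60, 12: 75, 16: 130}  # hypothetical time estimates
assert select_degree(costs.get, [4, 8, 12, 16]) == 8
```

The early-exit threshold (here a hypothetical factor of 1.5) stands in for "significantly higher than the current best" from the text.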
Depth of Approximation Another parameter with potentially high impact is what we call the depth of the approximation. We can ask Metalibm to approximate individual elementary functions, but it can also approximate compound functions (e.g. sin(cos(x) − 1)) directly. If successful, this has the advantage that only one polynomial approximation is generated and only one function call is involved. In addition, the Metalibm-generated code often wins in accuracy over standard consecutive calls to libm, thus providing a larger error budget to be traded for performance. Finally, Metalibm can overcome undesired effects, e.g. the cancellation at zero for cos(x) − 1 above. This is a highly application-specific parameter, and we thus leave it under user control. Daisy provides automation in the sense that the user only needs to specify the desired depth as a number, where depth means the height of the abstract syntax tree.
Other Parameters Metalibm provides several other parameters, such as the width of the table index for table-driven methods, the minimum width of the subdomains for splitting, etc. Our current set of benchmarks uses the standard set of elementary functions (exp, log, sqrt, sin, cos, tan), which are relatively "well-behaved". We thus leave the other parameters at their default values, which seem to perform well for our set of functions. For more demanding functions [24,3], fine-grained manual tuning by experts is required; automating such specialized functions is out of the scope of this work.

Code Generation
Metalibm generates C code for the elementary function approximations, and Daisy generates code for the arithmetic part of the program. These two need to be linked for a usable implementation. Daisy generates the necessary headers, as well as a compilation script for this purpose. This script inlines all approximations, as we have observed that this is necessary for performance. libm calls also get inlined with the -O2 compiler flag which we use for compilation.

Experimental Evaluation
We evaluate our approach on a number of benchmarks from the literature in terms of accuracy and performance.
Benchmarks The benchmarks forwardk2j* are part of the Axbench approximate computing benchmark suite [38]. Inspired by this benchmark, we have included axisRotation*, which rotate the x-y axes in Cartesian coordinates counterclockwise by theta degrees. RodriguesRotation is a more involved formula for rotating a vector in space⁶. The pendulum benchmarks have been previously used in the context of roundoff error verification [9]. The benchmarks sinxx10, xu1, xu2, integrate18257, integrateStoutemyer are from the benchmark set of the COPRIN project⁷. The benchmarks ex2* and ex3_d are problems from a graduate analysis textbook. We show all benchmarks used in the appendix; while they are relatively short, they represent important kernels which usually employ several elementary function calls.
We have based the target error bounds on the roundoff errors obtained for an implementation which uses libm library functions. For each benchmark we have three sets of target errors: small, middle and large, which are roughly two, three and four orders of magnitude larger than the libm-based bound, respectively. Table 2 shows the error bounds of the libm-based implementations, as well as the final error bounds obtained by the implementations with approximations. Note that the table shows not the target error bounds but the actually achieved ones; the exact target error bounds are listed in the appendix.
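The relation between the tiers can be sketched as follows; the baseline bound used here is hypothetical, and the helper is ours, not part of Daisy.

```c
/* Derive a target error tier from a libm-based roundoff bound:
   tier 0 = "small"  (~2 orders of magnitude above the baseline),
   tier 1 = "middle" (~3 orders of magnitude),
   tier 2 = "large"  (~4 orders of magnitude). */
double target_error(double libm_bound, int tier) {
    double factor = 1e2;                    /* "small" starts at 100x */
    for (int i = 0; i < tier; i++)
        factor *= 10.0;                     /* one more order of magnitude per tier */
    return libm_bound * factor;
}
```

For a hypothetical libm-based bound of 1.5e-15, this yields target errors of roughly 1.5e-13, 1.5e-12 and 1.5e-11 for the three tiers.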
Performance Benchmarking To provide a fine-grained comparison, we measure performance in terms of processor clock cycles, using the RDTSC instruction available on x86 architectures. This instruction returns a time-stamp counter which is incremented every clock cycle. We ensure that the instruction pipeline is flushed to avoid out-of-order execution, and we perform cache warm-up.
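A minimal sketch of such a cycle-count measurement is shown below (x86-64 only; this is our illustration, not the paper's exact harness). The lfence serializes instruction execution so that out-of-order execution cannot move the counter read across the measured region, and a warm-up call fills the caches before the measured run.

```c
#include <stdint.h>

/* Read the time-stamp counter, serialized against out-of-order execution. */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Measure one call of f(x) in cycles; the first call serves as warm-up. */
uint64_t measure_cycles(double (*f)(double), double x, double *result) {
    *result = f(x);                  /* warm-up: touch code and data */
    uint64_t start = read_tsc();
    *result = f(x);                  /* measured run                 */
    uint64_t end = read_tsc();
    return end - start;
}
```

In practice one would repeat the measured run many times and aggregate, as described next.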
Daisy provides all the necessary infrastructure for test-case code generation and automatic benchmarking. Each benchmarking executable runs the Daisy-generated code on 100000 random inputs from the input domain. Of the measured numbers of cycles we discard the highest 10%, as we have observed these to be outliers. For the remaining ones, we compute the average, minimum and maximum number of cycles. Finally, we perform 5 benchmarking runs for each test case and report the averaged results. Table 3 shows the performance improvements of approximated code w.r.t. libm-based implementations of our benchmarks. We show improvements for the following settings:

Performance Improvements
- small, middle and large target errors with an equal error distribution and the table index width set to 8 bits (which enables table-based methods), approximating each elementary function call individually ("equal"),
- large target errors with a derivative-based error distribution (table index width 8) and individual approximation ("deriv"),
- large target errors with an equal error distribution and individual approximation, but with no table (table index width set to 0) ("no table"),
- large target errors with approximation of compound functions with depth 1 ("depth 1"),
- large target errors with approximation of compound functions with as much depth as possible, i.e. until the expression is no longer univariate ("depth ∞").

For the benchmark ex2_3, Metalibm timed out for all elementary function calls. For the compound approximations, we only report performance improvements for functions which actually contain compound calls. We observe that our tool can generate code with significant performance improvements for most functions. As expected, the improvements are smallest for the tightest target error bounds and largest for the largest ones: on average 2.7% to 14.4%. If we consider only those cases where the approximation was successful, i.e. where the performance is better than that of a libm implementation (otherwise we can simply use libm), the average improvements are between 9.2% and 16.2%. We further note that for several benchmarks, improvements of more than 20% (up to 27%) are possible. Comparing the performance improvements to our estimates of the portion of running time spent in library calls (see Table 1), we see that we can recuperate much of this overhead (in exchange for lower accuracy, see Table 2).

Table 3. Performance improvements (in percent) of approximated code w.r.t. a program with libm library function calls.
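The outlier handling in the benchmarking protocol above (discard the highest 10% of measured cycle counts, then compute average, minimum and maximum of the rest) can be sketched as follows; the function names are illustrative, not Daisy's actual code.

```c
#include <stdlib.h>

/* Comparison function for qsort over unsigned long long cycle counts. */
static int cmp_ull(const void *a, const void *b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Sorts cycles in place, drops the highest 10% as outliers, and
   reports average, minimum and maximum of the remaining counts.
   Assumes n >= 10. */
void trimmed_stats(unsigned long long *cycles, size_t n,
                   double *avg,
                   unsigned long long *min,
                   unsigned long long *max) {
    qsort(cycles, n, sizeof *cycles, cmp_ull);
    size_t keep = n - n / 10;            /* keep the lowest 90% */
    unsigned long long sum = 0;
    for (size_t i = 0; i < keep; i++)
        sum += cycles[i];
    *avg = (double)sum / (double)keep;
    *min = cycles[0];
    *max = cycles[keep - 1];
}
```

For example, on ten measurements where one run took 100 cycles and the rest 10 through 90, the 100-cycle run is discarded and the reported maximum is 90.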
Somewhat surprisingly, we did not observe an advantage of the derivative-based error distribution over the equal one. We suspect that this is due to the nonlinear nature of Metalibm's heuristics; we thus leave the equal error distribution as the default. Table 3 further demonstrates that the use of tables generally improves performance. Only for the benchmark xu1 do we observe a slightly better performance when not using a table. Furthermore, we observed in our tests that for our magnitudes of target errors, increasing the table sizes often leads to slower implementations (not shown in the table): the tables then no longer fit in the L1 or L2 cache, and memory access time dominates the computation time.
We can clearly observe the nonlinear relation between the performance and the target errors of the approximations (for equal distributions these grow linearly with the small/middle/large program errors). For instance, for the axisRotationX and rodriguesRotation benchmarks, increasing the target errors from small to middle provides a clear improvement, but passing from middle to large leads to worse performance. This is the result of discrete decisions concerning the approximation degrees and the domain splittings inside Metalibm. For those benchmarks which contain compound functions, we observe that it is generally beneficial to approximate 'as much as possible'. However, for some benchmarks, like sinxx10, Metalibm cannot provide any improvement. In general, we noticed that it has trouble approximating the sine function and its powers.
Finally, we also considered implementations of our benchmark set in single floating-point precision, even though Metalibm does not support single-precision implementations (but can target large errors). We considered the following setting: the program target errors are scaled to single precision; inputs, outputs and arithmetic operations are in single precision; and the approximation code is just accurate enough to guarantee that casting its double-precision result to single precision is sufficiently accurate. On average we observe that a slight performance improvement is still possible, sometimes even reaching 18%. However, to achieve performance improvements comparable to those of double-precision code, we would need single-precision code generation from Metalibm.

Table 4 presents the average, minimum and maximum number of cycles per program execution, for the libm-based implementations and for our new approximations. Our implementations in the majority of cases provide a significantly smaller variance between the minimum and maximum execution times. For large target errors, the maximum execution time of our code is often even smaller than the best timings of the libm-based implementations. For instance, for the forwardk2jY benchmark, our derivative-based implementation is not only 17% faster on average, but its maximum execution time is also considerably smaller than that of the libm-based code.

Table 5. Analysis and approximation times for our benchmarks.

                     small    middle   large
  benchmark          equal    equal    equal     deriv    no table  depth 1  depth ∞
  sinxx10            30s      29s      32s       33s      28s       7m 58s   6m 56s
  xu1                6m 12s   4m 0s    10m 55s   8m 7s    10m 30s   10m 5s   8m 51s
  xu2                15m 56s  7m 33s   3m 39s    3m 44s   3m 39s    3m 6s    4m 48s
  integrate18257     10m 1s   6m 49s   9m 1s     3m 24s   10m 19s   9m 1s    6m 52s
  integStoutemyer    8m 23s   5m 25s   5m 19s    4m 18s   4m 56s    5m 39s   1m 10s
  axisRotationX      8m 58s   10m 42s  9m 13s    10m 45s  9m 11s    -        -
  axisRotationY      10m 46s  10m 44s  8m 22s    4m 2s    7m 43s    -        -
  rodriguesRotation  11m 5s   10m 43s  7m 28s    6m 14s   7m 30s    6m 57s   6m 59s
  pendulum1          30s      31s      28s       28s      26s       -        -
  pendulum2          1m …

Analysis Time
The time our tool takes for analysis and approximation is shown in Table 5. The analysis time depends strongly on the number of required approximations of elementary functions: each approximation requires a separate call to Metalibm, whose running time in turn depends on the problem definition. Daisy reduces the number of calls to Metalibm by common expression elimination, which improves the analysis time. Currently, we set the timeout for each Metalibm call to 3 minutes, which leads to an overall analysis time that is reasonable, at most 16 minutes for our benchmarks. We found this to be a suitable bound for our relatively "well-behaved" functions: if Metalibm does not find an approximation within a few minutes, it usually means that it "wobbles" between domain splittings.

Related Work
Floating-point Analysis and Optimization Several static analysis tools exist which bound roundoff errors of floating-point arithmetic computations, including elementary functions [26,28,35], but all assume libm library implementations and do not attempt to optimize the running time. Lee et al. [25] automatically verify the correctness of several existing libm implementations in Intel's math library (exp, sin, tan, and log), but do not consider any approximations.
Mixed-precision tuning [32,5,23,10,6] selects different floating-point precisions for different arithmetic operations such that an overall error bound is satisfied and performance is improved. This work has a similar premise as ours, however only considers the precision of arithmetic operations, and not of elementary functions. In practice, mixed-precision tuning is most helpful when an accuracy bound is desired which is close to what uniform precision can provide. In contrast, our approximation of elementary functions operates in a different part of the tradeoff space.
The tool Herbie [30] and the approach of Damouche et al. [7] perform a different kind of optimization: instead of aiming to improve performance, they aim to increase the accuracy of a computation by leveraging the non-associativity of floating-point arithmetic.
Another way to improve the performance of (numerical) computations is autotuning, which performs low-level, real-value semantics-preserving transformations of a program in order to find one which empirically executes most efficiently [31,36]. These approaches optimize for different hardware platforms and do not consider reducing accuracy in exchange for performance.

Elementary Function Approximation
The evaluation of elementary functions is highly dependent on the underlying technology. The theoretical support followed the needs of evolving hardware and software, but proceeded in a rather function-by-function manner. A good overview of existing approaches can be found in [29]. However, to the best of our knowledge, Metalibm is the first and so far the only project aiming to automate libm development while providing strong accuracy guarantees.
Approximate Computing Approximate computing operates on the premise that many applications are tolerant to a certain amount of noise and errors and thus high accuracy is not always needed. Many techniques which trade accuracy for efficiency have been developed in the recent past and are for instance surveyed in [37].
Of particular interest to this project is work such as STOKE [34], which uses MCMC search to find reduced-precision implementations of short numerical kernels. Due to the stochastic search, however, the approach does not guarantee error bounds. The verifiable version of STOKE [33], which does not aim to reduce precision, could potentially be used to further improve the implementations generated by our tool.
The sketching synthesis technique has been used to, among others, synthesize polynomial approximations of programs with elementary functions [1]. Correctness, and in particular accuracy, is only checked on a small set of sample inputs, and thus no guarantees can be provided.
Similarly, neural networks have been used to learn approximations of numerical programs [13], which can be efficiently executed on custom hardware, but again do not provide accuracy guarantees.

Conclusion
We presented a fully automated approach and tool which approximates elementary functions inside small programs and provides rigorous whole-program error guarantees which take into account both approximation as well as roundoff errors. Our results show that it is possible to achieve significant performance improvements in exchange for reduced, but guaranteed, accuracy.
Elementary function approximations are challenging and their efficient implementations rely on a careful selection of many different parameters. Metalibm selects some of these parameters, and our combination with Daisy's error analysis and infrastructure allows our tool to select two more: correct error bounds of individual elementary function calls, as well as the degrees of polynomial approximations. This combination allows even non-experts to use rigorous elementary function approximations in their programs.