The Harmonized Parabolic Synthesis Methodology for Hardware Efficient Function Generation with Full Error Control

The Harmonized Parabolic Synthesis methodology is a further development of the Parabolic Synthesis methodology for approximation of unary functions such as trigonometric functions, logarithms and the square root with moderate accuracy for ASIC implementation. These functions are extensively used in computer graphics, communication systems and many other application areas. For these high-speed applications, software solutions are not sufficient, and a hardware implementation is therefore needed. The Harmonized Parabolic Synthesis methodology has two outstanding advantages: it is parallel, thus reducing the execution time, and it is based on low complexity operations, thus being simple to implement in hardware. A difference compared to other approximation methodologies is that it is a multiplicative, and not additive, methodology. Compared to the Parabolic Synthesis methodologies it is possible to significantly enhance the performance in terms of reducing chip area, computation delay and power consumption. Furthermore, it increases the possibility to tailor the characteristics of the error, improving conditions for subsequent calculations. To evaluate the methodology, the fractional part of the logarithm is implemented and its performance is compared to the Parabolic Synthesis methodology. The comparison is made with 15-bit resolution. The design implemented using the proposed methodology performs 3× better than the Parabolic Synthesis implementation in terms of throughput, while consuming 90% less energy. The chip area is 70% smaller than for the Parabolic Synthesis methodology. In summary, the new technology further increases the advantages of Parabolic Synthesis.


Introduction
Computation of elementary functions, such as trigonometric functions, logarithms and the square root, as well as binary functions, such as division, are numerous in a multitude of applications.A trend in wireless communication systems is handheld applications with very high data rates and moderate accuracy in the computations for ASIC implementation.With such applications follows also requirements on small chip area and low power consumption.Such an emerging application is wireless communication systems with multiple antennas on the transmitter and receiver, known as Multiple-Input Multiple-Output (MIMO).The great interest in MIMO falls back on its ability to cost-effectively improve transmission performance.To accomplish the high data rates these systems are performing an extensive amount of computation.An essential part of the computation is spent on matrix inversions, which are often executed as QR decompositions in which very high throughput is needed.An example of the required levels of performance for a QR decomposition is described in [1] for an Ordered Successive Interference Cancellation (OSIC) detector.The computation performance needed in the OSIC detector is 1.56 million inversions of complex-valued 4 × 4 channel matrices per second.Since each inversion uses 12 computations of trigonometric functions, 12 × 1,560,000 = 18.72 million computations per second of the trigonometric functions sine and cosine have to be executed.For these numerically intensive real-time applications, software routines -although they are capable of providing extremely accurate results -are too slow.Therefore, for the future of high-speed wireless communication, there is a significant interest for new development regarding hardware implementations of function generators.
In ASIC implementations of function generators, the control of the error (i.e., of its distribution and characteristics) is a neglected area.Full control of these parameters is indeed necessary in order to reach optimal performance in terms of small chip area, short critical path delay and low power consumption.Optimizing the word lengths and achieving a normally distributed error in the result from an approximation will enable reduced chip area, critical path delay and power consumptionnot only in the circuitry of the approximation itself but, even more so, in the rest of the algorithm, which uses the approximation as input.The main reason for this is that it will reduce the effects of the accumulated error in the following parts, which otherwise would have led to a necessary increase in word lengths in the computations to handle this effect.
Hardware computation of elementary functions can be performed by employing many different algorithms [2,3], such as table-based methods, polynomial and rational approximations and functional iteration methods.Table-based methods remain manageable for low precision computations as long as the input operand is up to 12-16 bits, corresponding to table sizes of 4 K-64 K words.The size of the table grows exponentially with the word length and becomes unacceptably large when operating with higher precision on an ASIC.An alternative way of making approximations is based on polynomials.Since polynomials only involve additions, subtractions and multiplications, using them is a natural way to approximate elementary functions.There are a number of polynomial schemes available for polynomial approximations, such as Taylor, Maclaurin, Legendre, Chebyshev, Jacobi, and Laguerre.For a given precision, the chosen polynomial scheme affects the number of terms included and thus the computational complexity.Two strategies are available in developing an approximation, one to minimize the average error, called least squares approximation, and one to minimize the worst case error, called least maximum approximation [2].An example of when least squares approximation is favorable is when the approximation is used in a series of computations.On the other hand, least maximum approximation is favorable when it is important that the maximum error to the function to be approximated is kept small.An example of when least maximum approximation is favorable is when the error from the approximation has to be within a limit from the true function value.An advantage of polynomials is that they are table-less, but their drawback is that they impose large computational complexities and delays [2].A reduction in computational complexity -and to some extent also in delay -can be accomplished by combining table-based methods with polynomial based ones [2].
For implementation of elementary functions in hardware, the approximation methodology of sum of bit-products [4] can be beneficial since it can give an area efficient implementation with a high throughput at a reasonable accuracy.
The commonly used COordinate Rotation DIgital Computer (CORDIC) algorithm [5,6] is an iterative algorithm.The benefit of the algorithm is that approximations of basic elementary functions only require small look-up tables, simple shifts and addition operations.However, since it is an iterative method, it is inherently slow and therefore insufficient for very high performance applications.As shown in [ [7], p. 125], CORDIC requires several extra bits of accuracy in order to cope with the error in the approximation.
The Parabolic Synthesis methodology [8][9][10][11] is founded on a multiplicative synthesis of factors, each of them a secondorder function.The more factors that are used, the higher the accuracy of the approximation is.In fact, the most fundamental difference in the Parabolic Synthesis methodology compared to many other approximation methodologies, like polynomial and CORDIC, is that it is a multiplicative and not an additive methodology.With the introduction of the Parabolic Synthesis methodology, the following improvements were accomplished compared to CORDIC.First, due to a highly parallel architecture, a significant reduction of the propagation delay was achieved, which also led to a significant reduction of the power consumption.Second, the Parabolic Synthesis methodology allows full control of the characteristics and distribution of the error, which opens an opportunity to design with shorter word lengths and thereby gain in area, speed and power.As reported in [12], a further improvement of the Parabolic Synthesis methodology was developed by combining it with second-degree interpolation [3,13,14].
When developing a multiplicative approximation based on such a combination of the methodologies, the first factor is computed using the Parabolic Synthesis methodology.It is a rough approximation of the target function.The second factor is computed using a Second-Degree Interpolation methodology where the number of intervals in the interpolation decides the accuracy of the approximation.The approximation can be looked upon as a synthesis of two second-degree polynomials with a different variable in each polynomial.In the first factor, the variable is over the total interval, whereas in the second factor the variable is over a sub interval.As a summary, the approximation is computed as a polynomial with two variables and with the total degree of 2 + 2 = 4.An attractive feature with the Parabolic Synthesis methodology combined with Second-Degree Interpolation [12] is that it enables a rough internal error adjustment as part of the approximation.The error compensation can be performed in order to improve the distribution of the error to suit the properties required of the approximation.Compared to the Parabolic Synthesis methodology the Parabolic Synthesis methodology Combined with Second-Degree Interpolation accomplishes the following [12]: A reduction in chip area because the number of second-order functions is reduced to two; a reduction of the critical path delays because the number of second-order functions is reduced to two; a reduction in power consumption because of reduced chip area and critical path delay.And last, the improved possibilities to tailor the distribution of the error.
This paper proposes an extension of the Parabolic Synthesis methodology Combined with Second-Degree Interpolation.This extension is named the Harmonized Parabolic Synthesis methodology because the two parts in this new methodology are developed as a unity in order to improve and optimize performance such as chip area, critical path delay and power consumption.The new methodology allows the exclusion of up to two additions and two multiplications in the algorithm, which is a considerable improvement.Another advantage with the Harmonized Parabolic Synthesis methodology is that it does not have the limitations of the Parabolic Synthesis methodology Combined with Second-Degree Interpolation when it comes to implementing roots, division and inverse roots.
The feasibility of Parabolic Synthesis has been verified by implementing a vast range of unary functions, as shown in [10].With the Harmonized Parabolic Synthesis the boundary conditions for the functions to be approximated are relaxed, which implies that more extreme functions also can be carried out.
A dilemma that appears when comparing different approximation methods is that such comparisons need to be done in the same context in order to be conclusive.The Parabolic Synthesis methodologies allow full control of the characteristics and distribution of the error, something that is very seldom the case in traditional approximation methods.Using accuracy as the only comparison criterion is not enough; other characteristics of the error, most notably its distribution, are equally important.An illustrative example is the comparison in [12] between an implementation based on one of the Parabolic Synthesis methodologies and an implementation performed with the CORDIC method (which, by far, is the most commonly used approximation method today).It is shown that the implementation performed using Parabolic Synthesis combined with seconddegree interpolation outperforms the CORDIC implementation in all aspects.In addition, as shown in [7, p. 125], CORDIC requires several extra bits of accuracy in order to cope with the error in the approximation; this will further degrade the performance of the CORDIC implementation.
The remaining part of this paper is organized as follows: Section 2 describes the Harmonized Parabolic Synthesis methodology; Section 3 describes the general structure of the hardware architecture resulting from the methodology; Section 4 proposes a general strategy for truncation as well as optimization, which, if combined, can have a positive impact on the characteristic of the error; Section 5 presents the implementation of the logarithm in Harmonized Parabolic Synthesis and the distribution of its error; Section 6 gives a comparison of implementations performed in the methodologies Parabolic Synthesis and Harmonized Parabolic Synthesis.The comparison is made with respect to chip area, critical path delay and power consumption; and Section 7 closes the paper with conclusions.

Harmonized Parabolic Synthesis
The Harmonized Parabolic Synthesis methodology is founded on two factors, both in the form of second-order parabolic functions, called the first sub-function, s 1 (x), and the second sub-function, s 2 (x).When recombined, as shown in (1), they equal the original function f org (x).
In (1) the first sub-function, s 1 (x), is a second-order parabolic function, which in conjunction with the second subfunction, s 2 (x), develops an approximation of the original function, f org (x).The second sub-function is a second-degree interpolation [3,13,14] specifically shown in [12], where the number of intervals in the interpolation decides the final accuracy of the approximation and allows the distribution of the error to be tailored.When developing an approximation with the proposed methodology, an empirical design methodology is the only feasible approach since the sub-functions are designed as a complete unity to fulfill the design criteria.

Requirements on the Original Function
To facilitate the development it is required that the function to be approximated is being normalized.When normalized the function has to satisfy the requirement that the values are in the interval 0 ≤ x < 1.0 on the x-axis and 0 ≤ y < 1.0 on the yaxis as well as have the starting point at (0,0), as illustrated in Fig. 1.The normalization of the function to be approximated creates the original function, f org (x).
Finally, the original function, f org (x), divided by the first sub-function, s 1 (x), must have a limit value when x goes towards 0. Otherwise the function to be developed, the second sub-function, s 2 (x), has no starting point when x = 0 and therefore cannot be developed.
Compared to Parabolic Synthesis, the number of criteria on the function to perform an approximation on has decreased from three to one.The two criteria that have been excluded have instead been transformed into recommendations.Thereby, the number of functions that can be approximated with the methodology is increased.The third criterion which restricts the limit of f org (x) divided by s 1 (x) when x goes towards 0 is eliminated because the Harmonized Parabolic Synthesis methodology can handle these limits.

The First Sub-Function
The first sub-function, s 1 (x), is a second-order parabolic function as shown in (2).
It can be seen that the two leftmost terms form a linear function with a constant l 1 and a gradient k 1 , while the rightmost term is the nonlinear part.As described above, the starting point for the first sub-function, s 1 (x), is in (0,0).This gives the start value, l 1 , to be 0. Furthermore, the gradient, k 1 , is 1 since the function starts in (0,0) and ends in (1,1).The first sub-function, s 1 (x), can therefore be rewritten according to (3).

The Second Sub-Function
The second sub-function, s 2 (x), is based on splitting the function in intervals to perform an interpolation in each of them.
The procedure when developing the second sub-function is to perform a division of the original function, f org (x), with the first sub-function, s 1 (x).This division generates the help function, f help (x), as shown in (4).
From the help function, f help (x), the second sub-function, s 2 (x), is developed as a second-degree interpolation, as shown in (5) and Fig. 2, where the number of intervals in the interpolation decides the order of the accuracy of the approximation.
To simplify the hardware representation of the interval, i, in hardware, the number of equal-range intervals in the second sub-function, I, is chosen as 2 to the power of w, where w is a natural number, as shown in (6).
To simplify the normalization of the interval of x w , it is selected as an exponential scaling by 2 of x where the integer part is removed.The normalization of x is therefore made by multiplying x with 2 w .Furthermore, the integer part is dropped, which gives x w as a fractional part of x, as shown in (7).
In hardware the normalization, x w , is simply performed as a truncation of the w most significant bits of x.This truncation performs normalization to the interval, as shown in Fig. 2. The dropped integer part from the normalization is used to decode the interval in which the second sub-function is performed and is therefore synonymous with the index i in the sub-function, as shown in Fig. 2.
The index i is defined as the number of the interval of the interpolation, starting with 0 and ending with I − 1.In (5), l 2,i is the starting point of an interval of the interpolation.This is computed by inserting the value of x for the starting point of the interval, x start,i , of the help function, f help (x), (4) as shown in (8) and Fig. 2.
Figure 1 The conditions for the original function.Eq. ( 8) does not apply on the start value of the first interval, which has to be calculated as the limit according to (9).Since the x 2 term in (9) goes faster towards 0 than the x term, it can be excluded when calculating the limit, as shown in (9).
In (5), k 2,i is the gradient for an interpolation interval i.The gradient k 2,i for an interval is computed as the end point value of the help function, f help (x end,i ), subtracted with the start point value of the help function, f help (xstart.i).Since the interval is normalized to 1 the denominator when computing the gradient, k 2,i , is set to 1, and therefore no division is needed, as shown in (10) and Fig. 2.
In (5) the coefficient, c 2,i , in an interval, i, of the second sub-function, s 2 (x), is pre-computed so that the second subfunction in an interval, s 2,i (x w ), cuts the help function, f help (x), in the middle of the interval when x w = 0.5.This satisfies the point x middle,i for the help function, f help (x), as shown in (11) and Fig. 2.
The sub-function in (5) can be simplified according to (12).
In Eq. ( 14), it is shown how the sub-function, s 2,i (x), is divided into partial functions.
Note that, in (14), x has been changed to x w .The change is because the intervals for the partial sub-functions, s 2,i (x), in (14) have equal ranges.

Simultaneous Development of the two Sub-Functions
The foundation of the Harmonized Parabolic Synthesis methodology is to approach the development with a more holistic view.This is expressed in that the development of the two subfunctions is made simultaneously.When developing an approximation of an original function, f org (x), the first and second sub-function are looked upon as one device.While, in the Parabolic Synthesis methodology, the first sub-function, s 1 (x), was developed to have as good conformity as possible with f org (x), the objective of the Harmonized Parabolic Synthesis methodology is rather to develop the first sub-function in such a way that the product of the two sub-functions gives a good conformity to the original function.This includes that the distribution of the error is to be favorable and the hardware implementation as simple as possible.The approach when developing the first sub-function, s 1 (x), can, in contrast to the Parabolic Synthesis methodology, not be based on independent analytical calculations since it is dependent on the performance of the second sub-function, s 2 (x).Therefore, the coefficient c 1 in the first sub-function, s 1 (x), has to be determined by, for different values of the coefficients, calculating the maximum absolute error, f error (x), between the approximation and the original function according to (15).
When developing the second sub-function it is dependent on the first sub-function as shown in (4), since the second subfunction is developed from the help function, f help (x).As shown in (3), only the coefficient c 1 has to be determined when developing the first sub-function.To perform the calculation of the absolute error, f error (x), the second sub-function, s 2 (x), has therefore to be made dependent on the coefficient c 1 in the first sub-function, s 1 (x), as shown in (3) to ( 5) and ( 8) to (11).The calculation is interesting only as an indication of how the absolute error, f error (x), depends on the coefficient c 1 .When choosing the coefficient c 1 it has to be made with regard to both the behavior of the error of the approximation and the efficiency of the hardware implementation.The number of intervals in the second sub-function, s 2 (x), needs to be increased to achieve the intended accuracy; this has also to be taken into account when performing the calculation of the error.As an example, Fig. 3 shows the bit accuracy for the sine function, shown when using 1, 2, 4 and 8 intervals in the second sub-function, s 2 (x), with different values of the coefficient, c 1 .In our example implementation of the fractional part of the logarithm later in this paper, further details of how these calculations are done are given.
Based on Fig. 3, values of the coefficient c 1 are chosen to allow an efficient implementation of the hardware.As shown in (3), c 1 is fed into a multiplier, why choosing a value that is a power of two is desirable.As an example, shown in Fig. 3, the desired accuracy is 15 bit.To accomplish this, at least four intervals are necessary.As shown in Fig. 3, only c 1 = 1.0 for four intervals is interesting since it is a power of two.If increasing the number of intervals to eight, the coefficient c 1 can be chosen to zero.This is interesting since this will exclude the multiplication in (3).The behavior of the bit accuracy in Fig. 3 results from the approximation using different values of the coefficient c 1 .From this, the number of intervals and the value on the coefficient c 1 used in the design can be selected.

Hardware Architecture
The hardware architecture resulting from the methodology can be divided into three parts as shown in Fig. 4, following a principle described by Tang [15].In the preprocessing part, the incoming operand is transformed to fit the processing part, in which the approximation is performed.In the postprocessing part, the result is transformed to the desired output format.
In most cases, preprocessing of the operand means normalization, but the transformation of the operand can also be more comprehensive such as converting a fixed-point number into a floating-point number.If the approximation is implemented as a block in a larger system, the preprocessing part can be integrated in the previous blocks, in which case the preprocessing part can be reduced or even excluded.For the postprocessing, similar conditions apply.
In the processing part, the approximation of the original function, f org (x), is directly computed in the way now described.
The architecture synthesized using Harmonized Parabolic Synthesis implements two sub-functions computed in parallel.The first sub-function is implemented as a second-order parabolic function and the second sub-function is implemented using the Second-Degree Interpolation methodology.Figure 5 shows the architecture, where the two sub-functions are implemented by the upper and lower half, respectively, and then combined via a multiplication.
It is worth noting that this, in fact, is a generic architecture, which can be used for approximating several different functions.The set of parameters defines the function.
In the first sub-function, the result of the (x−x 2 ) part is multiplied with c 1 and then added to x.In the proposed methodology, the coefficient c 1 is chosen as described in conjunction with Fig. 3, to reduce the hardware consumption.This implies that the algorithm of the first sub-function, s 1 (x), can be simplified, which also reduces the complexity of the implementation.The second sub-function is implemented as a second-degree   interpolation, consisting of a look-up table containing, each interval i, the coefficients j 2,i from (13).The coefficient j 2,i is multiplied with x w , which is the remaining part of x when the index part i is removed.The value of x w is also the normalized value of x for the interval.After the multiplication, the start value of the interval l 2,i is added.The third branch of the second sub-function consists of a look-up table containing the coefficients c 2,i from (11) for each interval i.The coefficient c 2,i for interval i is multiplied with x w 2 , which is the partial products of x 2 for interval i.To simplify this computation, a special squaring unit to compute x 2 has been developed; this unit is described in [8,10].The benefit of the squarer is that it computes all the partial products to x 2 in the same hardware and that it halves the chip area and computation time compared to a multiplier.The result of the multiplication between c 2,i and x w 2 is subtracted from the result of the previous addition.After the values of the first and second sub-functions are computed, they are multiplied with each other.
The architecture is suitable for pipelining in order to increase the throughput.An example of where pipeline stages can be introduced is shown in Fig. 5. Introduction of pipeline stages in this case is decided by the computation time of the rightmost multiplication, since this is the largest multiplier.To decrease the computation time further the next step will be to also insert pipeline stages in the multipliers.The introduction of data pipelining stages in the architectures will only have little effect on the size of the hardware, since the data paths are few.However, registers consume power, thus power consumption will increase.

Optimizing
The purpose with the optimization of the design is to reduce the chip area, the critical path delay and the power consumption, but also to gain full control over the error in the result of an approximation.The main factor that affects the parameters (chip area, critical path delay, and power consumption) is the data word lengths used in the computations.The main factor behind the need to increase the word lengths is that the errors accumulate through the computations.The characteristics and distribution of the error of the result of the approximation affect the word lengths in the computations in the remaining design.Thus it is important in the approximation to keep the word length of the result as short as possiblebut not shorterand see to that the error distribution is favorable for the remaining part of the algorithm.This can only be achieved if the designer gains full control over the error in the approximation and is able to tailor the characteristics of the error so that they improve the conditions for the subsequent calculations.Unlike many other approximation methodologies, the Parabolic Synthesis methodologies support such tailoring of the characteristics and distribution of the error.In other approximation methodologies, such as CORDIC, the tailoring of the characteristics and distribution of the error is not supported to any significant extent.The effects of the disadvantageous error performance of the CORDIC algorithm are commonly mitigated by increasing the accuracy, as shown in the analysis of error and datapath precision of CORDIC that is presented in [7, p. 125].Of course, increasing the accuracy is synonymous with longer word lengths in the design, which has devastating effects on the hardware efficiency of the implementation.
Finding the optimal set of coefficients and word lengths is, in this stage of the development of the methodologies, an empirical, iterative procedure that is performed manually using MatLab.After deciding the initial coefficients in the first and second sub-functions, the next step is to start the optimization of the architecture.The method for the optimization is to decide the word lengths used in the architecture and then optimize the coefficients.This optimization is performed in an iterative trial and error manner and the evaluation of different coefficient values should be performed in parallel with the evaluation of the word lengths, since the truncation error effects influence the performance of calculations in the design.The strategy is to adjust coefficients and word lengths in the design for best accuracy and distribution of the error.The procedure, when done manually, takes roughly two hours; it can certainly be automated in the futureor at least supported by appropriate tools.In the text below, the optimization strategy will be illustrated using the sine function, since this function is commonly used and has a simple implementation, thus making the steps of the optimization easy to follow.The target accuracy for the implementations is chosen to be around 15 bit.
In practice, the simulation of the approximation is performed with a bit-accurate C model and the performance of the approximation is analyzed in MatLab.

Truncation and Optimization
Truncation always results in a negative offset of the error compared to the non-truncated value, as illustrated in Fig. 6.In the figure, the gray curve shows the error before the truncation and the black is the error after truncation.
Figure 6 shows the errors in both curves winding through the eight intervals.The winding of the curves is caused by the rightmost term in (5), which is the nonlinear part.This term in each interval in the second sub-function, s 2 (x), causes a single winding of the curve.To counteract the negative offset of the error caused by truncation, the coefficients in the second subfunction, s 2 (x), can be adjusted.A favorable way to do this is to seek a solution in which the error has a distribution that is asymptotically normal.An advantage of an asymptotically normal error distribution is that it is centered around zero and the most error values are close to zero.This distribution implies that the number range is optimally utilized which has positive effects on the architecture, in reduced size of hardware and in subsequent calculations to reduce their error.In Fig. 7 the gray curve shows the error before the truncation of the word lengths and the optimization of the coefficients, and the black is the error after truncation and optimization of the coefficients.
When comparing the two figures it can be seen that the error after truncation and optimization, i.e., the one in Fig. 7, is more evenly distributed around zero than the error in the one without optimization.This shows that it is mainly by adjusting the coefficients that the distribution of the error is determined.

The Characteristic Metrics
The error resulting from an approximation can be characterized in several ways [16], the most important being the following: -The maximum absolute error denotes the largest error possible using the approximation.-The mean error denotes the average error over the approximation's sample space.
-The median error denotes the skewness in the error distribution, if there is any.-The standard deviation is a measure of the variation from the mean error.-The Root Mean Square (RMS) error is a measure of the magnitude of a continuously varying quantity of the error.
Furthermore, the evenness of the error distribution is determined by comparing the standard deviation with the RMS error.If the standard deviation and the RMS error are equal, it indicates that the error distribution is symmetric around zero.This is a preferred error distribution.An advantage of the methodology presented in this paper is that it obtains a very good possibility for developing a symmetrically distributed error around zero.To illustrate the distribution of the error a histogram is used, as shown in Fig. 8. Figure 8 shows the distribution of the error after truncation and optimization that was shown in Fig. 7.As shown, the error distribution is asymptotically normal and with a mean value near zero.

Implementation of the Fractional Part of the Logarithm
As an illustration of the Harmonized Parabolic Synthesis methodology, an implementation is presented of an algorithm that performs an approximation of the logarithm function.It is performed on a simple, nonstandard floating-point number format where the mantissa is in the range from 1.0 to 2.0.The approximation is only performed on the mantissa, since the exponential part of the input value is directly used as the integer part of the result and scaling the mantissa, [17,18].
The implementation is in hardware using a 14 bit input.The target accuracy is 15 bits and the target distribution of the error is the normal distribution.
The performance of the approximation is analyzed as described in Section 4.

Preprocessing
As described in Section 2.1, to facilitate the hardware mentation of the approximation, a normalization satisfy that the values are in the interval 0 ≤ x < 1.0 on the x-axis and 0 ≤ y < 1.0 on the y-axis has to be performed.The result of the binary logarithm satisfies that the values are in the interval 0 ≤ y < 1.0.As shown in Fig. 9, the operand v is within 1 ≤ v < 2. To satisfy that the incoming operand x is within the interval 0 ≤ x < 1, a 1 has to be added to the operand as shown in Fig. 9 and (16).
To normalize the f(v) = log 2 (v) function, v is substituted according to (17), which gives the original function, f org (x), shown in (17).
Figure 9 shows the function, f(v), together with the normalized function, f org (x).

Processing
For the processing part, the sub-functions are developed according to the description in Section 2.4.Developing the coefficient c 1 is made with two aspects in mind: the error distribution of the approximation, and the simplicity of the hardware implementation.The number of intervals that the second sub-function, s 2 (x), needs to be divided into, to achieve the intended accuracy, has also to be taken into account when performing the calculation of the error.Figure 10 shows the resulting minimum number of bits of accuracy for the logarithm function when using 1, 2, 4 and 8 intervals in the second sub-function, s 2 (x).
The needed accuracy for the implementation was set to 15 bits and the distribution of the error shall be similar to a normal distribution.Figure 10 shows that using 4 or 8 intervals will fulfill the accuracy demand with some specific ranges of coefficient, c 1 .The requirement to have a distribution of the error that is similar to the normal distribution implies that using 8 intervals would be advantageous since this will give greater margin when developing the second sub-function, s 2 (x). Figure 10 shows that using 8 intervals in the second subfunction will allow the coefficient c 1 to be 0, which in turn will imply that the hardware in the first sub-function, s 1 (x), is reduced, as shown in (18).
Choosing the coefficient c 1 to be 0 will always be beneficial in terms of reduced chip area.With increasing number of intervals, the chip area and the length of the critical path will decrease to a certain point.First, chip area and later the critical path.A disadvantage with increasing the interval is that the distribution of the error will go towards the rectangular distribution since the second sub-function will end up as one lookup table.The choice of 8 intervals is made to illustrate the methodology, not to demonstrate the optimal solution.
The help function, f help (x), is computed as shown in (19).
Figure 11 shows the help function, f help (x).
From the help function, f help (x), an initial second sub-function, s 2 (x), is developed according to the description in Section 2.3 using 8 intervals.
When the initial second sub-function, s 2 (x), is developed, the next task is to achieve a distribution of the error similar to the normal distribution, which is performed as described in Section 4.  Tables 1, 2 and 3 show the developed in (12) for 8 It can be from Table 2 that the three most significant bits after the binary point in the binary representation of the coefficients are zeros.As a result of this, the word length of the coefficients can be reduced from 15 bits to 12 bits.
It can be seen from Table 3 that the seven most significant bits after the binary point in the binary representation of the coefficients are zeros.As a result of this, the word length of the coefficients can be reduced from 16 bits to 9 bits.
When developing the algorithm for the approximation, there is an interaction between the word lengths and the values of the coefficients in the architecture.The word lengths of the coefficients and the data paths in the design are shown in Fig. 12.
In the process of truncating words in the design and optimizing the coefficients, the size and distribution of the error is analyzed and adjusted.Figure 13 shows the error of the final implementation compared to the error before the truncation of the word lengths in the design and optimization of the coefficients.
It can be seen that the error is fairly equal over the interval.
Figure 14 shows the absolute error of the implementation.It shows that the implemented approximation well meets the requirement of an accuracy of about 15 bits.
When optimizing the coefficients, the intention has been to achieve a normal distribution of the error.Such a distribution Table 1 shows the coefficients for the initial value, l 2,i , for each interval, i, in (12).Table 2 shows the coefficients for the gradient, j 2,i , for each interval, i, in (12).15 shows distribution of the error in the optimized implementation.
The in Fig. 15 shows a very high resemblance with the normal distribution.
The conclusion from Table 4 is that the maximum absolute error corresponds to an accuracy of more than 15 bits.The mean error is very small, less than 25 bits, and the median confirms that the error distribution is not skewed.When comparing the standard deviation value with the root mean square value, it can be seen that the values are nearly identical, which confirms that the error of the approximation is symmetrically distributed.

Postprocessing
No postprocessing is needed since the result from the processing part is in the right format.

Comparing Methodologies
This section compares implementations of the fractional part of the logarithm using two different methodologies; on the one hand the Parabolic Synthesis methodology, where the accuracy is decided with the number of sub-functions, and on the other hand, the Harmonized Parabolic Synthesis methodology, where the accuracy is decided with the number of intervals.The comparison will be performed in terms of chip area, critical path delay and power consumption.Both implementations are performed with the same bit accuracy and so that the error is asymptotically normally distributed.
The implementations use identical setups of the design tools and identical cell libraries.The implementations are made with the purpose of comparison, not to maximize performance; hence, potentially performance-increasing features like pipelining are not used.
The implementations are realized as ASICs with a 65 nm Standard-V T (1.2 V) technology.Synopsys Design Compiler [19], Synopsys Primetime [20] and Mentor Graphics Modelsim [21] are used for all implementations.The estimation of the power consumption is done with the help of default switching activity.When comparing different approximation methodologies, it should be noted that, to enable the same performance of the algorithm which they are part methodologies may require precisions.Approximation methodologies should therefore always be compared in the same context.In this case the methodologies have both been designed to have the same accuracy and an error that has a normal distributed around zero.Since they both have about the same characteristic of the error they can be compared without actually putting them in the same context.In paper [12] two implementations using the Parabolic Synthesis methodologies are compared with an implementation using the CORDIC methodology.A problem with this comparison is that the error distribution for the CORDIC methodology has major drawbacks compared to the one of the Parabolic Synthesis methodologies.As shown in [7, p. 125], CORDIC requires several extra bits of accuracy in order to cope with the error in the approximation.This illustrates that, when comparing different implementations, it is essential to use the same context.A poor error distribution of an approximation can corrupt the performance of the context that it is a part of.
The comparison is performed for an approximation of the binary logarithm of a mantissa in the range from 1 through 2. The required accuracy for the implementation is set to 15 bits.
The results of the implementation carried out with the Parabolic Synthesis are based on [11] after a redesign without bonding pads.The results for chip area, path delay, and energy consumption per sample are shown in the upper entry of Table 5.In [11], four sub-functions were used in the architecture.Unfortunately, in the implementation performed with Parabolic Synthesis in [11], no fully performed optimization of the word lengths in the look-up tables containing the coefficients was performed; therefore slightly longer word lengths than needed were used in this part of the design.This yields an impact on the chip area and, to some extent, on the critical path delay of the design.Hence the parameters such as chip area, critical path delay and power consumption are somewhat larger than they need to be.For the implementation using Harmonized Parabolic Synthesis [22], eight intervals are used in the second sub-function.In this case, optimization of the word lengths in the design was fully performed.The results of the implementation are shown in the lower entry of Table 5.
The conclusion from Table 5 is that the Harmonized Parabolic Synthesis is clearly superior to Parabolic Synthesis.The chip area is reduced with 70%, mainly because Parabolic Synthesis requires four sub-functions whereas Harmonized Parabolic Synthesis only needs two.To some extent, also the not fully performed optimization of the Parabolic Synthesis implementation affects its performance.With full optimization, the chip area for the Parabolic Synthesis is estimated to be reduced with roughly 10%.In computation speed, the Harmonized Parabolic Synthesis is 3 times faster.This is primarily due to that the implementation performed with Parabolic Synthesis in the critical path has more and larger multipliers.The smaller chip area and the shorter critical path of the Harmonized Parabolic Synthesis methodology results in an 11 times lower energy consumption.Comparison between Paraboloic Synthesis and the well-spread CORDIC methodology was presented in [12].Combining this with Table 5 gives the conclusion that the Harmonized Parabolic Synthesis provides the following performance compared to CORDIC: Chip area, 39% of the one for CORDIC; critical path delay: 8% of the CORDIC delay; and Energy consumption: 4% of the one for CORDIC.Again it should be mentioned that the Harmonized Parabolic Synthesis methodology has a more favorable error distribution.

Conclusion
This paper has described the Harmonized Parabolic Synthesis methodology for developing hardware approximations of unary functions, such as trigonometric functions, logarithm functions, exponential functions and square root.The methodology is mainly intended for hardware implementation with moderate accuracy in computation intensive applications within, e.g., computer graphics, digital signal processing, communication systems, robotics, astrophysics and fluid physics, as well as in many other application areas.An emerging computation intensive application that serves as a strong motivating example is wireless communications systems with multiple antennas on the transmitter and receiver, known as Multiple-Input Multiple-Output (MIMO).An essential part of the computation in these systems is performed when computing matrix inversions, which are often executed as QR decompositions in which very high throughput is needed.The QR decomposition algorithm mainly requires approximations of trigonometric and inverses.For these computation demands, the Harmonized Parabolic with its benefits in terms of small chip area, short critical path and low energy consumption is most favorable.When going from Parabolic Synthesis to Harmonized Parabolic Synthesis, the major attractiveness lies in how increasing the accuracy is done.In Parabolic Synthesis the way to increase accuracy of the approximation is to introduce more sub-functions.In Harmonized Parabolic Synthesis, which only uses two sub-functions, it is instead done by increasing the number of intervals in the second sub-function.The effect of this is that, with increasing accuracy, the resulting chip area, critical path delay, and energy consumption will show a much slower increase than for Parabolic Synthesis.The Harmonized Parabolic Synthesis has also more extensive opportunities to handle truncation effects and to achieve a desired distribution of the error.The use of second-degree interpolation in the second sub-function also has the positive effect that, compared, to Parabolic Synthesis, extraordinary functions such as the third root and inverse third root also can be approximated.
The architecture based on Harmonized Parabolic Synthesis is, as the one based on Parabolic Synthesis, very suitable for pipelining.Further, both architectures can easily be reused for approximations of other unary functions by simply changing the set of coefficients.
When analyzing the hardware architecture for implementation of the logarithm, it was found that Harmonized Parabolic Synthesis reduced the complexity of the architecture significantly, compared to Parabolic Synthesis.
When analyzing the characteristics of the error for the implementation of the logarithm, it was found that the Harmonized Parabolic Synthesis has a mean error that is near zero and a median error that is zero or near zero.This implies that only a minimum of skewness is present in the error distribution and that the difference between the standard deviation and the root mean square value is very small, which in turn guarantees that the error is evenly distributed.
To get an indication of the implementation performance of the Harmonized Parabolic Synthesis, it has been compared to an implementation carried out with the Parabolic Synthesis.The comparison shows that an approximation based on the Harmonized Parabolic Synthesis methodology results in a chip area that is smaller than with Parabolic Synthesis.The throughput is also higher, and the energy consumption per sample is almost ten times lower.
As noted in Section 3, the resulting architecture when using the Harmonized Parabolic Synthesis methodology is in fact generic.That is, the choice of function is given by the (small) table of coefficients.In earlier papers, it was shown that Parabolic Synthesis can be used with good results on several functions (e.g., the sine function [9] and the exponential function [11]).Since the Harmonized version of Parabolic Synthesis, presented in the current paper, can be applied to any functions (even without the limitations of earlier methods) it can be concluded that it is an efficient method for a wide class of functions.

Figure 2
Figure 2 Description of the development of the second-degree interpolation.

Figure 4
Figure 4The three parts of the hardware architecture.

Figure 3
Figure 3 The bit accuracy depending on c 1 for 1, 2, 4 and 8 intervals in the second sub-function, s 2 (x).

Figure 5
Figure 5 The hardware architecture of the design based on the Harmonized Parabolic Synthesis containing two sub-functions and with pipeline stages.

Figure 7
Figure7The error before and after truncation and optimization.

Figure 6
Figure6The error before and after truncation of the approximated value.

Figure 8
Figure8The error distribution after truncation and optimization of the approximation shown in Fig.7.

Figure 10
Figure 10 The bit accuracy depending on c 1 for 1, 2, 4 and 8 intervals in the second sub-function, s 2 (x), of the logarithm.

Figure 9
Figure 9 The function f(v) before normalization and the normalized function, f org (x).

Figure 11
Figure11The help function, f help (x), for the approximation of the logarithm.

Figure 12
Figure 12The hardware architecture of the implementation of the logarithm, with the word lengths.

Figure 13
Figure13The error of the implementation of the logarithm before and after truncation and coefficient optimization.

Figure 14
Figure14The absolute error in bits of the approximation of the logarithm.

Figure 15
Figure 15The error distribution of the implemented approximation.

Table 2
Coefficient j 2,i

Table 3
Coefficient c 2,i is beneficial in the calculations in most algorithms.Figure

Table 4
shows the error statistics of the implementation.

Table 4 :
Error statistics

Table 5
Comparison of chip area, critical path delay and energy consumption per sample.