Design of RNS Reverse Converters with Constant Shifting to Residue Datapath Channels

This paper presents a new general approach to simplifying residue-to-binary (reverse) converters for a Residue Number System (RNS) composed of an arbitrary set of moduli. We suggest formulating the basic equation of the reverse converter as two separate parts: one that depends on the input variables of the converter and one that is a single constant. The constant, instead of being added inside the reverse converter, can then be shifted out to the residue datapath channels, in most cases at no hardware cost or extra delay. Thus, the hardware cost of the converter is reduced, because its multi-operand adder has one operand fewer to handle. To illustrate the design issues of this approach and to demonstrate its efficiency, a new design method of reverse converters for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ is considered. Two versions of the new converters for this moduli set, as well as several of their known counterparts, were synthesized for all dynamic ranges from 8 to 38 bits (i.e., for $3 \le n \le 13$). The results suggest that, compared to the best state-of-the-art converters, at least one of the two versions of our converters is superior with respect to area and power consumption for all dynamic ranges considered, in some cases accompanied by a slight delay reduction. The area is reduced by about 5% to about 20%, and the largest savings are observed for the power consumption: from over 10% up to 27%.


Introduction
The Residue Number System (RNS) offers several well-documented advantages over the conventional 2's complement binary number system [16]. One of them is that the basic arithmetic operations of addition, subtraction, and multiplication can be carried out simultaneously in a number of parallel independent datapaths on relatively short numbers. It is therefore particularly well suited for hardware implementation of a computational problem typical of many Digital Signal Processing (DSP) systems, namely the calculation of the vector inner product (the sum of products)
$$S_N = \sum_{j=1}^{N} C_j X_j, \qquad (1)$$
where $S_N$ is the numerical value of the function computed, $X_j$ is the $j$-th of the series of $N$ input operands, and $C_j$ is the $j$-th of the series of $N$ a priori known coefficients (which could be loaded at the system initialization). The RNS representation allows inner product computations like that of Eq. 1 to be executed using virtually carry-free arithmetic, which yields area, time, and power consumption savings compared to its 2's complement positional counterpart.
Most digital systems operate on data using a positional representation of numbers, hence using a non-positional RNS representation of numbers in some computational blocks requires conversions of the numbers back and forth to RNS, performed respectively by reverse and forward converters. Because, unlike residue datapaths executing useful computations, both converters are a pure overhead, it is desirable to maximally reduce their area, delay, and power consumption. The forward conversion is conceptually relatively simple, because it is an extraction of residues from a positional number, which can be implemented in hardware using residue generators [21,23]. On the other hand, the reverse conversion requires application of special methods, amongst which the most commonly used are the Chinese Remainder Theorem (CRT) and the Mixed-Radix Conversion (MRC) [16], whose efficient hardware implementations are significantly more difficult to design than forward conversion.
The main goal of this paper is to propose a new approach that can improve all characteristics of reverse converters for an arbitrary set of moduli. It relies on the assumption that the reverse converter equation is given in a form in which the variable terms and a constant are separated. Although the mathematical expression of such an equation can be easily obtained, the main difficulty lies in formulating it so that it can be efficiently implemented in hardware. Once such an equation has been found, the addition of the constant can be shifted from the reverse converter to the residue datapath channels, thus reducing by one the number of operands handled by the reverse converter. We will show that this can result in area, delay, and power consumption savings, which can be achieved at virtually no cost. We first suggested this idea in [19], but only for the particular case of reverse converters for the two RNS moduli sets $\{2^k, 2^n-1, 2^n+1, 2^{n-1}-1\}$ and $\{2^k, 2^n-1, 2^n+1, 2^{n+1}-1\}$ ($n$ even). Moreover, [19] presented neither a discussion of the possible performance degradation of the residue datapaths nor the feasibility of applying this approach to other RNS moduli sets.
The choice of the moduli set of an RNS significantly affects the performance of residue datapaths. Particularly efficient implementations of all basic modulo arithmetic operations and binary-to-residue (forward) converters exist for the powers-of-two related moduli of the forms $2^n$ and $2^n \pm 1$, which are hence considered low-cost. Consequently, in the search for efficient RNS-based hardware implementations, several special moduli sets composed exclusively of low-cost moduli have been proposed. The most intensively investigated has been the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ introduced in 1978 by Jenkins [14], which offers a $(3n-1)$-bit dynamic range with 3-bit resolution: e.g., for $n$ = 6, 7, and 8, the dynamic ranges available are respectively 17, 20, and 23 bits. Over the years, several reverse converters with steadily improved parameters have been proposed for this RNS [2-5, 9, 11-13, 22, 31, 32, 34, 35]. Amongst them, the most hardware-efficient and high-speed converters can be designed using the methods from [12,34,35]. Note also that a reverse converter for this moduli set can be obtained using the recently proposed design methods of the converter for the general 3-moduli set $\{2^n-1, 2^k, 2^n+1\}$ with the flexible even modulus $2^k$ [8,36], applied for the special case of $k = n$. Some specific applications of the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ include Finite Impulse Response (FIR) filters [10,15,24,28]. However, the introduction in 2007 of the flexible 3-moduli set $\{2^n-1, 2^k, 2^n+1\}$, accompanied by an efficient reverse converter for $k \le 2n$ [8], has superseded the 3-moduli set with only a single parameter $n$. This is because, for the dynamic range of $3n-1$ bits offered by the single-parameter set, the equivalent 3-moduli set $\{2^{n-1}-1, 2^{n+1}, 2^{n-1}+1\}$ results in faster and more area-efficient residue datapaths. Although no specific designs have been explicitly considered in the literature, the overall complexity figures of the latter can be easily obtained from the complexity characteristics of the Multiply-Accumulate units (MACs) mod $2^n \pm 1$ and $2^k$ for all the moduli sets concerned, provided in [10,15,25].
As a vehicle to present various facets of constant shifting to the residue datapath channels, to illustrate its feasibility, and to show its efficiency, we have chosen the design of a new reverse converter for the 3-moduli RNS $\{2^n-1, 2^n, 2^n+1\}$. (It is worth noting that although several reverse converter functions for this moduli set have been proposed to date, none of those referred to earlier is readily amenable to such a transformation.) Obviously, in the context of the state of the art presented above, building another reverse converter for the 3-moduli set $\{2^n, 2^n-1, 2^n+1\}$, even with slightly improved performance, could seem hardly justifiable. Of significantly more interest, however, is the possibility of using it not as a stand-alone circuit but as the main building block in various reverse converters for larger multi-moduli RNSs formed by extending the basic 3-moduli set. The design of the latter architectures is based on the premises of the MRC algorithm. The 3-moduli set $\{2^n, 2^n-1, 2^n+1\}$ is one of the most commonly used for such an approach. Examples are the reverse converters proposed recently for the special balanced 4-moduli sets $\{2^n, 2^n-1, 2^n+1, 2^{n+1}-1\}$ ($n$ even) [1,7], $\{2^n, 2^n-1, 2^n+1, 2^{n-1}-1\}$ ($n$ even) [7], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1\}$ ($n$ odd) [1,30], and $\{2^n+1, 2^n-1, 2^n, 2^{n-1}+1\}$ ($n$ odd) [20], the 5-moduli sets $\{2^{n+1}, 2^n-1, 2^n+1, 2^n+2^{(n+1)/2}+1, 2^n-2^{(n+1)/2}+1\}$ ($n$ odd) [26], $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1, 2^{n-1}-1\}$ ($n$ even) [6], and $\{2^n-1, 2^n, 2^n+1, 2^{n+1}+1, 2^{n-1}+1\}$ ($n$ odd) [18], as well as many more, e.g., those mentioned in [27]. Hence, the reverse converter for the above 3-moduli set is one of the most crucial parts, whose characteristics significantly affect the performance of the whole family of converters.
This paper is organized as follows. In Section 2, the theoretical background on RNS and the basic properties of arithmetic mod $2^n-1$ are presented. In Section 3, the general problem of designing a reverse converter for an arbitrary RNS moduli set, whose basic equation allows for shifting a constant to residue datapath channels, is presented together with some design suggestions. In Section 4, the concept of shifting a constant to residue datapath channels is illustrated on the example of a new design method of reverse converters for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$, of which two versions are presented. In Sections 5 and 6, the complexity evaluation of the reverse converters for the 3-moduli set, both those proposed here and their most efficient existing counterparts, is presented: it includes the gate-level estimations of complexity figures as well as more accurate evaluations of the delay, area, and power efficiency of all circuits synthesized in 65 nm technology. Finally, Section 7 presents some conclusions and suggestions for future research.

Basic Notions
Let $X$ be an integer and $m$ be a positive integer called a modulus. We define the residue of $X$ modulo (mod) $m$ as the remainder of the integer division of $X$ by $m$, denoted $x = X \bmod m$ or $x = |X|_m$, where usually $0 \le |X|_m \le m-1$ (one notable exception is the widely accepted double representation of zero mod $2^n-1$, which has two equivalent representations of $n$ 0's $(0 \ldots 00)$ and $n$ 1's $(1 \ldots 11)$, because in this case $0 \le |X|_{2^n-1} \le 2^n-1$). An RNS is a set of numbers defined by a set of $r$ pairwise prime moduli $\{m_1, \ldots, m_r\}$. The dynamic range $M$ of an RNS is the product of its moduli, i.e., $M = \prod_{i=1}^{r} m_i$; hence, a positional binary representation of any RNS number occupies $a = \lceil \log_2 M \rceil$ bits. Any number $0 \le X < M$ in an RNS can be represented by an ordered set of residues $\{x_1, \ldots, x_r\}$, where $x_i = |X|_{m_i}$. Let $X$, $Y$, and $Z$ be integers $0 \le X, Y, Z < M$ represented by the sets of residues as above. Then, any arithmetic operation $\circ \in \{+, -, \times\}$ on these integers yielding $Z = |X \circ Y|_M$ is equivalent to the same arithmetic operation executed on their independent residues, i.e., $z_i = |x_i \circ y_i|_{m_i}$. These operations executed on relatively short operands are the main advantage of the RNS representation over its 2's complement positional integer counterpart.
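As a small illustration of the residue-wise arithmetic described above, the following Python sketch encodes two integers in an RNS and performs addition, subtraction, and multiplication independently in each channel. The helper name `to_rns`, the choice of moduli, and the sample operands are illustrative assumptions, not taken from the original design.

```python
from math import prod

def to_rns(x, moduli):
    """Return the ordered residues of x for the given pairwise prime moduli."""
    return tuple(x % m for m in moduli)

# Illustrative choice: the 3-moduli set {2^n - 1, 2^n, 2^n + 1} with n = 7.
moduli = (127, 128, 129)
M = prod(moduli)              # dynamic range: 127 * 128 * 129 = 2097024

X, Y = 1234, 5678             # arbitrary sample operands with 0 <= X, Y < M
x, y = to_rns(X, moduli), to_rns(Y, moduli)

# Each channel operates only on its own short residues.
z_add = tuple((a + b) % m for a, b, m in zip(x, y, moduli))
z_sub = tuple((a - b) % m for a, b, m in zip(x, y, moduli))
z_mul = tuple((a * b) % m for a, b, m in zip(x, y, moduli))
```

Each channel result agrees with the residues of the corresponding operation carried out mod $M$ on the original integers.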
Reverse converters, which are used to convert from the RNS to the positional representation of numbers, can be designed using the following general methods [16,33].
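-The Chinese Remainder Theorem (CRT):
$$X = \left| \sum_{i=1}^{r} \hat{m}_i \left| \hat{m}_i^{-1} x_i \right|_{m_i} \right|_M, \qquad (2)$$
where $\hat{m}_i = \prod_{j=1, j \ne i}^{r} m_j$, i.e., $\hat{m}_i$ is the product of all moduli but $m_i$, and $\hat{m}_i^{-1}$ is its multiplicative inverse mod $m_i$.
-The New CRT:
$$X = x_1 + m_1 \left| k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + \ldots + k_{r-1} m_2 \cdots m_{r-1} (x_r - x_{r-1}) \right|_{m_2 \cdots m_r}, \qquad (3)$$
where $|k_1 m_1|_{m_2 \cdots m_r} = 1$, $|k_2 m_1 m_2|_{m_3 \cdots m_r} = 1$, ..., $|k_{r-1} m_1 \cdots m_{r-1}|_{m_r} = 1$.
-The Mixed-Radix Conversion (MRC), which is an iterative method given by
$$X = d_1 + d_2 m_1 + d_3 m_1 m_2 + \ldots + d_r m_1 m_2 \cdots m_{r-1}, \qquad (4)$$
where the mixed-radix digits $r$-tuple $\{d_1, d_2, \ldots, d_r\}$ is defined by $d_1 = x_1$, $d_2 = |(x_2 - d_1) m_1^{-1}|_{m_2}$, $d_3 = |((x_3 - d_1) m_1^{-1} - d_2) m_2^{-1}|_{m_3}$, and so on up to $d_r$.
For the special case of the 2-moduli set $\{m_1, m_2\}$ and two respective residues $\{x_1, x_2\}$, which will be needed in Section 4, the simplified version of Eq. 4 is
$$X = x_1 + m_1 \left| m_1^{-1} (x_2 - x_1) \right|_{m_2}. \qquad (5)$$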
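The CRT and MRC methods can be sketched in a few lines of Python. This is a behavioural model only, assuming Python 3.8+ (for `pow` with a negative exponent and `math.prod`); the function names `crt` and `mrc` are illustrative and say nothing about how a hardware converter would be organized.

```python
from math import prod

def crt(residues, moduli):
    """Reverse conversion by the CRT: X = | sum m_hat_i * |m_hat_i^-1 * x_i|_mi |_M."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        m_hat = M // m_i                       # product of all moduli but m_i
        inv = pow(m_hat, -1, m_i)              # multiplicative inverse of m_hat mod m_i
        total += m_hat * ((inv * x_i) % m_i)
    return total % M

def mrc(residues, moduli):
    """Reverse conversion by the MRC: X = d1 + d2*m1 + d3*m1*m2 + ..."""
    digits = []
    for i, (x_i, m_i) in enumerate(zip(residues, moduli)):
        d = x_i
        for j in range(i):                     # peel off the already known digits
            d = ((d - digits[j]) * pow(moduli[j], -1, m_i)) % m_i
        digits.append(d)
    X, weight = 0, 1
    for d, m in zip(digits, moduli):
        X += d * weight                        # no final reduction mod M is needed
        weight *= m
    return X
```

For two moduli, `mrc` reduces to the simplified form $X = x_1 + m_1 |m_1^{-1}(x_2 - x_1)|_{m_2}$.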

Properties of Arithmetic Modulo $2^n - 1$
For the reader's convenience, all properties needed in the case study given in Section 4 are presented here. Let $n$, $s$, and $d$ be arbitrary positive integers, and $z$, $x$, $y$ be non-negative integers such that $0 \le z, x \le 2^n - 1$ and $0 \le y \le 2^{sn} - 1$.
The binary representations of z, x, and y are respectively (z n−1 . . . z 0 ), (x n−1 . . . x 0 ), and (y sn−1 . . . y 0 ). As for y, if it is more than (s − 1)n-bit and less than sn-bit number, then it is preceded by some leading zeros on the most significant bit positions. The symbol denotes the concatenation of binary vectors. Then, some basic arithmetic operations mod 2 n − 1 are performed as follows.
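-The sign change of $z$ mod $2^n-1$ (the additive inverse of $z$) is obtained by bit-by-bit complementing of all bits, i.e., $|-z|_{2^n-1} = \bar{z} = (\bar{z}_{n-1} \ldots \bar{z}_0)$.
-The multiplication mod $2^n-1$ of $z$ by $2^d$ is a left cyclic shift of $z$ by $d$ bit positions, i.e., for $0 < d < n$, $|2^d z|_{2^n-1} = (z_{n-d-1} \ldots z_0 \, z_{n-1} \ldots z_{n-d})$; similarly, the multiplication by $2^{-d}$ is a right cyclic shift by $d$ bit positions.
-For any non-negative integer $i$, $|2^{in}|_{2^n-1} = 1$. Therefore, for an $(sn)$-bit number $y$ we have $|y|_{2^n-1} = \left| \sum_{i=0}^{s-1} 2^{in} Y_i \right|_{2^n-1} = \left| \sum_{i=0}^{s-1} Y_i \right|_{2^n-1}$, where $Y_i = (y_{(i+1)n-1} \ldots y_{in})$ is the $i$-th $n$-bit part of $y$; i.e., to obtain the residue $y$ mod $2^n-1$, it simply suffices to partition $y$ into $s$ $n$-bit parts and sum up all of them mod $2^n-1$.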
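The three properties above can be modelled with integer bit operations. The sketch below is illustrative only; the function names are hypothetical, and results are compared mod $2^n - 1$ because the all-ones pattern is the second representation of zero.

```python
def negate_mod(z, n):
    """Additive inverse mod 2^n - 1: complement all n bits of z."""
    return z ^ ((1 << n) - 1)

def mul_pow2_mod(z, d, n):
    """Multiply z by 2^d mod 2^n - 1: left cyclic shift of the n-bit vector by d."""
    d %= n
    mask = (1 << n) - 1
    return ((z << d) | (z >> (n - d))) & mask

def residue_by_folding(y, n):
    """Residue of y mod 2^n - 1: add up the n-bit parts of y, folding carries back."""
    mask = (1 << n) - 1
    while y > mask:
        y = (y & mask) + (y >> n)
    return y          # may be the all-ones pattern, i.e. the second form of zero
```

A right cyclic shift (multiplication by $2^{-d}$) is obtained as `mul_pow2_mod(z, n - d, n)`, since $2^n \equiv 1$ mod $2^n - 1$.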

General Scheme of Constant Shifting for an Arbitrary RNS Moduli Set
In this section, we present the general approach that can be used to simplify reverse converters for an arbitrary RNS moduli set, which relies on shifting constants from a reverse converter to residue datapath channels. Specifically, any RNS reverse converter can be seen as a particular modular datapath composed of a number of constant multiplications, bit-level manipulations, and finally additions. Because these arithmetic and logic operations may involve some variable or fixed operands, whose number and length determine the amount of required hardware and the performance, any cost reduction boils down to reducing the number of executed operations and, specifically, the number of addition operands. Our goal is to consider the possibility of transforming the functions of the reverse converter into such a form that all constants are accumulated into a single one, which can then be shifted out to the residue datapath, thus reducing the number of operands handled inside the reverse converter. As each reverse converter for a given moduli set is a kind of custom-designed datapath, isolating the constants can be done only on a case-by-case basis, depending on the moduli set and the architecture of the converter. The inspiration to present such a general framework for an arbitrary RNS reverse converter came from our earlier results obtained for the reverse converters for the two RNS moduli sets $\{2^k, 2^n-1, 2^n+1, 2^{n-1}-1\}$ and $\{2^k, 2^n-1, 2^n+1, 2^{n+1}-1\}$ ($n$ even) [19]. Inspection of the basic formulas for the reverse conversion given by Eqs. 2-4 reveals that the positional representation of the output variable $X$ can only be the sum mod $M$ (for the CRT and the New CRT) or the simple sum (for the MRC) of products of only one residue $x_i$ by some constant $b_i$ ($1 \le i \le r$), i.e., there never occurs any term involving the product of two residues like $x_i x_j$.
In particular, any CRT-based reverse converter function can be expressed in a generic form as
$$X = \left| \sum_{i=1}^{r} b_i x_i + C_k \right|_M, \qquad (6)$$
where $b_i$ are some constant coefficients of the residues $x_i$ and $C_k$ is some total constant. It is important to recall that, because it is implicitly assumed that the RNS considered is a non-redundant one, each of the $r$ moduli contributes to the dynamic range $M = \prod_{i=1}^{r} m_i$. Therefore, $X$ must depend on all coefficients $b_i$, which implies that all $b_i \ne 0$ and the final addition has at least $r$ operands. The actual number of operands of the latter, $p \ge r$, depends on the possibility of obtaining simple binary expressions for $|b_i x_i|_M$ or $b_i x_i$ (for the MRC). Although either of the latter products can be implemented using ROM look-up tables, on the one hand this could be prohibitively complex for larger $a_i$, and on the other hand, for many moduli sets with arbitrary $a_i$ the most efficient implementations have been obtained using arithmetic circuits (adders and subtractors) and some bit manipulations.
Obviously, a simple arithmetic expression like Eq. 6 does not necessarily imply that a simple hardware implementation with the smallest possible number of operands can be obtained, even through skillful bit manipulations. Nevertheless, the motivation for reducing the number of operands stems from the fact that it not only contributes to reducing area and power consumption (eliminating one operand removes one carry-save adder (CSA) of up to $a$ bits, i.e., up to $a$ full-adders (FAs) or half-adders (HAs), depending on whether the operand is a variable or a constant), but in some cases can also reduce by one the number of CSA stages of the CSA tree that processes multiple operands to obtain the final result $X$. Now consider the RNS-based implementation of Eq. 1, which takes the form
$$|S_N|_{m_i} = \left| \sum_{j=1}^{N} |C_j|_{m_i} \, |X_j|_{m_i} \right|_{m_i}, \quad 1 \le i \le r. \qquad (7)$$
We assume that Eq. 7 is implemented using $N$ Multiply-Accumulate units (MACs), in which the $j$-th stage of the datapath mod $m_i$ computes
$$|S_j|_{m_i} = \left| |S_{j-1}|_{m_i} + |C_j|_{m_i} \, |X_j|_{m_i} \right|_{m_i}, \qquad (8)$$
where the initial values are generally assumed $|S_0|_{m_i} = 0$. If the function of the reverse converter can be expressed as in Eq. 6, then the residue datapath channels followed by the reverse converter can be implemented as shown in the upper part of Fig. 1. Now observe that the addition of the constant $|C_k|_M$ in the reverse converter is equivalent to the set of additions of $r$ constants $|c_i|_{m_i}$ in the residue datapaths, where $c_i = \big| |C_k|_M \big|_{m_i}$, $1 \le i \le r$. Therefore, the addition of the constant $|C_k|_M$ can be shifted out from the $(p+1)$-operand CSA tree mod $M$ of the reverse converter, because it can be taken into account earlier, at the stage of the RNS datapath computations modulo all moduli $m_i$, resulting in the hardware implementation shown in the lower part of Fig. 1. The case study presented in the following sections will show that in most cases such a modification can be done at no hardware cost.
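The equivalence between adding $|C_k|_M$ inside the converter and adding $c_i = |C_k|_{m_i}$ in the channels can be checked with a small behavioural model. In the sketch below, the moduli, the coefficients, the inputs, and the value of `C_k` are all hypothetical, and the variable part of the converter is modelled by a plain CRT rather than by any particular hardware-oriented formulation.

```python
from math import prod

def mac_channel(coeffs, inputs, m, s0=0):
    """Residue MAC chain S_j = |S_{j-1} + C_j * X_j|_m, starting from S_0 = s0."""
    s = s0 % m
    for c, x in zip(coeffs, inputs):
        s = (s + (c % m) * (x % m)) % m
    return s

def crt(residues, moduli):
    """Plain CRT reverse conversion, used here as a stand-in for the converter."""
    M = prod(moduli)
    return sum((M // m) * ((pow(M // m, -1, m) * x) % m)
               for x, m in zip(residues, moduli)) % M

moduli = (7, 9, 11, 13)        # hypothetical pairwise prime moduli
M = prod(moduli)
coeffs = [3, 14, 15, 9]        # hypothetical coefficients C_j
inputs = [26, 5, 35, 8]        # hypothetical operands X_j
C_k = 1234                     # hypothetical total constant of the converter

# Original scheme: channels start from 0 and the converter adds C_k mod M.
plain = [mac_channel(coeffs, inputs, m) for m in moduli]
x_original = (crt(plain, moduli) + C_k) % M

# Shifted scheme: channel i starts from c_i = |C_k|_{m_i}; the converter adds nothing.
preset = [mac_channel(coeffs, inputs, m, s0=C_k % m) for m in moduli]
x_shifted = crt(preset, moduli)
```

Both schemes deliver the same value, while the shifted scheme has one operand fewer in the final multi-operand addition.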
To facilitate finding the reverse converter function amenable for shifting out constants to residue datapaths, we suggest to proceed as follows.
-Identify all modular datapaths inside the reverse converter.
The reverse converter is a network of interconnected modular datapaths, whose architecture depends both on the moduli set and the general approach taken by a designer of the converter, resulting in a composition of CRT, New CRT, and MRC techniques. As such, each part of a converter is a chain of additions and multiplications of the residues x i by a constant. Hence, the first step is to identify such parts in the converter.
-Determine the value of constants added and multiplied inside these datapaths and calculate the cumulative value of the constant for each of the residue datapaths.
The most common case of a constant addition is when numbers are subtracted, following the identity $-x_i = \bar{x}_i - 1$. Alternatively, the addition mod $m_i$ of a negative number is expressed as $-x_i = m_i - x_i$. Note also that each residue datapath channel could have its own cumulative constant.
Figure 1 General scheme of shifting constants from the reverse converter for an arbitrary RNS moduli set to residue datapath channels, according to Eq. 6 (block labels: preparation of operands, multiplication by constants).
-Calculate the constants for each channel.
Each datapath has its own modulus $m_i$, which can be either an individual modulus from the moduli set (e.g., as in Eq. 6) or a product of moduli if a hierarchical structure of the converter is used (one example can be found in [19]). In the latter case, the value of the constant for each respective channel is the residue of the constant for the datapath; otherwise, the constant is added only in one channel. Because various datapaths inside a converter may result in various constants, in such a case the constants are accumulated in channels modulo their respective moduli and added only once.
-Reformulate and simplify the functions of the converter with shifted out constants, including reorganization of modular CSAs and new alternative bit-level manipulations.
With the constants eliminated from a given datapath of the reverse converter, ones can be replaced with zeros on all relevant bits. The latter is done by replacing FAs in one stage of the CSA tree with HAs or even by completely eliminating one stage of them. Moreover, if some bit combinations in datapaths do not occur, FAs can be replaced with simpler AND/OR logic operators.
In the next section, we will show how to apply these general steps to the special case of the new reverse converter for the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$.

Design of Reverse Converters for the 3-Moduli Set $\{2^n-1, 2^n, 2^n+1\}$
In this section, we will first detail a new method for designing reverse converters that implements a newly introduced set of equations in which variable and constant terms could be separated. Then, we will explain a new approach to improving converter performance that relies on shifting constants added by the converter out to the datapath channels and will argue that the latter operation in most cases can be done at no cost. A general logic scheme of the new converter is presented in Fig. 2.

New Basic Functions
The basic functions of the reverse converter for the set of three moduli $\{m_1, m_2, m_3\} = \{2^n, 2^n-1, 2^n+1\}$ have been given in many papers, notably in [12,34,35], which proposed the designs currently considered the most efficient. These three methods start with the CRT (or the so-called New CRT in the case of [34]) and then reduce the conversion problem to a Multi-Operand Mod $2^{2n}-1$ Addition (MOMA mod $2^{2n}-1$). Because the final addition mod $2^{2n}-1$ seems inevitable in this RNS, all efforts to increase the efficiency of the converter have concentrated on reducing the number of MOMA operands. There are three operands in the case of [12,35] and four in the case of [8,34,36] (although in [34] the fourth operand is reduced to one bit). The varying composition of MOMA operands is mainly achieved by various bit-level manipulations, with the exception of the converter of [36]. In all previous designs, the MOMA operands are composed of variable parts depending on the three input residues $x_1$, $x_2$, and $x_3$, mixed up with some constants. We have analyzed the equations of all of the above converters and have not found any obvious method to separate the constants from the variable parts. Our idea is to show that, if a constant can be isolated from the variable parts, it is unnecessary to add that constant within the converter. Therefore, we propose first to accumulate the constants into a single total constant and then to move its addition out of the converter to the residue datapath channels. It will be seen that this approach gives new options for bit-level manipulations, being either a simplified version of the existing design from [35] or a new proposition that has not yet been explored in the open literature. In Section 4.4.2, we will argue that the addition of the constants in the residue datapaths can in most cases be done at no cost and that in a few special cases its cost is negligible in comparison to the benefits resulting from the simplification of the converter. Our new design method requires the execution of the following two steps, which are eventually merged into a single arithmetic block.
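The first step relies on applying the CRT of Eq. 2 to the 2-moduli set $\{m_2, m_3\} = \{2^n-1, 2^n+1\}$, which provides
$$X_{23} = \left| 2^{n-1}(2^n+1)\,x_2 + 2^{n-1}(2^n-1)\,x_3 \right|_{2^{2n}-1}. \qquad (9)$$
The second step relies on applying the MRC of Eq. 5 (similarly as in [29]) to the 2-moduli set $\{m_1, m_2 m_3\}$, which provides the final result $X$ for the pair of residues $\{x_1, X_{23}\}$, given by
$$X = x_1 + 2^n X_h, \qquad (10)$$
where the term $X_h = |2^{-n}(X_{23} - x_1)|_{2^{2n}-1}$ can be computed by replacing $X_{23}$ with its expression of Eq. 9, resulting in
$$X_h = \left| -2^{-n} x_1 + 2^{-1}(2^n+1)\,x_2 + 2^{-1}(2^n-1)\,x_3 \right|_{2^{2n}-1}. \qquad (11)$$
The three terms which appear in the above equation will be transformed through some bit-level manipulations to obtain expressions that are not only better adapted for implementation using logic gates, but also have the advantage that the variable parts (i.e., those depending on the three input residues $\{x_1, x_2, x_3\}$) are separated from the constant components. For the first term, we obtain the equation
$$\left| -2^{-n} x_1 \right|_{2^{2n}-1} = \left| 2^{-n} \bar{x}_1 + (2^{-n} - 1) \right|_{2^{2n}-1}, \qquad (12)$$
where $\bar{x}_1$ is the bitwise complement of the $n$-bit residue $x_1$; it contains the variable term depending on $x_1$ and the constant term equal to $|2^{-n} - 1|_{2^{2n}-1}$. The former term can be conveniently presented with the factor $2^{-1}$ kept explicit (Eq. 13). Note: because the multiplication by $2^{-1}$ will appear in the modified expressions for all three variable terms, we have intentionally left it in place instead of replacing it with the right cyclic shift by one bit.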
In the second term, because x 2 (2 n + 1) = 2 n x 2 + x 2 and x 2 is an n-bit number, the addition of 2 n x 2 to x 2 is a simple concatenation of two variables x 2 , which yields Finally, the third term can be rewritten as Now, consider the variable part of Eq. 15 involving x 3 , denoted by A: First, note that if x 3 < 2 n then x 3,n = 0, so that Otherwise, if x 3,n = 1, then x 3,i = 0 for 0 ≤ i ≤ n − 1, so that Consequently, we consider two special cases depending on the value of x 3,n . First, note that if x 3,n = 1, then all other bits of x 3 are equal to 0. We will try to explore this fact either to reduce the size of A from two vectors in Eq. 16 to one vector, or to change the distribution of the bits of x 3 in such a way that it would enable merging it with the other operands, e.g., with the vector of Eq. 13. We have found two expressions for variable A of which the first is inspired by [35], while the second has not been explored in the open literature yet: 1. Addition of the constant 2 n and selective setting of n most-significant bits of the first vector as in Eqs. 17 and 18 yields One can easily verify that Eq. 17 (18) can be obtained mod 2 2n − 1 by setting x 3,n = 0 (x 3,n = 1) in Eqs. 19 and 20. Now, depending on which of Eqs. 19 and 20 is selected, two alternative versions of the converter can be obtained.

Version 1
First, we merge the constants $|2^{-n} - 1|_{2^{2n}-1}$, $|2^{-1}(1 - 2^{n+1})|_{2^{2n}-1}$, and $|2^n 2^{-1}|_{2^{2n}-1}$, which appear in respective Eqs. 12, 15, and 19, to obtain one total constant $C_k$ given by
$$C_k = \left| (2^{-n} - 1) + 2^{-1}(1 - 2^{n+1}) + 2^n 2^{-1} \right|_{2^{2n}-1} = \left| 2^{2n-1} + 2^{n-1} - 1 \right|_{2^{2n}-1}. \qquad (21)$$
Second, we define new variables $w_1$, $w_2$, and $w_3$ formed from the variable terms of respective Eqs. 12, 14, and 19, each involving exactly one of the input residues $x_1$, $x_2$, and $x_3$, respectively. For convenience, when rewriting Eqs. 12, 14, and 19 we have performed the multiplication by $|2^{-1}|_{2^{2n}-1}$ (here, it is a right cyclic shift by one bit over $2n$ bits) to obtain the three raw binary vectors $w_1$, $w_2$, and $w_3$ of Eqs. 22-24. Now Eq. 11 can be rewritten using these new variables $w_1$, $w_2$, $w_3$, and the constant $C_k$ as
$$X_h = \left| w_1 + w_2 + w_3 + C_k \right|_{2^{2n}-1}. \qquad (25)$$
Version 1 of the converter can be obtained by substituting in Eq. 10 $X_h$ with its expression of Eq. 25 and designing all circuitry as shown in Fig. 2a-c. We have intentionally omitted $C_k$ in Fig. 2, as it is expected to be already included in the input residues $x_2$ and $x_3$, as described below in Section 4.4.

Version 2
First, we combine the variable terms from Eqs. 13 and 20 into a single expression (Eq. 26). Next, we merge the constants $|2^{-n} - 1|_{2^{2n}-1}$ and $|2^{-1}(1 - 2^{n+1})|_{2^{2n}-1}$ from respective Eqs. 12 and 15 into one total constant $C_k$ given by
$$C_k = \left| (2^{-n} - 1) + 2^{-1}(1 - 2^{n+1}) \right|_{2^{2n}-1} = \left| 2^{2n-1} - 1 \right|_{2^{2n}-1}. \qquad (27)$$
As in Version 1, we define two variables $w_1$ and $w_3$ involving the input residues $x_1$ and $x_3$ from Eq. 26 (which are actually slightly different than in Version 1), whereas the variable $w_2$ remains the same as before in Eq. 23. When rewriting Eq. 26 we have performed the multiplication by $|2^{-1}|_{2^{2n}-1}$ (the right cyclic shift by one bit over $2n$ bits) to obtain the two raw binary vectors $w_1$ and $w_3$ of Eqs. 28 and 29. Version 2 of the converter, whose circuitry is shown in Fig. 2a, b, and d, can be obtained by substituting in Eq. 10 $X_h$ with its expression of Eq. 25, wherein the variables $w_1$, $w_3$, and $C_k$ are given respectively by Eqs. 28, 29, and 27.

Elimination of the Constant $C_k$
In this section, we will show how the general ideas presented in Section 3 can be applied in practice.

Theoretical Background
Besides the three variable components $w_1$, $w_2$, and $w_3$, Eq. 25 for $X_h$ contains one global constant $C_k$, given by Eq. 21 for Version 1 and by Eq. 27 for Version 2. Notice, however, that the addition of $C_k$ in the calculation of $X_h$ has the same effect as the addition of $2^n C_k$ to $X_{23}$. Moreover, the latter may also be achieved by simultaneously adding $c_2 = |2^n C_k|_{2^n-1} = |C_k|_{2^n-1}$ to $x_2$ mod $2^n-1$ and $c_3 = |2^n C_k|_{2^n+1} = |-C_k|_{2^n+1}$ to $x_3$ mod $2^n+1$ (the even residue $x_1$ is left unmodified because $|2^n C_k|_{2^n} = 0$). In summary, the constant $C_k$ can indeed be added either by the CSA part of the reverse converter or beforehand in the datapath channels mod $2^n \pm 1$. The latter possibility is attractive and feasible, because the addition of these constants may be performed at no cost, as they may be initially loaded as the starting values or merged with other constants in the residue datapath channels.
For Version 1, from Eq. 21 we obtain $c_2 = |C_k|_{2^n-1} = 0$ and $c_3 = |-C_k|_{2^n+1} = 1$. It means that the datapath channel mod $2^n-1$ does not have to be modified at all, whereas the value produced by the datapath channel mod $2^n+1$ only needs to be incremented by $c_3 = 1$. For Version 2, from Eq. 27 we obtain $c_2 = 2^{n-1} - 1$ and $c_3 = 2^{n-1} + 1$. Figure 3 shows the RNS datapath using the 3-moduli set $\{2^n-1, 2^n, 2^n+1\}$ and the reverse converter which realizes Eq. 10, in which $X_h$ is given by
$$X_h = \left| w_1 + w_2 + w_3 \right|_{2^{2n}-1}. \qquad (30)$$
We assume that the initial version of the RNS datapath (prior to shifting out the constants from the reverse converter) implements the set of $r = 3$ equations (7), in which the $j$-th stage of the datapath ($1 \le j \le N$) is composed of three MACs mod $m_i$, $1 \le i \le 3$, each of which realizes Eq. 8, where the initial values are generally assumed $|S_0|_{m_i} = 0$. All MACs (mod $2^n$, $2^n-1$, and $2^n+1$) can be implemented using the design approach presented in [10]. We suggest the following two solutions for adding the constants $c_2$ and $c_3$ in their respective datapath channels mod $2^n-1$ and $2^n+1$.

Solution 1
Suppose that each computation is initialized by activating the Clear signals in the registers $|S_0|_{m_i}$ containing the initial values mod $m_i$, $i \in \{1, 2, 3\}$. Then, the effect of adding mod $2^n-1$ and $2^n+1$ the respective constants $c_2$ and $c_3$ in the residue datapaths mod $m_2 = 2^n-1$ and $m_3 = 2^n+1$ can be obtained without any extra delay and with virtually no hardware cost by activating in these registers the Preset signal, which loads the constants $c_2$ and $c_3$ in the channels mod $2^n-1$ and $2^n+1$.
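The preset values used in Solution 1 can be computed directly from $C_k$. The Python sketch below assumes that the total constants of Versions 1 and 2 are obtained by summing, mod $2^{2n}-1$, the individual constants quoted in the text; the function names are hypothetical and Python 3.8+ is assumed for `pow` with a negative exponent.

```python
def total_constants(n):
    """Total constants C_k of Versions 1 and 2 (assumed: plain sum mod 2^(2n) - 1)."""
    M23 = (1 << (2 * n)) - 1
    inv2 = pow(2, -1, M23)                     # |2^-1| mod 2^(2n) - 1
    inv2n = pow(2, -n, M23)                    # |2^-n| mod 2^(2n) - 1
    version2 = ((inv2n - 1) + inv2 * (1 - (1 << (n + 1)))) % M23
    version1 = (version2 + (1 << n) * inv2) % M23
    return version1, version2

def channel_presets(C_k, n):
    """Preset values c2 = |2^n C_k| mod (2^n - 1) and c3 = |2^n C_k| mod (2^n + 1)."""
    c2 = (C_k << n) % ((1 << n) - 1)
    c3 = (C_k << n) % ((1 << n) + 1)
    return c2, c3

# Presets for the dynamic ranges considered in the paper (3 <= n <= 13).
presets = {n: tuple(channel_presets(C, n) for C in total_constants(n))
           for n in range(3, 14)}
```

Under this assumption the presets come out as $(c_2, c_3) = (0, 1)$ for Version 1 and $(2^{n-1}-1, 2^{n-1}+1)$ for Version 2, in agreement with the values stated in the text.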

Solution 2
Should the above simplest solution be unfeasible for some reason, which seems rather unlikely, the constants $c_2$ and $c_3$ can be added at any other stage of computation in the respective residue datapath channels. We will detail the case of Version 2, because adding mod $2^n-1$ and $2^n+1$ the respective constants $c_2 = 2^{n-1}-1$ and $c_3 = 2^{n-1}+1$ is more difficult than in the case of Version 1, for which always $c_2 = 0$ and $c_3 = 1$, and which will appear later as a special case. First, consider the datapath channel mod $2^n-1$. Conceptually, the simplest MAC mod $2^n-1$ [10] is nothing but the array of n 2-input AND gates followed by the $(n+1)$-operand MOMA mod $2^n-1$. Obviously, no modifications are required for Version 1, because $c_2 = 0$. For Version 2, the addition of $c_2$ requires introducing somewhere in the datapath one CSA stage composed of $n$ half-adders (with carry outputs non-complemented or complemented, depending on whether a given bit of $c_2$ is 0 or 1, respectively). This can be done by modifying an arbitrarily selected MAC of the datapath mod $2^n-1$. Such a modification does not increase the number of CSA stages of the MAC mod $2^n-1$, except for those $n$ for which $\theta(n+2) = \theta(n+1) + 1$ (where $\theta(p)$ denotes the number of CSA stages of a CSA tree that processes $p$ input operands), i.e., only for $n$ = 3, 5, 8, 12, 18, 27, etc. As for the datapath channel mod $2^n+1$, several possibilities exist that do not involve any changes which could increase the area or delay. However, to clarify this issue, we must first recall some details regarding the internal logic structure of a MAC mod $2^n+1$, e.g., the one proposed in [10]. The latter design employs a tree of $n$-bit CSAs with some signals complemented and, in particular, complemented End-Around Carry (EAC) signals.
In a given arithmetic circuit mod $2^n+1$, any complemented signal of weight $2^i$ requires adding mod $2^n+1$ the corrective constant equal to $|-2^i|_{2^n+1}$ [21]. For such a circuit, a cumulative total constant $|C_{total}|_{2^n+1}$ can be calculated and added whenever it is most convenient, i.e., it does not have to be added at each stage of the datapath mod $2^n+1$, unless an intermediate unbiased result is actually needed. In the best case, the cumulative total constant $|C_{total}|_{2^n+1}$ can be added only once, e.g., at the final stage of the datapath channel mod $2^n+1$, when the final result $|S_N|_{2^n+1} = x_3$ is produced, which is one of the three input signals of the reverse converter considered here (recall also that $x_3$ assumes only the valid values mod $2^n+1$, i.e., $0 \le x_3 \le 2^n$). Obviously, in all cases when $|C_{total}|_{2^n+1} \ne 0$, the addition of $c_3$ can be done at no cost by replacing the addition of $|C_{total}|_{2^n+1}$ with the addition of $|C_{total} + c_3|_{2^n+1}$. In the remaining special case of $|C_{total}|_{2^n+1} = 0$, the adder mod $2^n+1$ used at the final stage of the datapath channel mod $2^n+1$, which adds the pair of $n$-bit vectors in the carry-save form $C^* = (c^*_{n-2} \ldots c^*_0 \, c^*_{n-1})$ and $S^* = (s^*_{n-1} \ldots s^*_0)$ to compute $x_3 = |C^* + S^*|_{2^n+1}$, must be modified, as it should now produce $|C^* + S^* + c_3|_{2^n+1}$. The most obvious general implementation of the latter involves introducing one $n$-bit CSA with complemented EAC (the latter amounts to subtracting 1, so that the CSA actually computes $|C^* + S^* + c_3 - 1|_{2^n+1}$) followed by the adder mod $2^n+1$. Such a modification involves the extra delay of one half-adder and the area cost of $n$ extra half-adders (which still halves the total of $2n$ half-adders that would be used, should this constant be added within the reverse converter).
Finally, the special case of |C total + c 3 | 2 n +1 = 1 (which is also the case of Version 1) involves no extra cost at all, because the addition |C * + S * + 1| 2 n +1 can be implemented using the n-bit adder with inverted EAC of [37].
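The arithmetic behind the corrective constants can be illustrated numerically. The sketch below is a behavioural model only, not the gate-level designs of [10], [21] or [37]; the function names are introduced here for illustration. It checks that a bit b of weight 2 n+i may be replaced mod 2 n + 1 by its complement at weight 2 i plus the constant | − 2 i | 2 n +1 , and that adding two n-bit vectors with an inverted end-around carry yields their sum plus one mod 2 n + 1.

```python
def complemented_bit_ok(n: int, i: int, b: int) -> bool:
    """Check b*2^(n+i) == (1-b)*2^i + |-2^i| (mod 2^n + 1)."""
    m = (1 << n) + 1
    original = (b << (n + i)) % m            # bit b at weight 2^(n+i)
    corrective = (-(1 << i)) % m             # corrective constant |-2^i| mod m
    folded = (((1 - b) << i) + corrective) % m
    return original == folded


def add_plus_one_inverted_eac(a: int, b: int, n: int) -> int:
    """Behavioural model: add two n-bit vectors and feed the inverted carry-out back at weight 1."""
    total = a + b
    carry = total >> n                       # carry-out of the n-bit addition (0 or 1)
    low = total & ((1 << n) - 1)             # n least significant bits
    return low + (1 - carry)                 # equals (a + b + 1) mod (2^n + 1)


n = 7
assert all(complemented_bit_ok(n, i, b) for i in range(n) for b in (0, 1))
assert all(
    add_plus_one_inverted_eac(a, b, n) == (a + b + 1) % ((1 << n) + 1)
    for a in range(1 << n)
    for b in range(1 << n)
)
```

The second check corresponds to the special case |C total + c 3 | 2 n +1 = 1, in which the required increment comes from the inverted EAC itself.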

Example
Consider an RNS-based implementation of the 16-tap FIR filter with 8-bit input operands and 8-bit coefficients. The required dynamic range of 20 bits is guaranteed by the 3-moduli set with n = 7, {m 1 , m 2 , m 3 } = {128, 127, 129}. Eq. 1 is implemented using three residue datapaths mod m i , 1 ≤ i ≤ 3, each of which realizes the function where the initial value of |S 0 | m i is usually set to 0. The three basic MAC units mod 128, 127, and 129, required to build these residue datapaths, can be implemented using the design approach presented in [10], although some slightly improved versions proposed in [25] can be used as well. The general scheme is as shown in Fig. 3, assuming that the following constants are added inside the residue datapath channels: c 2 = 0 and c 3 = 1 for Version 1, and c 2 = 63 and c 3 = 65 for Version 2. Should it be feasible for the architecture actually used, at the beginning of computations the residue datapath registers containing the initial values |S 0 | 2 n −1 and |S 0 | 2 n +1 are loaded with the suitable constants required by the reverse converter rather than with 0's, i.e., |S 0 | 2 n −1 = c 2 and |S 0 | 2 n +1 = |C total + c 3 | 2 n +1 . Should loading of the initial non-zero constants c 2 and c 3 be unfeasible, some of the alternative methods proposed in the previous subsection could be used. Further, in this example, we disregard the actual value of |C total | 2 n +1 (the only other constant to be added, which can be calculated according to the internal structure of the MAC mod 2 n + 1), take into account c 2 and c 3 , and assume that the result of the calculations from the residue datapaths From Eq. 30 we obtain X h = |w 1 + w 2 + w 3 | 2 2n −1 = |16128 + 129 + 8509| 16383 = 8383, which substitutes X h in Eq. 10 to provide the final value of X given by X = From Eq. 30 we obtain the same value of X h = |w 1 + w 2 + w 3 | 2 2n −1 = |16192 + 12384 + 12573| 16383 = 8383.
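The two modular sums quoted above can be verified directly. The snippet below only re-evaluates the numbers given in the text for n = 7; it does not reconstruct the operands w 1 , w 2 , w 3 from the residues, and the variable names are chosen here for illustration.

```python
n = 7
modulus = (1 << (2 * n)) - 1              # 2^(2n) - 1 = 16383

# The two sets of operands quoted in the example.
first = (16128 + 129 + 8509) % modulus
second = (16192 + 12384 + 12573) % modulus

print(first, second)                       # 8383 8383
```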
In summary, we have shown the feasibility of avoiding the cost of one additional CSA stage in our new converters, by assuming that the appropriate constants are already added to the input residues x 2 and x 3 of the converter.
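The effect of preloading the accumulator registers can be illustrated with a behavioural simulation. This is a sketch under stated assumptions: the function mac_channel is a plain software model of a MAC datapath rather than the circuits of [10] or [25], the input data are random, and |C total | 2 n +1 is disregarded, as in the example above.

```python
import random


def mac_channel(xs, cs, m, s0=0):
    """Accumulate sum(c * x) mod m, starting from the initial register value s0."""
    s = s0 % m
    for x, c in zip(xs, cs):
        s = (s + (c % m) * (x % m)) % m
    return s


n = 7
m2, m3 = (1 << n) - 1, (1 << n) + 1               # 127 and 129
c2, c3 = (1 << (n - 1)) - 1, (1 << (n - 1)) + 1   # Version 2 constants: 63 and 65

rng = random.Random(0)                             # fixed seed, illustrative data only
xs = [rng.randrange(256) for _ in range(16)]       # 16 taps, 8-bit input operands
cs = [rng.randrange(256) for _ in range(16)]       # 8-bit coefficients

# The largest possible inner product fits in the 20-bit range, which in turn
# lies below the product of the three moduli.
assert 16 * 255 * 255 < (1 << 20) < 128 * m2 * m3

for m, c in ((m2, c2), (m3, c3)):
    plain = mac_channel(xs, cs, m)                 # register initialised with 0
    preloaded = mac_channel(xs, cs, m, s0=c)       # register initialised with the constant
    assert preloaded == (plain + c) % m
```

Because the constant enters through the initial register value, the datapath itself is left unchanged in this model.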

Constant Removal From the Reverse Converters for 4- and 5-moduli sets
The same rules of shifting out constants as presented above can be directly applied to other reverse converters for the special multi-moduli sets constructed by extending the classic 3-moduli set {2 n − 1, 2 n , 2 n + 1}, like those of [1,6,20], in which the reverse converter for the 3-moduli set {2 n − 1, 2 n , 2 n + 1} can be used as a subcircuit. For instance, it can be done by first dividing the moduli set into at least two subsets, one of which is the 3-moduli set {2 n − 1, 2 n , 2 n + 1}. The residues corresponding to this moduli set are transformed using the reverse converter to obtain one number, which is an intermediate result for this moduli set. (In particular, it can be the converter proposed above, provided that the necessary constants are added in the residue datapath channels mod 2 n ± 1.) Next, the two-moduli MRC from Eq. 5 is applied once or twice to complete the reverse converter for the 4- or 5-moduli set, respectively. The application of the MRC in this step is nothing but the modulo generation from the number obtained in the preceding step, followed by the modular subtraction and multiplication by the multiplicative inverse. Depending on a specific implementation of this step, a number of constants can appear. As this part of the converter is a kind of residue datapath modulo one of the two last moduli, the constants involved in these steps (or one last step, in the case of the 4-moduli set) can then be shifted out to the residue datapath channels corresponding to these moduli.
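The MRC step described above can be sketched in software as follows. This is an illustration only: it assumes the standard two-moduli mixed-radix form X = X' + M' · |(x − X') · M'^(−1)| m (Eq. 5 is not reproduced in this section), the function name mrc_extend is hypothetical, and the fourth modulus 2 2n + 1 is chosen here merely as an example of a modulus coprime to the 3-moduli set.

```python
def mrc_extend(x_sub: int, x_new: int, m_sub: int, m_new: int) -> int:
    """Combine X mod m_sub and X mod m_new into X mod (m_sub * m_new)."""
    inv = pow(m_sub, -1, m_new)                  # multiplicative inverse of m_sub mod m_new
    # Modulo generation from the intermediate result, then modular subtraction.
    diff = (x_new - x_sub % m_new) % m_new
    # Multiplication by the inverse gives the mixed-radix digit.
    return x_sub + m_sub * ((diff * inv) % m_new)


n = 3
m_sub = ((1 << n) - 1) * (1 << n) * ((1 << n) + 1)   # 7 * 8 * 9 = 504
m_new = (1 << (2 * n)) + 1                            # hypothetical fourth modulus: 65

for X in range(0, m_sub * m_new, 997):
    x_sub = X % m_sub        # stands for the output of the 3-moduli reverse converter
    x_new = X % m_new        # residue in the additional channel
    assert mrc_extend(x_sub, x_new, m_sub, m_new) == X
```

Any constant that appears in the subtraction or multiplication step is a value mod m_new, which is why it can be moved to the datapath channel of that modulus.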

Complexity Evaluation
In this section, we will evaluate the gate-level complexity of a number of converters for the 3-moduli set {2 n , 2 n − 1, 2 n + 1}, including: Versions 1 and 2 of our converters, three specific designs which have been considered in the literature to be the most efficient [12,34,35], and two converters for the general 3-moduli set {2 k , 2 n − 1, 2 n + 1} designed for the special case of k = n [8,36]. (Note that, because the expressions used to implement all the converters of [12,34,35] contain no explicit constants, our technique of shifting the constants to the datapath channels can be applied only to the converters proposed here.) Here, we count logic gates, based on a number of basic primitives used in the architectural design detailed in the previous section, and then we estimate the circuit delay as the number of logic primitives present on the critical path. In the next section, we will evaluate the results of logic synthesis of all converters using commercial design tools, which provide more accurate estimations not only of the area and delay but also of the power consumption.

Hardware Complexity Evaluation
In Table 1, which summarizes the gate-level hardware complexity, we can distinguish three groups of components. The first one consists of bit-level manipulation components like primitive gates, inverters and MUXes (which we also treat as primitive gates). As in every converter mentioned there is a 3-operand [12,35] or a 4-operand [8,34,36] addition mod 2 2n − 1, the second group consists of full-adders (FAs) and half-adders (HAs) which form 2n-bit CSA stages producing the carry-save vector pair. The third group is the final adder mod 2 2n − 1 reducing the carry-save pair into the final result. The differences between converters are revealed in the first two groups of components, because the same final adder mod 2 2n − 1 occurs in all converters. It is seen that Version 1 of our converter is comparable with the converter from [35] (as it uses similar design assumptions), while in the other converters (notably in Version 2 of our converter) the number of two-input gates is fixed and does not depend on n. A real advantage of our new design (resulting from removal of the constants) may be seen in the total number of full- and half-adders. While in all converters there are at least 2n full-adders accompanied in [8,34,36] by about 2n half-adders, in both versions of our design the number of full- and half-adders is about n.
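The role of the second and third groups of components can be illustrated with a behavioural model of the multi-operand addition mod 2 2n − 1. This is a sketch only, with hypothetical function names; it operates on whole words rather than on individual FA and HA cells and does not reproduce the gate counts of Table 1.

```python
def csa_eac(a: int, b: int, c: int, width: int):
    """One CSA stage mod 2^width - 1: three operands in, (sum, carry) vectors out."""
    mask = (1 << width) - 1
    s = a ^ b ^ c                                   # bitwise full-adder sum outputs
    cy = (a & b) | (a & c) | (b & c)                # bitwise full-adder carry outputs
    # Carries move one position left; the top carry wraps to bit 0 (end-around carry).
    cy = ((cy << 1) & mask) | (cy >> (width - 1))
    return s, cy


def final_add(a: int, b: int, width: int) -> int:
    """Final adder mod 2^width - 1 reducing the carry-save pair (behavioural model)."""
    return (a + b) % ((1 << width) - 1)


def add_operands(operands, width):
    """Reduce the operands with chained CSA stages, then apply the final adder."""
    ops = list(operands)
    csa_count = 0
    while len(ops) > 2:
        s, cy = csa_eac(ops[0], ops[1], ops[2], width)
        ops = [s, cy] + ops[3:]
        csa_count += 1
    return final_add(ops[0], ops[1], width), csa_count


width = 8                                           # 2n bits for n = 4
print(add_operands([200, 100, 55], width))          # (100, 1)
print(add_operands([200, 100, 55, 17], width))      # (117, 2)
```

In this model, three operands need one CSA stage and four operands need two, so removing one operand (such as a constant) removes one stage from the reduction.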
In summary, the above estimations suggest that our converters should occupy a slightly smaller area, owing to the smaller number of full- and half-adders. Table 2 provides the estimations of the delay of all converters considered. Each column shows the data of one converter, with delay components appearing in consecutive rows according to the order of calculations. Besides all the logic primitives included in the analysis of the hardware complexity, we have also included some fanout characteristics (expressed by the number of inputs driven by one signal), for the following reasons. As gates have limited fanout, should it be exceeded, the logic synthesis tool either builds a buffer tree which drives a larger number of gates or uses slower gates with higher output fanout. This may result in a synthesized circuit which is actually slower than a direct delay analysis of all components present on the critical path would indicate (indeed, most estimations presented in the open literature count only the number of gate levels, regardless of fanout). Note, however, that in all 3-moduli converters considered here, the only signals with high fanouts are the primary inputs to the converter. To expose the high fanout issue, we have found the maximal fanouts on the converter input (which we understood simply as the number of occurrences of some literals x i,j (1 ≤ i ≤ 3) in bit-level manipulation expressions) and included them in the first row of Table 2 as d f (a), where a is the maximal number of gate inputs driven by the primary input.

Gate-Level Delay Evaluation
Before comparing delays, notice that any significant delay reduction seems unfeasible, because the most significant part of the overall delay is contributed by the final adder mod 2 2n − 1, whose use could not be avoided in any design proposed to date. It is seen that, besides fanout-related delay, any differences between the delays of various converters stem from the number of CSA stages on the critical path. The first (faster) group of designs includes the two versions of our converters and the converters from [12,35], which have only one FA stage, whereas the second (slower) group with two FA stages includes the converters from [8,34,36]. In the faster group, the delay of Version 2 is comparable with the delay of the converter from [12], which is due to small fanout and the use of similar gates for bit-level manipulations. The second best are Version 1 and the converter from [35], with the fanout equal to n in either case. In the slower group, the fastest seems to be the converter from [36], while the slowest is the converter from [34], which has a large fanout depending on n and uses an XOR gate for its bit-level manipulations.

Synthesis Assumptions
To obtain complexity figures that are as realistic as possible, we have performed logic synthesis of both versions of our converter as well as their best known counterparts from [8,12,34,35,36]. In an attempt to produce systematic and fairly comparable descriptions, the hardware description of all converters was done in parametrized structural Verilog, following identical coding guidelines and a similar module/sub-module layout, i.e., we separated bit-level manipulations, CSAs, and adders mod 2 2n − 1 (designed according to [17]), and used the hierarchy-preserving feature of the logic synthesis tool. We have performed logic synthesis of all converters for a range of delay targets using Cadence RC Compiler (v. 8.10-s222 1) over the commercial CMOS065LP 65 nm low-power library from ST Microelectronics (CORE/ CLOCK65LPSVT, v. 5.2.2). As is customary, to obtain more realistic delay and power estimations, we have added input and output registers to all designs. For each design, the minimum delay was found, which we understood as the smallest delay target for which the logic synthesis tool still reported a non-negative timing slack. Next, we have performed physical synthesis on a rectangular die whose size was selected to obtain a density of about 75 % (we have achieved values ranging from 70 to 80 %). Physical synthesis was performed using Cadence Encounter (v. 10.12-s181 1) and NanoRoute (v. 10.12-s010) tools. We have used the same scripts for the synthesis of all converters, without any specific changes or optimizations.
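The minimum-delay criterion stated above can be expressed as a simple sweep over delay targets. The sketch below is purely illustrative: minimum_delay and toy_slack are hypothetical names, and toy_slack is a stand-in for an actual synthesis run, whose tool commands and reports are not modelled here.

```python
def minimum_delay(targets_ns, slack_ns):
    """Smallest delay target for which the reported timing slack is non-negative."""
    feasible = [t for t in sorted(targets_ns) if slack_ns(t) >= 0]
    return feasible[0] if feasible else None


def toy_slack(target_ns, achievable_ns=1.23):
    """Toy stand-in for a synthesis run: pretend the circuit cannot go below 1.23 ns."""
    return target_ns - achievable_ns


targets = [round(1.0 + 0.05 * k, 2) for k in range(11)]   # 1.00, 1.05, ..., 1.50 ns
print(minimum_delay(targets, toy_slack))                   # 1.25
```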
The results of the place and route for all dynamic ranges (DR) from 8 to 38 bits, namely the power consumption and area at the minimal delay, are visualized in Figs. 5 and 6, respectively, and the minimal delay itself is shown in Fig. 4. It can also be observed that the place and route timing results in general follow the estimations given by the logic synthesis tool alone, which seems to be the result of a relatively simple design size wherein the impact of placement and/or routing on the final timing is limited. Figure 4 reveals the logarithmic growth of the minimal delay with the dynamic range for all converters considered, which is the direct outcome of the logarithmic depth of the final parallel prefix adder [17] (which is the main delay contributor). It is also seen that delay differences remain fairly constant for all converters. The synthesis results confirm that our converters are the fastest for all dynamic ranges considered, which could be attributed to a smaller number of potentially critical paths in the final addition (our converters involve about two and a half operands compared to three complete operands in those of [12,35]), which in turn enabled the logic synthesis tool to use larger and faster gates (because they are fewer) having smaller overall impact on the final power and area.
Figure 4 Minimum delay as a function of the dynamic range. (Note: For DR = 8 and 11, the delay of the converters of [12,35] is the same as for new Version 2.)

Delay Evaluation
While the delays of both Versions 1 and 2 of our converters are largely the same for most of the dynamic ranges larger than 8, a noticeable exception are the dynamic ranges of 35 and 38 bits: a likely explanation of the larger delay of Version 1 compared to Version 2 is the larger fanout in the former, resulting in a deeper buffer tree (or slower gates driving a large fanout). The special cases of the converters of [8,36] involve the addition of four operands, resulting in an additional CSA stage contributing about 100 ps of extra delay. The slowest converter amongst ASIC-specific designs is the converter from [34], in which a further 50 ps is added by one XOR gate (see the fifth column of Table 2). Figure 5 shows the power consumption of all designs considered. The power consumption presented was obtained as the sum of dynamic and leakage power from a simulation with 1000 random vectors, using PrimeTime PX v2012.1H on the netlists provided by the physical synthesis tool. While the minimal delay obtained was relatively easy to explain, as it was strictly related to the number of logic levels and hence it was easily traceable by evaluating the critical paths, the power consumption is the result of the simulation and internal estimations performed by the power simulation tool. Due to this, the power consumption estimations contain more nonlinearities and their contributing factors are more difficult to explain exclusively on the basis of the internal architecture.
Figure 5 Power at minimal delay.

Power Consumption Evaluation
In general, our converters have smaller power consumption (accompanied by smaller minimal delay) than all their counterparts. The power consumption of Version 1 is, on average, smaller than that of Version 2. The larger power consumption of the converters from [8] compared to those from [36], which both have a very similar design and introduce nearly the same minimal delay, requires additional consideration. A likely reason is the presence of the additional OR gate on the critical path in [8] (compared to uniformly distributed NOT gates in [36]) that puts additional strain on all paths originating from this OR gate, as the logic synthesis tool must scale up all gates on these paths, which results in higher power consumption. Finally, it is noticeable that the converters from [34] enjoy relatively small power consumption which, unfortunately, is accompanied by the largest delay. Figure 6 shows the area of all synthesized converters, obtained as the sum of the reported areas occupied by logic cells and the estimated area of interconnections. As these figures strictly depend on the numbers of cells used, they are more exact and consistent than the power estimations.

Area Evaluation
The area occupied by all presented converters in general follows a slightly faster than linear trend resulting from the composition of O(n log n) complexity of the final adder and linear complexities of other components (cf. Table 1), so no significant differences are observed between various converters. A slight vertical shift from the linear trend of the area observed for nearly all converters between 23 and 26 bits results from the depth of the parallel prefix final adder [17] (as n changes from 8 to 9).
Figure 6 Area at minimal delay.
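The link between the dynamic range and the depth of the final adder can be tabulated as follows. This is a sketch under one assumption: the depth of the 2n-bit parallel prefix tree is taken as the textbook value ⌈log 2 (2n)⌉, whereas the adder of [17] may differ from it by a constant number of levels. The function name is introduced here for illustration.

```python
def range_and_depth(n: int):
    """Return (dynamic range in bits, assumed prefix-tree depth) for a given n."""
    M = ((1 << n) - 1) * (1 << n) * ((1 << n) + 1)   # product of the three moduli
    dr_bits = M.bit_length() - 1                      # whole bits covered by [0, M)
    prefix_levels = (2 * n - 1).bit_length()          # ceil(log2(2n)) for a 2n-bit adder
    return dr_bits, prefix_levels


for n in range(3, 14):
    print(n, *range_and_depth(n))
```

Under this assumption, the dynamic ranges run from 8 bits (n = 3) to 38 bits (n = 13), and the depth grows from 4 to 5 levels exactly between 23 bits (n = 8) and 26 bits (n = 9).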
In general, the area of our converters is smaller than that of all their counterparts for all dynamic ranges considered. We can also observe that delay, power, and area are somewhat correlated in our converters. For instance, for the dynamic ranges of 14, 23, and 29 bits, we observe a higher delay of Version 2 accompanied by nearly the same power consumption and compensated by a smaller area (smaller gates consume more power despite higher delay). Also, Version 2 of our converter occupies a larger area for larger dynamic ranges (from 32 bits), which results from the smaller delay achieved (despite the fact that Version 1 uses a number of OR gates not used in Version 2). It seems that the larger area of the converters from [12,35] results from using 2n full-adders (Table 1 reveals that in our converters, there are roughly n full-adders and n half-adders). The area of the converters from [8,34,36] remains close to that of all other converters but, as was observed before, it is accompanied by higher delay (thus the relatively small area results from the use of smaller and consequently slower gates). Table 3 summarizes the improvement, in percentage terms, of our converters with respect to their best existing counterparts of [12,35] as a function of the dynamic range: for a given parameter, the best of our two versions and the best of [12,35] are compared. The delay of the new converters is either the same as in the previous designs or it is only slightly reduced, by up to 2.34 %. This is not surprising, because in all converters considered the final adder mod 2 2n − 1 (preceded by at least two CSA stages) is the main delay contributor. Consequently, replacing one CSA stage with one stage of logic gates performing simple bit manipulations can only result in a relatively small delay reduction (0.9 % on average).
Nevertheless, the area is reduced by about 5 % to about 20 % (on average by 11.69 %), and this is accompanied by even larger savings in the power consumption, from over 10 % up to 27 % (on average by about 17.79 %). Consequently, due to the negligible delay reduction (if any), the reductions obtained for the area-delay and power-delay products are highly correlated with those observed for the area and power consumption: they range for the area-delay product from over 6 % up to over 18.5 % (on average by 11.69 %) and for the power-delay product from about 13 % up to about 26 % (on average by 18.2 %).
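The reason why the product figures track the area and power figures can be seen from how fractional reductions combine. The helper below is a sketch with illustrative inputs only; it does not reproduce the per-range data of Table 3, and the function name is hypothetical.

```python
def product_reduction(metric_reduction: float, delay_reduction: float) -> float:
    """Fractional reduction of (metric * delay), given the two separate reductions."""
    return 1.0 - (1.0 - metric_reduction) * (1.0 - delay_reduction)


# With no delay reduction, the product improves exactly as much as the metric itself.
assert abs(product_reduction(0.20, 0.0) - 0.20) < 1e-12

# A small delay reduction adds only marginally to the product improvement.
print(round(product_reduction(0.10, 0.02), 3))   # 0.118
```

Since the delay reductions reported here are at most a few percent, the second factor stays close to 1 and the product reductions remain close to the area and power reductions.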

Conclusion
In this paper, a new general approach to improving the characteristics of reverse (residue-to-binary) converters for a Residue Number System (RNS) composed of arbitrary moduli sets was suggested. The main idea is to formulate the basic equation of the reverse converter in a form consisting of two separate parts: one depending on the input variables of the reverse converter, whereas the other is a single constant. Such a separation makes it possible to reduce the number of operands added inside the reverse converter by one, because the constant, instead of being added inside the reverse converter, can be shifted out to the residue datapath channels. We have argued that adding the constant in the residue datapath channels can in most cases be done at no hardware cost or extra delay, so that applying this design approach can lead to overall power and area reduction and in some cases also to decreased delay. Some suggestions which facilitate obtaining the reverse converter equation with separated constants were also given. To illustrate various design issues of this new design approach and to prove its efficiency, a new design method of the residue-to-binary (reverse) converters for the popular classic 3-moduli set {2 n − 1, 2 n , 2 n + 1} was considered. Investigations of different bit manipulations of the converter's input operands resulted in its two new versions. Unlike in any of the previous designs, the new sets of equations contain separated constants, which can be shifted out from the converter to the datapath channels at no cost, thus reducing the cost of the tree of carry-save adders (CSAs) of the converter. Experimental results suggest that, compared to all of the state-of-the-art converters for this 3-moduli set, the converters obtained using the newly proposed approach are superior with respect to the area and power consumption, which are reduced on average by about 12 % and 18 %, respectively, while the delay is the same or slightly smaller.
As several larger multi-moduli RNSs have been proposed by extending the set {2 n − 1, 2 n , 2 n + 1}, for which reverse converters have been constructed using the converter for the classic 3-moduli set as a basic building block, all of the latter converters (including those that will be proposed in the future) could also enjoy better performance, once their basic building block is improved. Future research will include considering the possibility of formulating equations of reverse converters for other moduli sets in which the variable and constant parts could be separated, which would allow the constants to be added within the residue datapath channels and would likely result in more efficient converters.