Comprehensive study of 1-Bit full adder cells: review, performance comparison and scalability analysis

Full Adder (FA) circuits are integral components in the design of Arithmetic Logic Units (ALUs) of modern computing systems. Recently, there have been massive research interests in this area due to the growing need for low-power and high-performance computing systems. Researchers have proposed a variety of FA cells with diverse design techniques, each having its pros and cons. As a result, a systematic method for performance comparison of FA cells using a common simulation platform has become necessary. In this work, we present an extensive study of FA cells. We have compared the performance of thirty-three (33) existing 1-bit FA cells. The drive powers of these FA cells have been compared by applying a variety of load conditions. In addition, the 1-bit FA cells have been extended to 32-bit structures to test their scalability and to investigate their performance in wide-word structures. We have determined that twenty-one (21) of the thirty-three (33) FA cells cannot operate in a 32-bit structure, even though some of them exhibit excellent performance as a 1-bit cell. The main finding of this research is that the single-bit performance parameters of FA cells should not be considered as the main basis for performance comparison. Any FA cell should be analyzed in a multi-bit structure to determine its practical effectiveness. Hybrid full adders offer better performance than single logic full adders. Many existing full adder cells are not scalable. Conventional Mirror CMOS full adder offers better performance than many recent full adders in wide adder structure.


Introduction
Due to the massive use of battery-powered, portable electronic gadgets, VLSI circuits that operate at high speed and consume less power have become crucial [1][2][3]. Full Adders (FAs) form a vital component in the VLSI system design of advanced microchips. FAs are essential for the implementation of certain mathematical operations such as magnitude comparison [4], multiplication [5,6], subtraction [7], etc. In most cases, the adder falls within the critical path of these operations, which governs the overall performance of the system [8]. Moreover, the implementation of a wide adder tree requires full adder cells [9][10][11]. Due to their extensive use and crucial role in various operations, a multitude of FA cells have been implemented, each having its advantages and disadvantages.
Since numerous FA topologies have been proposed, especially in recent times, it is necessary to evaluate their performance metrics using a common platform to enable VLSI designers to pick the FA topology that best suits their system requirements [12]. In many recent works, comparative analyses of FA designs have been discussed. For example, Prasad et al. [13] and Wariya et al. [14] compared XOR-XNOR-based FA circuits. Comparative studies by Singh et al. [15] and Harish et al. [16] explored FAs that are implemented using various logic styles. However, the investigation was only conducted for four FAs in [15] and five FAs in [16]. The FA comparison in [17] covered 7 cells. The study conducted in [18] examined the impact of voltage variation on FA cells. Research conducted in [19] analyzed the performance of FAs in tree-structured arithmetic units. In [20], only 4 FA cells have been analyzed and compared. However, these studies are not up-to-date as they do not include the FA designs developed in the past 10 years. In [21], an extensive investigation among various FA cells has been conducted for the 180 nm CMOS process node, which is rarely applied to modern-day circuits. The FA comparison in [22] contains simulation results for only 14 FA cells, which may not be enough for a comprehensive study.
To have a complete overview of FA cells, recent contributions need to be considered and performance comparison should not be limited to 1-bit cell. Therefore, FAs should be analyzed in multiple-bit structures. Moreover, the drive power of VLSI circuits is an important parameter. However, comparative analysis of FA drive power is missing in the existing literature.
In this work, we report an extensive analysis of 33 existing FA cell designs utilizing Cadence tools. The benefits and drawbacks of each FA design have been thoroughly discussed and summarized to allow VLSI designers to select the desired FA for circuit implementation.
The remainder of this paper is organized as follows. In Sect. 2, a comprehensive review of FA cells is provided. Section 3 provides information on circuit simulation parameters, transistor sizing, and the simulation testbench. In Sect. 4, a comprehensive comparison of FA cells is conducted based on the simulated results. Section 5 provides the major findings of this research. Finally, concluding statements are provided in Sect. 6.

Single logic full adders
The early age of CMOS VLSI design highly relied on Complementary Pass Logic (CPL) where n-channel MOS (NMOS) transistors were utilized for logic interpretation [24]. This logic technique is proficient in terms of logic swing. However, due to the utilization of only NMOS transistors, the design technique can only provide strong logic 0. In the case of providing logic 1, the output voltage becomes Vdd - Vt (here, Vdd = supply voltage, Vt = threshold voltage of NMOS). Therefore, CPL is unable to provide strong logic 1. The FA employing CPL utilizes 32 NMOS transistors for logic interpretation [25]. In addition to providing weak logic 1, the high transistor count (TC) of the CPL FA causes high power dissipation, which is responsible for creating hot spots in the IC [26]. Another FA employing CPL logic presented in [27] requires only 12 transistors (addressed as 12-T FA in this article). Although low TC reduces power dissipation and area requirements in the IC, voltage degradation remains the key concern. Due to voltage degradation, CPL has been supplemented by Complementary CMOS (CCMOS) logic which is widely used in modern ICs [28]. In addition to providing strong logic 0 and 1, the CCMOS logic family is highly robust against voltage scaling [29]. Moreover, due
Hybrid logic-based FA circuits have become popular because they leverage the benefits of various logic designs within the same circuit [31]. Transmission gate (TG) based logic implementation solves the issue of voltage degradation of CPL logic by adding swing restoring PMOS transistors [32]. Transmission Gate FA (TGA) in [33] and Transmission Function FA (TFA) in [34] employ TGs for FA logic interpretation. Although the issue of voltage degradation is solved, poor drive power is the major issue associated with these FA designs [33,34].
FAs employing 10 transistors (10-T) [35], 16 transistors (16-T) [36], 14 transistors (14-T and New 14-T) [37], 18 transistors (18-T) [38], and 26 transistors in [39] utilize hybrid logic style designs, unlike TGA and TFA FAs. The 24-T FA employs a 3-input XOR gate to compute Sum. Its carry-out bit calculation is the same as that of the CCMOS-based FA in [30]. In the 14-T FA [36], a hybrid XOR gate forms the core of the design since the output from the XOR gate is used for computing both sum and carry-out signals.
Two more hybrid FA cells named Hybrid Pass Static CMOS (HPSC) and Novel HPSC (NHPSC) are presented in [40,41]. HPSC uses Pass Transistor (PT) logic to generate the XOR-XNOR functions, which serve as internal nodes. The output side employs CCMOS logic to provide the circuit with the ample drive power required in high fan-out cases.
Nowadays, the Gate Diffusion Input (GDI) method of implementing logic functions has become quite popular for implementing low power circuits [55][56][57][58]. The GDI method was first introduced in [59] and later became a popular method for VLSI circuit design [60]. Logic implementation using the GDI technique is illustrated in [61], where basic logic gates using the GDI technique are presented. The major issue with GDI method-based circuits is voltage degradation, which reduces drive capability significantly [62]. Several FAs employing the GDI technique have been developed for low-power applications, and they require less surface area due to low TC [63]. The GDI FA in [64] suffers from low drive power due to the threshold voltage drop in GDI logic gates. However, low TC and low power dissipation make such FAs suitable for low-power applications. To provide full swing output in GDI gates, modified GDI gate-based FA designs have been implemented in [65,66].

Circuit simulation in Cadence
To evaluate the performance metrics of various designs of FA cells, circuits are required to be simulated in a common simulation environment to ensure a fair comparison. Therefore, circuit simulation parameters need to be fixed and a proper transistor sizing technique needs to be applied for all FA cells. These are discussed in the following sub-sections.

Circuit simulation parameters
To simulate the FA circuits and investigate their performance, a 45 nm CMOS process has been utilized. The supply voltage has been set to 1.0 V. Average power, propagation delay, and Power Delay Product (PDP) are the performance metrics used to compare the effectiveness of the FA cells. The input waveform for power and delay calculation is presented in Fig. 1, where it can be seen that all possible input combinations from 000 to 111 are present in the waveform. In VLSI circuits, power and delay vary for different input combinations since the pull-up and pull-down transistor paths differ between input combinations. Therefore, to determine the average power dissipation of an FA cell, all possible input combinations are applied to the testbench and the total power consumption due to each input combination is calculated. The average of these values over all input patterns is then taken as the average power. Propagation delay is measured at 50% of the input-output signal swing for the critical path (worst-case delay path). For delay, all input combinations from 000 to 111 are generated separately and the delay due to each input combination is calculated individually. Only the maximum delay is then taken as the propagation delay of the circuit. PDP is simply the product of average power and propagation delay.
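The metric definitions above can be illustrated with a minimal sketch. It assumes that per-pattern power and delay values have already been exported from the simulator; the function name `summarize_fa` and all numeric values are hypothetical and are not taken from the simulations reported here.

```python
from statistics import mean


def summarize_fa(power_per_pattern_w, delay_per_pattern_s):
    """Return (average power, propagation delay, PDP) for one FA cell.

    Both arguments map an input pattern such as "101" (A, B, Cin) to the
    value measured for that pattern.
    """
    avg_power = mean(power_per_pattern_w.values())   # average over all patterns
    prop_delay = max(delay_per_pattern_s.values())   # worst-case (critical path)
    pdp = avg_power * prop_delay                     # power-delay product
    return avg_power, prop_delay, pdp


# All eight input combinations from 000 to 111.
patterns = [format(i, "03b") for i in range(8)]

# Hypothetical measurements, for illustration only (watts and seconds).
power = {p: (1.0 + 0.1 * i) * 1e-6 for i, p in enumerate(patterns)}
delay = {p: (50 + 5 * i) * 1e-12 for i, p in enumerate(patterns)}

avg_power, prop_delay, pdp = summarize_fa(power, delay)
print(f"P_avg = {avg_power:.3e} W, t_pd = {prop_delay:.3e} s, PDP = {pdp:.3e} J")
```

Taking the maximum rather than the mean for delay reflects the worst-case definition used in the text, whereas power is averaged over all patterns.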

Transistor sizing
In the case of VLSI design, optimal implementation of circuits plays a crucial role [67,68]. In general, transistor sizing refers to increasing or decreasing the width of transistors to optimize the performance parameters of circuits. Due to its effectiveness in optimizing the performance of VLSI circuits, transistor sizing should be handled in a proper manner [69]. Transistor sizing for circuits comprising a small number of transistors can be done manually. However, modern-day ICs comprise millions of transistors, for which it becomes impossible to optimize transistor sizes manually. Therefore, automation in design optimization becomes inevitable to cope with the high integration density and complexity of modern IC designs.
The transistor sizing method in [70] presents a linear method of performing a trade-off between CMOS circuit parameters: power, delay, and area. However, modern-day VLSI circuits behave in a non-linear manner, for which this algorithm is unable to yield optimal performance. Transistor sizing methods in [19,33,71] present a simple but effective way of determining transistor sizes for delay optimization. However, only the critical path is considered in these methods, so the power consumption of circuits is not optimized. Nowadays, Power-Delay Product (PDP), which is simply the product of power consumption and delay of a circuit, has become the vital parameter and transistors have been sized for obtaining minimum PDP [35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][64][65][66]. Particle Swarm Optimization (PSO) has become a popular method for optimization of VLSI design [72,73]. PSO-based inverter circuit optimization is presented in [74]. However, the algorithm is not tested for circuits having a large number of transistors and hybrid logic styles. Another PSO-based transistor sizing method is presented in [75], but the algorithm is only tested for CCMOS logic-based designs. Simple Exact Algorithm (SEA) based transistor sizing presented in [76] has been specially designed for arithmetic circuits, taking into account various hybrid logic design methodologies. The authors have tested the algorithm for various FA cells such as: CPL [25], CCMOS [30], TGA [33], TFA [34], 14-T [37], NHPSC [41]. Due to its ability to optimize hybrid logic cells, the SEA transistor sizing method in [76] has been used in this paper for optimizing FA cells.

Simulation testbench
To inspect the performance parameters of FA cells, a feasible structure is required to perform simulations. Various simulation testbenches reported by researchers for FA simulation are illustrated in Fig. 2. In the testbench in Fig. 2a, three stages of FAs are connected with buffers at the input and output terminals. Delay for this testbench is measured from the input terminals of the 1st FA stage to the last signals in the 3rd FA. Hence, it does not represent the delay of a single FA block. Moreover, inputs are only applied to the 1st FA stage. Therefore, the 2nd and 3rd FA stages are not tested properly. As a result, the power consumed by various FA stages is different. In addition, the fan-out of Sum is 1 whereas the fan-out of C out is 2. Therefore, the FA blocks are not similarly loaded. The FA testbenches in Fig. 2b and c are similar except for the last parts. Both of these testbenches are free from the limitations of the testbench in Fig. 2a. Since the SEA transistor sizing method, described in Sect. 3.2, used the simulation testbench in Fig. 2c, we have also used this testbench for FA simulation in this work.
The testbench in Fig. 2c, which is used in this research, contains a set of buffers attached to the input terminals. In fabricated processors or ICs, the signals pass through several non-ideal circuit components which distort the signals. Therefore, while generating input signals for simulation, it is necessary to replicate the real-world scenario by introducing signal distortion. For this reason, the buffers are attached to the input terminals to introduce distortion in the input signals. On the other hand, the output of a circuit is always connected to other components in an IC which act as a load to the circuit. Therefore, in simulation, it becomes necessary to attach a load circuit to each output terminal. The testbench in Fig. 2c uses buffers at the output terminals as load circuits.

Simulation results and performance comparison
For comparative investigations of performance parameters, simulations have been conducted considering various aspects and operating conditions. Obtained simulation results for FA cells are presented in the following sub-sections.

Performance of FAs as single cells
The simulation results obtained using the testbench in Fig. 2c are presented in Table 1 and Fig. 3. It can be observed from Fig. 3a
In the case of speed (propagation delay), HBD 7 FA in [50] obtained the best performance. The HBD 7 FA cell uses the input signal C in as the gate control of the transistors at the outermost terminals. As a result, the outermost portion is switched on before the signals at the internal nodes are generated. Once the internal signals are generated, they instantly appear at the output terminals since the output stage has been turned on beforehand. With this scheme, the circuit achieves better speed. HBD 3 [46], GDI 2 [65], and GDI 3 [66] are close contenders of HBD 7 [50] in speed. The speeds of CPL [25] and CCMOS [30] FAs are quite satisfactory in spite of these being some of the oldest FA topologies. 12-T [27], 10-T [35], 14-T [37], NHPSC [41], and HBD 6 [48] FA cells have a very high propagation delay, which limits their application in high-speed systems. 12-T [27], 10-T [35], 14-T [37] and NHPSC [41] FAs have threshold voltage drop issues at the internal nodes, so the internal nodes are subject to voltage degradation. When this degraded voltage is used as the gate control of a transistor, it takes more time for the transistor to turn on. For this reason, 12-T [27], 10-T [35], 14-T [37] and NHPSC [41] FAs have severe speed issues. In the HBD 6 FA design [48], input terms A and B are first used in an XNOR circuit. Then, an inverter is used to invert the XNOR signal into XOR. Later, these XOR-XNOR signals are used in the sum and carry-out circuits to generate the final outputs. Since the XOR signal incurs one more inverter stage delay than the XNOR signal, the sum and carry-out circuits become slower. This is the main reason behind the speed issues of HBD 6 FA [48].
In terms of PDP, HBD 7 [50] acquired the highest performance. HBD 7 obtained the best performance in speed while maintaining quite satisfactory performance in power consumption. For this reason, HBD 7 could attain the best performance in PDP. In spite of excellent performance in speed, CPL [25] has very high PDP due to its high average power. 16

Performance of FAs in various load conditions
Drive power of VLSI circuits is an important parameter that is highly required for high fan-out conditions. High-performance circuits (high speed and low power circuits), Table 2.
Fig. 2 Various simulation testbenches for FA: (a) simulation testbench in [49], (b) simulation testbench in [33,36,41,44,45,46,50], (c) simulation testbench in [76]
After extensive investigation of the data presented in Table 2, FA cells are categorized into three major groups: low drive power FA (marked by bold italic text in the Drive Power column of Table 2), moderate drive power FA (marked by italic text in the Drive Power column of Table 2) and high drive power FA (marked by bold text in the Drive Power column of Table 2). Simulation data of each group (low drive power, moderate drive power and high drive power FA) are displayed in Fig. 4a, b and [54], HBD 12 [54], GDI 3 [66], GDI 4 [66], and GDI 5 [66] are moderate drive power FAs. Lastly, CCMOS [30], HPSC [40], NHPSC [41], ULPFA [42], HBD 8 [51], HBD 9 [52], and GDI 2 [65] constitute the high drive power group.
To compare among the groups, three FA cells from each group have been selected as representatives. The representatives from each group are: (1) the FA that achieved the best drive power, (2) the FA that has the least drive power, and (3) the FA having the middle-most drive power between type (1) and type (2) FAs. The associated propagation delays for each type are shown in Fig. 4d. In Fig. 4d, CPL [25], 14-T [37], and GDI 1 [64] FAs are the representatives of the low drive power FA group. TFA [34], HBD 7 [50], and HBD 10 [53] represent the moderate drive power FA group whereas ULPFA [42], HBD 8 [51], and GDI 2 [65] represent the high drive power FA group. It can be seen that, with increasing fan-out, propagation delays for moderate drive power FAs rise at a higher pace than those of high drive power FAs. In the case of low drive power FAs, the propagation delay increases quite rapidly compared to the other groups. FAs whose output terminals have voltage degradation issues mainly fall in the low drive power group.
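The selection of three representatives per group can be sketched as follows. This is only an illustration: the numeric drive-power score, the cell names, and the choice of "closest to the midpoint of best and worst" as the meaning of middle-most are all assumptions, since the text does not define the measure in closed form.

```python
def pick_representatives(drive_scores):
    """Return (best, middle, worst) cell names for one drive-power group.

    drive_scores maps a cell name to a hypothetical numeric drive-power
    score, where a larger score means better drive capability.
    """
    if len(drive_scores) < 3:
        raise ValueError("need at least three cells in a group")
    best = max(drive_scores, key=drive_scores.get)
    worst = min(drive_scores, key=drive_scores.get)
    # Assumption: "middle-most" is the cell closest to the midpoint
    # between the best and the worst score.
    midpoint = (drive_scores[best] + drive_scores[worst]) / 2
    others = [c for c in drive_scores if c not in (best, worst)]
    middle = min(others, key=lambda c: abs(drive_scores[c] - midpoint))
    return best, middle, worst


# Hypothetical scores, for illustration only.
group = {"cell_a": 2.0, "cell_b": 9.0, "cell_c": 5.0, "cell_d": 7.5}
print(pick_representatives(group))
```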

Performance of FAs in wide adder structure
Modern ALUs require wide adder structures (16-bit, 32-bit, etc.) to perform computation [77]. Therefore, it is important to compare the performance of FAs operating in wide adder architecture. To do so, the FA cells have been extended up to 32-bits using the Ripple-Carry Adder style [78]. Simulation results on performance parameters have been recorded in Table 3. No voltage level restoring buffers have been added while extending the FA cells to a wide adder structure.
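As a purely logic-level illustration of the Ripple-Carry Adder extension, the sketch below chains a 1-bit FA function 32 times. The function names are hypothetical, and the model is Boolean only: it cannot reproduce electrical effects such as the voltage degradation that determines whether a real FA cell scales.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, carry_out) for bits a, b, cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, width=32, cin=0):
    """Add two unsigned integers by rippling the carry through `width` FAs."""
    carry = cin
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)  # carry-out feeds the next stage
        result |= s << i
    return result, carry  # carry is the final carry-out


print(ripple_carry_add(5, 7))
print(ripple_carry_add(0xFFFFFFFF, 1))
```

In the second call the carry ripples through all 32 stages, which is the situation in which the carry signal of a non-scalable cell would have to survive the longest chain.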
It has been observed that 21 out of 33 FA cells (marked with 'F' in Table 3) could not operate when they were extended to 32 bits. This occurred due to the degradation of signal strength while propagating through a series of logic circuits. To eliminate this issue, level restoration buffers are required, which costs additional circuitry. As a result, delay and power consumption will increase. Therefore, circuits that can be incorporated
Figure 5 presents a comparison of the carry-output graphs of a scalable and a non-scalable FA cell extended using the RCA style. CCMOS FA [30] represents the scalable FA while HBD 1 [45] represents the non-scalable FA in Fig. 5. For CCMOS FA in Fig. 5a, no voltage degradation in the carry signals can be seen. On the other hand, carry signals C 4 and C 8 of HBD 1 in Fig. 5b show voltage degradation. Due to this voltage degradation, the carry signal falls below the threshold voltage at some point while propagating through the series of FA cells. As a result, the signal becomes unable to drive the next stage and the circuit fails to operate. For this reason, carry signals C 16 and C 32 of HBD 1 FA are not available in Fig. 5b. As with HBD 1 FA, the same condition applies to the other circuits that could not operate in multiple-bit structures. Among the remaining 12 FAs, the output terminals of CCMOS [30], 24-T [39], HPSC [40], NHPSC [41], ULPFA [42], HBD 8 [51], HBD 9 [52] and GDI 2 [65] FA cells are composed of CCMOS logic circuits. The pull-up network of the CCMOS logic circuit is connected to Vdd and the pull-down network to Ground. As a result, when extended to a wide adder architecture, the output signal voltage is replenished after every FA stage. For the remaining 4 FA cells that could be extended to 32 bits, the same output-carry signal does not propagate throughout the entire 32-bit stages. Hence, the voltage strengths of the signals do not decline [50]. As a result, the FA cells could operate successfully in wide adder architecture without using voltage restoring buffers.

Major finding and discussion
As modern microprocessors are not limited to only a 1-bit addition operation, FA cells need to have the ability to be scaled up to wide word-length adders. Therefore, scalability is a major factor that needs to be investigated while analyzing FA cells. In this research, the scalability test conducted in Sect. 4 [53,54], GDI 1 [64] and GDI 3 [66] FAs are quite satisfactory as per simulation data presented in Table 1 and Fig. 3. However, they could not operate when extended to 16 bits and 32 bits. Based on this observation, it can be said that the performance comparison of adders based on only a 1-bit operation should not be the main parameter for analyzing FAs. Rather, it should be analyzed whether the 1-bit adder cells are scalable or not. Moreover, based on the data presented in Table 3, it is essential to mention that the classic CCMOS FA cell obtained better performance than many FA cells when operating in a wide word-length structure. This is the main reason why CCMOS logic remains the prominent circuit design methodology despite being one of the oldest VLSI circuit design methods.
In recent research activities, the concept of the fast parallel prefix adder has evolved, which aims to generate carry terms in parallel to reduce carry propagation delay [79]. Most parallel prefix adders require carry-propagate and carry-generate signals to perform addition [80]. Carry propagate is the XOR function of the input bits that are to be added. On the other hand, carry generate is the AND function of the input bits. Therefore, for fast parallel adders, FA cells incorporating XOR and AND functions will be highly suitable. Among the FAs analyzed in this research, DPL [44], SR-CPL [44], HBD 7 [50], and GDI 4 [66] FAs have AND and XOR functions, so they will be able to create carry generate and carry propagate signals without any extra hardware. As a result, these FAs will be more suitable for modern fast adder architectures.
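The carry-generate and carry-propagate definitions above can be checked with a short logic-level sketch. The function names are hypothetical; the relation carry-out = G OR (P AND carry-in) is the standard identity used by carry-lookahead and parallel prefix adders.

```python
from itertools import product


def generate_propagate(a, b):
    """Return (G, P): carry generate = a AND b, carry propagate = a XOR b."""
    return a & b, a ^ b


def fa_from_gp(a, b, cin):
    """Full adder outputs expressed through G and P only."""
    g, p = generate_propagate(a, b)
    s = p ^ cin            # sum reuses the propagate (XOR) signal
    cout = g | (p & cin)   # carry is generated, or propagated from cin
    return s, cout


# Exhaustive check against ordinary integer addition for all 8 input patterns.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = fa_from_gp(a, b, cin)
    assert 2 * cout + s == a + b + cin
print("G/P formulation matches a + b + cin for all input combinations")
```

Because both outputs are derived from G and P, an FA cell that already exposes the AND and XOR terms can feed a prefix network directly.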
The multiplier is another potential application of the FA. In a multiplier, the carry output of one stage does not need to propagate through several stages, so scalability is not the major concern [81,82]. For this reason, FA cells having good performance parameters while operating [66] FAs have good performance due to which they will be good candidates for utilization in multipliers. If transistors are scaled to lower technology nodes, the parasitics associated with the transistors will decrease, so any circuit operating in a lower technology node will exhibit better performance than in a higher technology node. However, if FAs are simulated in a lower technology node than the 45 nm CMOS process, the performance difference among FA cells will likely remain the same since parasitics will decrease in the same manner for all FA cells. But in the case of lower technology nodes, interconnect parasitics do not decrease in the same manner as transistor parasitics do [83]. For this reason, interconnect widths need to be optimized in lower technology nodes to maintain the performance levels of FA cells [83].

Conclusion
A comprehensive literature review and performance comparison of various FA designs have been conducted in this research. The performance of FA cells, operating both as single-bit cells and in wide-adder structures, has been investigated. The simulation results include average power, propagation delay, and PDP (Power-Delay Product), which cover most of the main performance metrics.
To determine the effectiveness of FAs in high fan-out cases, and to have a comparative analysis of their drive powers, the FA designs have also been simulated using various load conditions. According to this study, only a few of the existing FA cells are capable of performing well when they are scaled up to multiple-bit structures. Hence, although it is popular to compare FA cells by comparing their performance parameters in the 1-bit structure, this research recommends that the practical