A Survey on Pipelined FFT Hardware Architectures

The field of pipelined FFT hardware architectures has been studied for the last 50 years. This paper surveys the main advances in the field related to architectures for complex input data and power-of-two FFT sizes. Furthermore, the paper is intended to be educational, so that the reader can learn how the architectures work. Finally, the paper divides the architectures into serial and parallel. This classification groups together those architectures that are conceived for a similar purpose and, therefore, are comparable.


Introduction
The year 2020 marks 50 years since the first pipelined FFT hardware architectures were presented in 1970 [44,68]. During these 50 years, the field of fast Fourier transform (FFT) hardware architectures has developed substantially. As knowledge in the field has deepened, we have come to understand the mathematical fundamentals that govern the architectures, and this has allowed us to derive efficient circuits to calculate the FFT.
The aim of this paper is to collect the main advances in pipelined FFT hardware architectures so far and to present them in a way that the reader can understand how the architectures work, serving as an introduction to the field. The paper focuses on pipelined FFT architectures for power-of-two sizes and complex input data. Other types of architectures, such as iterative and fully parallel FFTs, real-valued FFTs, variable-length architectures, FFTs for natural input/output order and non-power-of-two FFT sizes, are outside the scope of this paper. Likewise, the paper targets the architectures themselves, and advances in other sub-fields related to them [36] are not considered. Therefore, advances related to FFT algorithms, data management in FFTs, implementation of rotators and accuracy analysis are not studied in detail in this paper. Only some concepts in these sub-fields are provided in Sections 2 and 6, as they are necessary for understanding the architectures. Further information related to these sub-fields can be found in [40].

This work was supported by the Ramón y Cajal Grant RYC2018-025384-I of the Spanish Ministry of Science, Innovation and Universities. This article belongs to the Topical Collection: Survey Papers. Mario Garrido (mario.garrido@upm.es), Department of Electronic Engineering, Universidad Politécnica de Madrid, Madrid, Spain.
The paper is organized as follows. After discussing some basic concepts in Section 2, an overview of pipelined FFT hardware architectures is provided in Section 3. This overview introduces the types of pipelined FFT architectures and includes a chronology with the main advances in the field. Later, the different types of pipelined FFT architectures are described in Sections 4 and 5. Section 4 is devoted to serial FFT architectures, whereas parallel FFT architectures are discussed in Section 5. For a better understanding of the architectures, a brief discussion on rotations in FFT architectures is provided in Section 6. A comparison of the architectures is provided in Section 7. Finally, the main conclusions of the paper are summarized in Section 8.
The term FFT refers to a group of efficient algorithms to calculate the DFT. Among them, the most widely used algorithm was proposed by Cooley and Tukey [19]. The Cooley-Tukey algorithm reduces the number of operations from O(N²) for the DFT to O(N log₂ N) for the FFT. This is achieved because the calculation of the outputs X[k] in the DFT includes operations that are shared among the outputs.

Flow Graph and Radix
FFT algorithms are generally represented by their flow graphs. Figure 1 shows the flow graph of a 16-point radix-2 FFT, decomposed according to the decimation in frequency (DIF) decomposition [27,40]. The FFT consists of n = log₂ N stages. At each stage of the graph, s ∈ {1, …, n}, butterflies and rotations are calculated.
The numbers at the input of the flow graph represent the indices of the input sequence, whereas those at the output are the frequencies, k, of the output signal X[k]. Finally, each number, φ, in between the stages indicates a rotation by the twiddle factor W_N^φ = e^{−j2πφ/N}. As a consequence, data for which φ = 0 do not need to be rotated. Likewise, if φ ∈ {0, N/4, N/2, 3N/4}, data must be rotated by 0°, 270°, 180° and 90°, which correspond to complex multiplications by 1, −j, −1 and j, respectively. These rotations are considered to be trivial, because they can be calculated by interchanging the real and imaginary components and/or changing the sign of the data.

Figure 1 also includes an index I with its binary representation b_{n−1} … b₁b₀. This index will be used to understand the architectures. Indeed, at each stage s, the bit b_{n−s} plays a crucial role in the architecture, as will be explained in Section 2.2.

Figure 2 shows the flow graph of a 16-point radix-2 FFT decomposed according to decimation in time (DIT). It can be noted that the DIF and DIT decompositions only differ in the rotations at the FFT stages. Indeed, it was observed in [27] that FFT algorithms for power-of-two FFT sizes based on the Cooley-Tukey algorithm only differ in the rotations at the different stages. This means that the structure of butterflies is the same for all the algorithms.
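The four trivial rotations can be checked numerically. The following sketch (illustrative, not part of the original paper) evaluates the usual twiddle-factor definition W_N^φ = e^{−j2πφ/N} for these angles:

```python
import cmath

# Evaluate the twiddle factor W_N^phi = exp(-j*2*pi*phi/N) for the
# four trivial angles phi = 0, N/4, N/2, 3N/4 (here N = 16).
N = 16
for phi in (0, N // 4, N // 2, 3 * N // 4):
    w = cmath.exp(-2j * cmath.pi * phi / N)
    print(phi, complex(round(w.real), round(w.imag)))
# phi = 0, 4, 8, 12 yield 1, -j, -1, j: only sign changes and
# real/imaginary swaps, so no multiplier is needed.
```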
The DIF and DIT radix-2 FFT algorithms were the first ones based on the Cooley-Tukey algorithm. Later, other radices were proposed. All the radices have the form ρᵏ. On the one hand, the base of the radix, ρ, indicates the size of the butterflies. Radices with base ρ = 2 use radix-2 butterflies as the basic structure, as shown in Figure 3, whereas radices with base ρ = 4 use radix-4 butterflies as the basic structure, as shown in Figure 4. On the other hand, the exponent of the radix, k, refers to how the rotations are distributed among the FFT stages, meaning that the most complex rotations are placed every k stages. Figure 5 shows the flow graph of a 16-point radix-2² FFT. In radix-2² algorithms, odd stages in the flow graph only include trivial rotations, as can be observed in the figure. In FFT architectures, this allows for simplifying the rotators.
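The placement rule described above (the most complex rotations every k stages) can be sketched as follows. This is an illustrative helper, not from the original paper:

```python
# Illustrative sketch: in a radix-2^k FFT, the most complex (general)
# rotators appear every k stages; the remaining stages use trivial or
# otherwise simplified rotators. The last stage needs no rotator.
def rotator_schedule(N, k):
    n = N.bit_length() - 1          # n = log2(N) stages
    return {s: ('general' if s % k == 0 else 'simplified/trivial')
            for s in range(1, n)}

print(rotator_schedule(16, 2))      # radix-2^2: odd stages are trivial
```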
Finally, higher radices such as radix-2³, radix-2⁴ and radix-2⁵ offer different distributions of rotations at the FFT stages. Further information on FFT algorithms can be found in [27,70].

The Bit b_{n−s}
The bit b_{n−s} is the essence of FFT architectures. In order to understand its relevance, let us consider the flow graph in Figure 1. At the first stage, the butterflies operate on the pairs of data with indices (0, 8), (1, 9), (2, 10), etc. By representing these indices in binary, we get (0000, 1000), (0001, 1001), (0010, 1010), etc. The comparison of the indices in each pair shows that each pair only differs in the most significant bit, which is b₃, and this holds for all the pairs of indices at stage 1. By doing the same analysis for stage 2, we can see that the indices of pairs of data that are processed by the butterflies only differ in b₂. For stage 3 it is b₁ and for stage 4 it is b₀.
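This bitwise pattern is easy to check programmatically. The following sketch (illustrative, not from the paper) enumerates the butterfly pairs at each stage of a 16-point radix-2 FFT:

```python
# For each stage s of an N-point radix-2 FFT, build the index pairs
# processed together by the butterflies and check that the two
# indices of every pair differ only in bit b_{n-s}.
N = 16
n = N.bit_length() - 1              # n = log2(N) = 4

for s in range(1, n + 1):
    mask = 1 << (n - s)             # the bit b_{n-s}
    pairs = [(i, i | mask) for i in range(N) if not i & mask]
    assert all(a ^ b == mask for a, b in pairs)
    print(f"stage {s}: first pairs {pairs[:3]} differ in bit b_{n - s}")
```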
In general, for any N-point FFT with n = log₂ N stages, pairs of data that are processed together in the butterflies at stage s differ in b_{n−s}. This is an important statement that leads to the following one: in any FFT architecture, pairs of data that are input to a butterfly at the same clock cycle must always differ in the bit b_{n−s} of the index. This guarantees that the architecture calculates the butterflies of the algorithm correctly, and it will be the base for the explanation of the architectures in Sections 4 and 5.

Types of Pipelined FFT Architectures

Figure 6 shows the general structure of a pipelined FFT. It consists of n stages connected in series, where data flow from stage 1 towards stage n, and each stage s of the architecture calculates all the computations of one stage of the FFT algorithm. Each stage has P inputs and P outputs, and data flow continuously at a rate of P data per clock cycle.

Table 1 shows a classification of pipelined FFT architectures. The table separates the architectures into serial and parallel. Serial pipelined FFT architectures process a continuous flow of one datum per clock cycle and will be studied in Section 4. They are classified into single-path delay feedback (SDF), single-path delay commutator (SDC), single-stream feedforward (SFF) and serial commutator (SC). Parallel pipelined FFT architectures process P data per clock cycle, where P > 1, and will be studied in Section 5. They are classified into multi-path delay feedback (MDF), multi-path delay commutator (MDC) and multi-path serial commutator (MSC).

Chronology

Table 2 shows a timeline with the main advancements in the area of FFT hardware architectures. The table includes only those works that have proposed new types of architectures or relevant modifications that have led to a reduction in the number of resources. Other papers in the field that provide relevant contributions but target already known architectures are not included in the chronology.
The first two pipelined FFT architectures were proposed in 1970, a few months apart. They were the first SDC FFT [68] and the first SDF FFT [44], both for radix-2. The first MDC FFTs were proposed in 1973 [42]. This work included alternatives for different radices. However, it had the limitation that the parallelization of the architecture had to be equal to the radix, i.e., P = r. This way, radix-2 was used to process 2 data in parallel, radix-4 to process 4 data in parallel and radix-8 to process 8 data in parallel.
For SDF FFT architectures, the use of radix-4 was introduced in 1974 [23]. In 1979, an SDF FFT architecture that divided the calculation into FFTs of 16 points was presented [24]. Although it was referred to as a radix-16 algorithm, this architecture corresponds to what we nowadays call radix-2⁴. For SDC FFT architectures, radix-4 was adopted in 1989 [7].
An evolution of MDC FFT architectures came in 1983 [50], when the first radix-2 FFT hardware architectures for any P were presented. This dissociated the number of parallel data from the radix: the number of parallel data no longer had to be equal to the radix.
MDF FFTs had a late appearance with respect to SDF, SDC and MDC architectures, as the first MDF FFTs were presented in 1984 [82].
Radix-2² was introduced in 1998 for SDF FFT architectures [45]. Radix-2 made better use of the butterflies than radix-4, whereas radix-4 made better use of the rotators than radix-2. Radix-2² took the best of both radix-2 and radix-4. This is why the literature started to talk about radix-2ᵏ algorithms from then on.
However, there was still an issue with the usage of butterflies and rotators in FFT architectures, as no architecture so far achieved a 100% utilization of butterflies and rotators. This was solved in 2006, when a deep feedback SDF FFT was presented [85]. Nevertheless, this improvement came at the cost of a 33% increase in memory.
From 1989 to 2008, SDF FFTs were the main serial FFT architectures. The reason was that SDC FFTs only reached 50% utilization, as they were not processing data during half of the time. This was solved in 2008, when an SDC FFT that split the higher and lower parts of the data bits was presented [9]. This was followed by two works on SDC FFT architectures that split data into the real and imaginary components [65,81]. These two works differed in the management of the rotators. In the meantime, new advancements in MDC FFT architectures were presented. The first radix-2² MDC FFTs were introduced in 2009 [26], and this approach was extended to radix-2ᵏ in 2013 [33].
In 2016, a new MDF FFT called M²DF was introduced [80]. This architecture was based on making better use of the butterflies in MDF architectures.
The first SC FFT architecture was also presented in 2016 [35]. This was the first architecture to use circuits for serial-serial permutation, leading to 100% utilization in butterflies and rotators, while using a small memory.
The SFF FFT was presented in 2018 [47]. This serial architecture uses a small number of butterflies, rotators and multiplexers. This is achieved by making use of a double memory.
Finally, the first MSC FFT was presented in 2020 [46]. It is the parallel version of the SC FFT.

The SDF FFT Architecture

Figure 7 shows a 16-point radix-2 SDF FFT architecture [1,44]. Each stage includes a radix-2 butterfly (R2), a rotator (⊗) or trivial rotator (diamond-shaped), and a buffer of length L = 2^{n−s}. The internal structure of a stage is shown in Figure 8 and the timing of one stage of the SDF FFT is shown in Figure 9.
The SDF FFT works in a simple way, which can be understood from Figures 8 and 9. At each stage, it receives the data of the flow graph in Figure 1 in the order of the index, i.e., from top to bottom in the flow graph. According to this order, pairs of data that differ in b_{n−s} arrive with a difference of 2^{n−s} clock cycles. In order to have these pairs of data simultaneously at the input of the butterfly, a buffer of length L = 2^{n−s} is used. This way, the output of the buffer is processed in the butterfly together with the input of the stage. Afterwards, one of the results of the butterfly continues towards the multiplier and the other output is saved in the buffer. From Figure 8(b) it can be clearly seen that data do not change order; they simply pass through the butterfly at certain clock cycles. In Figure 9, the light blue rectangle indicates when the butterfly processes data, which occurs when part A is at the output of the buffer and part B is at the input of the stage. This occurs 50% of the time. The other 50% of the time is used to let data flow through the buffer. This results in a 50% utilization of the butterfly.
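This buffering scheme can be mimicked in software. The sketch below is a behavioral model (not HDL from the paper) that streams one datum per cycle through a single radix-2 SDF stage with buffer length L:

```python
from collections import deque

# Behavioral model of one radix-2 SDF stage. During the first L cycles
# of each half-period, data simply shift through the buffer; during
# the next L cycles the butterfly combines the buffer output (entered
# L cycles earlier) with the stage input: the sum goes downstream and
# the difference is fed back into the buffer.
def sdf_stage(stream, L):
    buf = deque([0] * L)
    out = []
    for t, x in enumerate(stream):
        a = buf.popleft()              # datum from L cycles ago
        if (t // L) % 2 == 1:          # butterfly active 50% of the time
            out.append(a + x)          # a and x differ in bit b_{n-s}
            buf.append(a - x)          # difference waits in the buffer
        else:                          # pass-through phase
            buf.append(x)
            out.append(a)
    return out

# First stage of a 16-point FFT (L = 8): after the latency, the sums
# appear first, followed by the differences stored in the buffer.
y = sdf_stage(list(range(16)) + [0] * 8, 8)
print(y[8:16], y[16:24])
```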
A radix-2 SDF FFT architecture uses one butterfly per stage, which corresponds to 2 log₂ N complex adders for the entire FFT. It also includes non-trivial rotators in all the stages but the last two, leading to a total of log₂ N − 2 non-trivial rotators, and a total memory of N − 1.

The radix-4 SDF FFT architecture [23,71] is shown in Figure 10. The number of stages in this case is n = log₄ N = log₄ 16 = 2. Each stage uses a radix-4 butterfly and three buffers of length L = 4^{n−s}. The internal structure of a stage in a radix-4 SDF FFT architecture is shown in Figure 11. The idea is the same as in the radix-2 FFT. However, in this case the butterfly processes four data that are equally separated in time. Figure 12 shows the timing. Here, groups of data flow through the buffers and the butterfly is in use 25% of the time. This utilization is lower than that of radix-2. However, the reduction of stages in radix-4 reduces the number of rotators.
The total number of complex adders in the butterflies of a radix-4 FFT is 8 log₄ N, the total number of rotators is log₄ N − 1 and the total memory is N − 1. As a result, radix-4 halves the rotator complexity with respect to radix-2 but doubles the butterfly complexity.
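These counts can be tabulated directly. A small helper (illustrative only) evaluates the formulas quoted in the text:

```python
from math import log2

# Adder and non-trivial-rotator counts quoted for SDF FFTs:
#   radix-2: 2*log2(N) adders, log2(N) - 2 rotators
#   radix-4: 8*log4(N) adders, log4(N) - 1 rotators
def sdf_counts(N, radix):
    if radix == 2:
        return 2 * log2(N), log2(N) - 2
    n4 = log2(N) / 2                   # log4(N)
    return 8 * n4, n4 - 1

for N in (64, 256):
    print(N, sdf_counts(N, 2), sdf_counts(N, 4))
# radix-4 doubles the adder count and halves the rotator count
```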
When radix-2² is used in an SDF FFT, the rotators at odd stages become trivial. As a result, a radix-2² SDF FFT has the same butterfly and memory complexity as radix-2 but half the rotator complexity, which is (1/2)·log₂ N − 1. This rotator complexity is the same as in radix-4. Therefore, radix-2² benefits from the small complexity of the butterflies in radix-2 and the small complexity of the rotators in radix-4.
Despite the fact that high-radix SDF FFTs are very popular, they have a utilization of 50% for the butterflies. A way to improve this utilization is to use the deep feedback strategy in [85]. The deep feedback SDF FFT architecture is shown in Figure 14, the detail of a stage of the architecture is shown in Figure 15 and the timing in Figure 16. It can be noted that the stages include three buffers, one of them with twice the length of the other two. First, data enter the long buffer, and then the other two buffers. The timing indicates when the butterflies are calculated. The stages use a radix-2 butterfly that is reused at different time instants. Also, the input data to the butterfly are taken from different nodes of the circuit at different time instants. This is why the stage in Figure 15 uses multiplexers to route the data. Furthermore, there is no overlap between the calculations of the butterfly, and the radix-2 butterfly is in use 100% of the time.
The total number of complex adders in the butterflies of a deep feedback SDF FFT architecture is 2 log₄ N, the total number of rotators is log₄ N − 1 and the total memory is 4(N − 1)/3.

The SDC FFT Architecture
Although the SDC processes serial data, it is based on a 2-parallel MDC FFT, such as that in Figure 17. The timing for the MDC FFT architecture is shown in Figure 18, where the data order at each stage is indicated. For instance, stage 1 receives the data with indices 0 and 8 in parallel, followed by 1 and 9, and so on. It can be noted that data that differ in bit b_{n−s} arrive in parallel at the input of the butterflies, which allows for calculating the butterflies properly. However, the data order is not the same at each stage, which makes it necessary to include shuffling circuits to reorder the data. These circuits consist of two buffers and two multiplexers, where the number drawn in each buffer is the buffer length. The buffers at stage 1 exchange the positions of the data with indices (7, 6, 5, 4) and (11, 10, 9, 8). This is done by delaying the lower path 4 clock cycles in the buffer, swapping both sets of data with the multiplexers and then delaying the upper path 4 clock cycles so that data are aligned again. The data exchanges done by the shuffling circuits are indicated in the timing with lines that connect the data sets, and the resulting order is the order at the following stage.
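Functionally, this delay-swap-delay operation amounts to exchanging length-L blocks between the two parallel paths. A compact sketch of the assumed behavior (not a gate-level description):

```python
# The shuffling circuit delays the lower path by L cycles, swaps both
# paths through the multiplexers for L cycles, and re-aligns the upper
# path with another L-cycle delay. The net effect is that length-L
# blocks are exchanged between the two parallel streams.
def mdc_shuffle(upper, lower, L):
    up, lo = list(upper), list(lower)
    for base in range(0, len(up), 2 * L):
        up[base + L:base + 2 * L], lo[base:base + L] = \
            lo[base:base + L], up[base + L:base + 2 * L]
    return up, lo

# Stage-1 shuffling of the 16-point 2-parallel MDC (L = 4):
up, lo = mdc_shuffle(range(8), range(8, 16), 4)
print(up)  # [0, 1, 2, 3, 8, 9, 10, 11]
print(lo)  # [4, 5, 6, 7, 12, 13, 14, 15]
```

After the exchange, the pairs arriving in parallel at stage 2, such as (0, 4) and (1, 5), differ in bit b₂, as required.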
The SDC FFT architecture [21,68] is shown in Figure 19. The only difference with the MDC FFT in Figure 17 lies in the input and output of the data. Whereas the MDC receives two data in parallel at the input, the SDC receives data in series. The first half of the data passes through the input buffer and the second half is connected to the lower input. This creates the same order as in the timing of Figure 18. Afterwards, the architecture processes the data. Finally, the output is made serial again.
The consequence of making data parallel is that the architecture only works 50% of the time, whereas the other 50% of the time it receives and outputs the data.
In terms of hardware components, the architecture in Figure 19 uses a total of 2 log₂ N complex adders in butterflies, log₂ N − 2 rotators and a total memory of 2N − 2.
To solve the low utilization of the SDC FFT in Figure 19, several alternatives have been proposed. The first one is shown in Figure 20 and was proposed in [9]. This architecture splits the high (H) and low (L) parts of the data into two branches. This way, the architecture is working 100% of the time. To achieve this, the architecture changes slightly with respect to the SDC in Figure 19. First, it includes pre-processing and post-processing stages. Furthermore, both the upper and lower branches include multiplexers. Finally, the complexity of butterflies, rotators and buffers is reduced, as they receive half of the bits every clock cycle.
The timing for the architecture in Figure 20 is shown in Figure 21. The first two shuffling circuits are used to adapt the input order to the butterfly of stage 1. Note that the orders at the different stages fulfill the b_{n−s} condition. Finally, the shuffling circuit after the last stage again places the high and low parts of the data in parallel, which allows for concatenating them to form the output data.
For the SDC architecture in Figure 20, the number of complex adders in butterflies is 2 log₂ N. As the adders have half the data word length, the adder count is equivalent to log₂ N full-word adders. After the same half-word-length correction, the number of rotators is log₂ N − 2 and the total memory is 3N/2.
An alternative to the architecture in Figure 20 consists of splitting the input data into its real and imaginary parts. This alternative was used in [10,65]. In this case, the timing is the same, and the architecture only differs in the rotators. If we observe the timing in Figure 21 and assume that H corresponds to the real part (R) and L corresponds to the imaginary part (I), then each rotator receives the real and imaginary components of the same datum in consecutive clock cycles through the same branch. This allows for using the circuit in Figure 22 to place the real and imaginary components of the data in parallel at the input of the rotator. Then the rotation is calculated and, after that, the data order is restored.
For this architecture, the complex adder count is log₂ N, the rotator count is log₂ N − 2 and the total memory is slightly larger than 3N/2 due to the registers used to adapt the order at the rotators.
A further step in the evolution of SDC FFT architectures is shown in Figure 23 and corresponds to [81]. There, it is observed that the separation into R and I means that only data in the lower branch of the architecture need to be rotated. This allows for using the rotator in Figure 24, which has half the complexity, as the rotation is done in two consecutive clock cycles.
For this architecture, the complex adder count is log₂ N, the rotator count is (1/2)·log₂ N − 1 and the total memory is 3N/2.

The SFF FFT Architecture
The SFF FFT shares with the SDF FFT the characteristic that data arrive in natural order at the input of the architecture. Indeed, the order at each stage in both architectures is from top to bottom of the flow graphs.

Figure 24: Alternative rotator in a radix-2 SDC FFT architecture that divides the data into real and imaginary parts.

Figure 25 shows a 16-point radix-2 SFF FFT architecture and its timing is shown in Figure 26. A characteristic of the SFF FFT is that it calculates the addition and the subtraction of the butterfly at different time instants. This allows for using the same adder/subtractor for both of them. To achieve this, each stage needs two buffers of length L = 2^{n−s}. This allows for accessing the data that are processed in the adder/subtractor twice: first from the input and the point in between the buffers, in order to calculate the addition of the butterfly, and then from the point in between the buffers and the output of the second buffer, in order to calculate the subtraction of the butterfly. This is shown in the timing of Figure 26.
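This double-buffer access pattern can be modeled in a few lines. The following behavioral sketch is one interpretation of the description above, not code from the paper:

```python
from collections import deque

# Behavioral model of one radix-2 SFF stage with two buffers of
# length L. Each pair of data is visited twice: the addition is
# computed when the later datum is at the stage input and the earlier
# one is between the buffers; the subtraction is computed L cycles
# later, from the midpoint and the output of the second buffer. A
# single adder/subtractor alternates between the two operations.
def sff_stage(stream, L):
    buf1, buf2 = deque([0] * L), deque([0] * L)
    out = []
    for t, x in enumerate(stream):
        mid = buf1.popleft()          # datum from L cycles ago
        end = buf2.popleft()          # datum from 2L cycles ago
        if (t // L) % 2 == 1:
            out.append(mid + x)       # butterfly additions
        else:
            out.append(end - mid)     # butterfly subtractions
        buf1.append(x)
        buf2.append(mid)
    return out

y = sff_stage(list(range(16)) + [0] * 8, 8)   # stage 1 of a 16-point FFT
```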
The rotators in an SFF FFT are the same as in an SDF FFT. They receive data from a single stream in natural order. Therefore, the SFF allows for the same use of the radices as an SDF FFT. For instance, radix-2² makes the rotators at odd stages trivial, which reduces the rotator complexity.
The complex adder count of an SFF FFT is log₂ N, the number of non-trivial rotators is log₂ N − 2 for radix-2 and (1/2)·log₂ N − 1 for radix-2², and the total memory is 2N − 2.

The SC FFT Architecture

Figure 27 shows a 16-point radix-2 serial commutator FFT [35]. As in other FFT architectures, the stages consist of butterflies, rotators and shuffling circuits. What makes the SC FFT different is that it uses circuits that shuffle data arriving in series [32]. The timing of a stage of the architecture is shown in Figure 28. At each stage of the SC FFT, data that differ in bit b_{n−s} arrive in consecutive clock cycles. This allows for delaying half of the data one clock cycle so that values that differ in bit b_{n−s} arrive simultaneously at the butterfly. After the calculations, the other half of the data is delayed one clock cycle to form the serial data flow again.

According to this, b_{n−s} is related to consecutive clock cycles at stage s, and b_{n−s−1} is related to consecutive clock cycles at stage s + 1. Therefore, the shuffling circuits placed between the stages are used to adapt these orders.
The internal structure of a stage in an SC FFT is shown in Figure 29. As data flow serially, both the real and imaginary parts of a datum are provided at the same clock cycle. These parts are separated at the input of the stage. Thanks to the shuffling at the input of the stage, the butterfly first adds and subtracts the real parts of the data, and in the next clock cycle it adds and subtracts the imaginary parts of the data. This way, the butterfly only requires one real adder and one real subtractor. Similarly, the rotator receives the data to be processed in two consecutive clock cycles. This allows for halving the complexity of the rotator: instead of four multipliers, an adder and a subtractor, the rotator only needs two multipliers and one adder/subtractor.
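The two-cycle rotator can be sketched as follows (an illustrative model of the reuse scheme, with the products of the first cycle assumed to be held in registers):

```python
# Half-complexity rotator: the real part a and the imaginary part b
# of a datum arrive in consecutive clock cycles, so two real
# multipliers (each used twice) and one adder/subtractor compute
# (a + jb)(c + jd) = (ac - bd) + j(ad + bc).
def sc_rotate(a, b, c, d):
    p1, p2 = a * c, a * d      # cycle 1: products with the real part
    re = p1 - b * d            # cycle 2: combine with the imaginary part
    im = p2 + b * c
    return re, im

print(sc_rotate(1, 2, 3, 4))   # (1+2j)*(3+4j) = -5 + 10j
```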
As a result, the SC FFT requires a total of log₂ N complex adders for the butterflies, (1/2)·log₂ N − 1 rotators and a memory slightly larger than N.

The MDF FFT Architecture
The MDF FFT architecture is the parallel version of the SDF FFT. At first, MDF FFT architectures consisted of several SDF FFT architectures connected by shuffling circuits [82].

Figure 26: Timing of a radix-2 SFF FFT architecture.

However, it is even easier to unfold an SDF FFT into an MDF FFT. Let us compare the SDF FFT in Figure 7 to the MDF FFT in Figure 30. In the SDF FFT, data arrive in series in natural order, i.e., from index 0 up to N − 1. In the MDF FFT, the data order at all the stages is shown in Figure 31. In it, even-indexed data flow through the upper path and odd-indexed data through the lower path. As a consequence, the last stage, which operates on b₀ directly, takes the data from both paths. Likewise, by separating even-indexed and odd-indexed data, those data that differ in b_{n−s} are closer in the pipeline, which halves the length of the buffers at stages 1 to n − 1. Except for these facts, the MDF FFT works as an SDF FFT.
A higher parallelization of the MDF FFT is also possible. Figure 32 shows the case of a 16-point 4-parallel radix-2 MDF FFT architecture, and Figure 33 shows its timing. Again, the first stages process data as in an SDF FFT and the length of the buffers in these stages is divided by P with respect to the buffer lengths in the SDF FFT. The last two stages process b₁ and b₀. Both of them appear in parallel streams, so it is only necessary to combine those parallel streams, which differ in the bit corresponding to each stage.
In general, a P-parallel radix-2 MDF FFT uses 2P log₂ N − P log₂ P complex adders in butterflies, P log₂ N − (P/2)·log₂ P − P − [1]* non-trivial rotators, where the term [1]* only applies for P = 2, and a total memory of N − P. As in SDF FFT architectures, radix-2² used in MDF FFTs transforms the rotators at odd stages into trivial rotators, which halves the rotator complexity.
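As an arithmetic check, these formulas can be evaluated directly (illustrative helper; the [1]* correction is applied only for P = 2):

```python
from math import log2

# Resource counts for a P-parallel radix-2 MDF FFT as quoted in the
# text: 2*P*log2(N) - P*log2(P) adders,
# P*log2(N) - (P/2)*log2(P) - P - [1]* non-trivial rotators
# (the [1]* term only for P = 2), and N - P memory words.
def mdf_counts(N, P):
    n, p = int(log2(N)), int(log2(P))
    adders = 2 * P * n - P * p
    rotators = P * n - (P // 2) * p - P - (1 if P == 2 else 0)
    return adders, rotators, N - P

print(mdf_counts(16, 2))   # (14, 4, 14)
print(mdf_counts(16, 4))   # (24, 8, 12)
```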
An alternative to the conventional MDF FFT architectures is the M²DF FFT shown in Figure 34 and presented in [80]. The M²DF FFT increases the utilization of the butterflies and rotators by creating two data flows in opposite directions. These two data flows do not overlap in time, which allows for sharing the butterflies and rotators and achieving a 100% utilization of these components. To guarantee that there is no overlap, one of the data flows needs to enter the pipeline in bit-reversed order [31]. This can be observed in Figure 34, where the lower input path is connected to a block that calculates the bit reversal (BR) before entering stage 3. The upper path flows from stage 1 to stage 3 and is connected to a bit reversal circuit after stage 3.
The timing of the M²DF FFT is shown in Figure 35. The indices of the current FFT are highlighted in black, and those of the previous and following FFTs in the pipeline are highlighted in gray. The upper flow is shown to the left and the lower flow to the right. The upper flow follows the stages 1, 2, 3, BR and 4, whereas the lower flow follows the stages BR, 3, 2, 1 and 4. The clock cycles in which butterflies and rotators process data are indicated by a square. For instance, for the upper data flow, during the first 4 clock cycles data are loaded into the buffer, and the next 4 clock cycles are used to process these data together with the input data. By analyzing the timing, it can be observed that the processing times at stages 1 to 3 of both data flows do not overlap, which allows for reusing the hardware components. Finally, the data from both data flows arrive simultaneously at stage 4 to be processed in parallel.
The 2-parallel radix-2 M²DF FFT architecture uses 2 log₂ N complex adders in butterflies, log₂ N − 2 non-trivial rotators and a total memory of approximately 2N. This memory is the result of adding the buffers at the FFT stages and the circuits for bit reversal, whose optimum implementation is explained in [31]. Higher parallelizations of this architecture and other radices are also possible [80].

Figure 28: Timing of a radix-2 SC FFT architecture.

The MDC FFT Architecture
The MDC FFT architecture was explained in Section 4.2 for the case of 2-parallel data and radix-2. In this section, other MDC FFT architectures are presented. Figure 36 shows a 16-point 4-parallel radix-2 MDC FFT architecture [50]. The timing of the architecture is shown in Figure 37. At each stage, the shuffling circuits place data that differ in b_{n−s} at the input of the butterflies.
For 8-parallel data, Figure 38 shows a 16-point 8-parallel radix-2 MDC FFT. Compared to the 4-parallel MDC FFT in Figure 36, the 8-parallel MDC architecture does not have shuffling circuits at the first two stages. The reason is that, at these stages, the b_{n−s} condition is met by reorganizing the parallel streams, so there is no need to exchange data that arrive at different clock cycles. This can be observed in the timing of Figure 39.
The hardware components of a P-parallel radix-2 MDC FFT architecture are P log₂ N complex adders in butterflies, (P/2)·(log₂ N − 2) non-trivial rotators, and a total memory of size N − P.
For 8-parallel data, a 16-point radix-2² MDC FFT architecture is shown in Figure 42 and its timing is shown in Figure 43.
For a P-parallel radix-2² MDC FFT, the number of complex adders in butterflies is P log₂ N, the number of non-trivial rotators is (3P/8)·(log₂ N − 2) and the total memory is N − P.
The rotator complexity in MDC FFT architectures can be reduced even further by using higher radices, such as radix-2³ or radix-2⁴ [33], and by exploring other data orders that place the rotators in only some of the parallel branches [34,38].

The MSC FFT Architecture
The MSC FFT is the parallel version of the SC FFT architecture. In order to obtain the MSC FFT, the SC FFT is unfolded. As a result, the MSC FFT consists of some stages that include an SC structure and other stages that are calculated as in an MDC. Figure 44 shows a 16-point 4-parallel radix-2 MSC FFT. Its timing diagram is shown in Figure 45. The architecture consists of four stages, where the first two process data as in an SC FFT. For these two stages, samples that are processed together in the butterflies arrive in consecutive clock cycles, as in the SC FFT. Later, butterflies at stages 3 and 4 process data arriving at the same clock cycle at the inputs of the butterflies.
The number of components in an MSC FFT is calculated by considering that the SC-like stages only use half butterflies and half rotators. For a radix-2 MSC, this results in P log₂ N complex adders, (P/2)·(log₂ N − 2) non-trivial rotators and a total memory of size approximately equal to N − P.
In the literature, a combination of radix-2³ and radix-2⁴ has been used for a 128-point MSC FFT [46], which allows for a further reduction of the rotator complexity. However, due to the novelty of the MSC FFT architecture, research still needs to be done before drawing general conclusions about this type of architecture.

Rotations in FFT Architectures
A deep treatment of rotations in FFT architectures is outside the scope of this paper. However, for a better understanding of the FFT architectures, this section provides some basic ideas about rotations in FFTs.
First, it is worth mentioning how to obtain the rotation coefficients in FFT architectures. This can be done from the flow graph of the FFT and the timing diagram of the architecture. The timing diagram shows the data indexes (I) at any time and for every stage of the architecture, whereas the flow graph shows the rotations for each stage and index. Therefore, to determine the rotation coefficients of the FFT we just have to take each index from the timing diagram and obtain the corresponding rotation from the flow graph. As an example, let us consider the 4-parallel FFT architecture in Figure 36, its timing diagram in Figure 37 and its flow graph in Figure 1. At stage 2, data with indexes (12, 13, 14, 15) flow through the lowest path of the architecture. This means that the rotator at this path of the architecture must rotate by the rotations corresponding to these indexes. By checking the flow graph in Figure 1, it can be observed that the rotations for indexes (12, 13, 14, 15) at stage 2 are φ = (0, 2, 4, 6). The exact coefficients are obtained from Eq. 2. For instance, for φ = 4, W₁₆⁴ = e^(−j2π·4/16) = −j. This means that the rotator at the lowest path at the second stage of Figure 36 must rotate the datum with index 14 by −j.
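The coefficients for this example can be checked numerically. The sketch below assumes the standard twiddle-factor definition W_N^φ = e^(−j2πφ/N); the function name is illustrative:

```python
import cmath

def twiddle(N, phi):
    """Rotation coefficient W_N^phi = e^(-j*2*pi*phi/N)."""
    return cmath.exp(-2j * cmath.pi * phi / N)

# Coefficients for the rotations phi = (0, 2, 4, 6) of indexes (12..15) at stage 2:
for phi in (0, 2, 4, 6):
    print(phi, twiddle(16, phi))
# phi = 4 yields W_16^4 = -j, so the datum with index 14 is rotated by -j
```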
When the FFT size is large, it is not easy to represent its flow graph. However, it is still possible to obtain the values of φ at each stage of the FFT flow graph mathematically. This is explained in [70] for any FFT algorithm that can be represented by a binary tree and, more generally, in [27] for any FFT algorithm that can be represented by a triangular matrix. For instance, for a radix-2 DIF FFT, where the index is I = b_{n−1} … b₁ b₀ in binary and n = log₂ N, φ at stage s is obtained as [27]

φ = 2^(s−1) · b_{n−s} · (b_{n−s−1} … b₁ b₀).   (5)

In particular, for stage s = 2 and N = 16, φ = 2 · b₂ · (b₁ b₀). Coming back to the example, the indexes I = (12, 13, 14, 15) in binary are b₃b₂b₁b₀ = (1100, 1101, 1110, 1111). By applying (5), this results in the rotations φ = (0, 2, 4, 6), which are the same values that were obtained from the flow graph.

The implementation of a rotator takes into account the angles that it must rotate at different clock cycles. If all the angles are trivial, the rotator will be a trivial rotator. Otherwise, the rotator will be more complex. When many different angles must be rotated, the rotator is called a general rotator. General rotators are usually implemented by a complex multiplier or by the CORDIC algorithm [78]. When the number of different angles is small, it is feasible to implement the rotators as shift-and-add operations [39], which reduces their complexity.
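The radix-2 DIF rotation formula (5) maps directly to bit operations on the index, which can be verified against the example above. The function name and bit-manipulation style below are mine:

```python
def phi_radix2_dif(I, s, N):
    """Rotation index phi at stage s of a radix-2 DIF FFT of size N,
    following formula (5): phi = 2^(s-1) * b_{n-s} * (b_{n-s-1} ... b_0)."""
    n = N.bit_length() - 1           # n = log2(N) for power-of-two N
    b = (I >> (n - s)) & 1           # bit b_{n-s}: lower output of the butterfly
    low = I & ((1 << (n - s)) - 1)   # number formed by bits b_{n-s-1} ... b_0
    return (1 << (s - 1)) * b * low

# Indexes (12, 13, 14, 15) at stage 2 of a 16-point FFT:
print([phi_radix2_dif(I, 2, 16) for I in (12, 13, 14, 15)])  # [0, 2, 4, 6]
```

This matches the values read from the flow graph, and avoids drawing the flow graph for large N.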
Regarding the storage of rotations, for general rotators it is common to store all the coefficients in a read-only memory (ROM), although it is also possible to use memoryless CORDIC approaches [30] that do not require any memory. This is especially useful for large FFTs [51], where the coefficient memory would otherwise be large. For small rotators implemented as shift-and-add, it is also possible to generate the control signals for the rotator without storing them in a ROM [39].
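As a rough sketch of the ROM-based option, the coefficients a rotator must apply can be precomputed and quantized offline; the fixed-point format below (sign plus 15 fractional bits) is an illustrative assumption, not a scheme from the surveyed works:

```python
import cmath

def twiddle_rom(N, phis, bits=16):
    """Sketch of a coefficient ROM: fixed-point twiddle factors
    for the set of angles a rotator must apply."""
    scale = (1 << (bits - 1)) - 1  # e.g. 32767 for 16-bit words
    rom = []
    for phi in phis:
        w = cmath.exp(-2j * cmath.pi * phi / N)
        rom.append((round(w.real * scale), round(w.imag * scale)))
    return rom

# ROM contents for the stage-2 rotator of the 16-point example:
print(twiddle_rom(16, (0, 2, 4, 6)))
```

A memoryless CORDIC or shift-and-add rotator would instead regenerate each rotation on the fly, trading this ROM for extra logic.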
Nowadays, existing FFT architectures already achieve the minimum number of adders, the smallest memory and the lowest latency. However, the amount and complexity of rotators in FFT architectures still need to be optimized. As a result, recent FFT architectures [34,37,38] focus on reducing the number of rotators and their complexity, which requires studying different data orders and selecting the best among them in terms of rotators. Table 3 compares the different types of serial pipelined FFT architectures. The table includes the type of architectures, the hardware resources in terms of complex adders in butterflies, complex rotators and complex data memory, and the performance in terms of the latency and throughput of the architectures. For simplicity, n = log₂ N is used for reporting the number of adders and rotators.

Comparison
SDF FFT architectures minimize the memory usage and the latency. Among them, radices 4, 2² and 2⁴ also minimize the number of rotators. Conversely, the number of adders in SDF architectures is larger than in other architectures. The only exception is the deep feedback SDF FFT, which minimizes the number of adders at the cost of a larger memory.
Except for its first version, the SDC FFT minimizes the number of adders. Its latest version also minimizes the number of rotators. However, SDC FFT architectures have higher memory usage and latency than other serial pipelined FFT architectures.
The SFF FFT minimizes the number of adders and the latency. Its radix-2² version also minimizes the number of rotators. The drawback is a larger data memory.
For 4-parallel architectures, M²DF, MDC and MSC FFTs achieve the smallest number of adders. The number of rotators in 4-parallel FFT architectures depends not only on the type of architecture, but also on the radix. The 4-parallel radix-2² MDC FFT architecture achieves the smallest number of rotators. However, the complexity of the rotators in the radix-2⁴ MDC FFT is smaller. This fact is detailed in the comparison in [34]. Regarding data memory and latency, MDF, MDC and MSC FFT architectures require a small memory and have a low latency.
For 8-parallel architectures, M²DF, MDC and MSC FFTs achieve the smallest number of adders. The smallest number of rotators is achieved by the radix-2³ MDC FFT architecture. MDF, MDC and MSC FFT architectures have a small data memory and achieve the lowest latency.

Conclusions
This survey paper has presented the main advances in complex-input-data and power-of-two pipelined FFT hardware architectures during the last 50 years. The main types of serial FFT architectures are called SDF, SDC, SFF and SC. All of them process a continuous data flow of one sample per clock cycle. However, they follow different strategies to organize the data flow and perform the calculations. This results in trade-offs among adder, rotator and memory complexity. Regarding parallel FFT architectures, the main types are MDF, MDC and MSC, which are the parallel versions of the SDF, SDC and SC FFTs, respectively.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.