Ultra-High-Throughput EMS NB-LDPC Decoder with Full-Parallel Node Processing

This paper presents an ultra-high-throughput decoder architecture for NB-LDPC codes based on the Hybrid Extended Min-Sum algorithm. We introduce a new processing block that updates a check node and its associated variable nodes in a fully pipelined way, thus allowing the decoder to process one row of the parity check matrix per clock cycle. The work specifically focuses on a rate-5/6 code of size (N, K) = (144, 120) symbols over GF(64). Synthesis results on a 28-nm technology show that, for a complexity of 0.789 M NAND gates, the architecture reaches a decoding throughput of 0.9 Gbps with 30 decoding iterations. Compared to the 5G binary LDPC code of the same size and code rate, the proposed architecture offers a gain of 0.3 dB at a Frame Error Rate of 10^{-3}.


Introduction
Non-Binary (NB) Low-Density Parity-Check (LDPC) codes make it possible to close the performance gap with the Shannon limit [1] when using small or moderate frame lengths. They are defined on high-order Galois Fields (GF) of order q with q > 2 and have been shown to be more robust than convolutional turbo-codes and binary LDPC codes [2]. However, even if they present numerous advantages (see [3,4]), their main drawback is their complexity, which is challenging at the receiver side. In NB-LDPC decoders, the direct application of the Belief Propagation (BP) algorithm [5] leads to O(q^2) complexity and is thus prohibitive for q > 16. A considerable amount of work has therefore been dedicated to reducing the complexity of decoding algorithms and their associated architectures ([6][7][8][9][10][11], among others), with a special focus on the Check Node (CN) processing, which is the major bottleneck in NB-LDPC decoders.
This work focuses on the Extended Min-Sum (EMS) algorithm [12,13], as it still offers one of the most competitive complexity/performance trade-offs [14,15]. For the EMS CN implementation, the Forward-Backward (FB) approach was introduced in [15] as a serial concatenation of Elementary CNs (ECNs). This structure suffers from high latency and low decoding throughput. Other approaches, e.g., the Trellis-EMS (T-EMS) [16], were proposed to reduce latency. Its complexity was further reduced with the one-minimum T-EMS [17] and the Trellis Min-Max (T-MM) [18,19] algorithms. However, all variants of the T-EMS algorithm present a complexity that increases with the cardinality q of the Galois Field.
In this paper, we present innovative ideas that significantly enhance the throughput and the hardware efficiency of prior designs [4,20]. This prior work includes the Syndrome-based algorithm [21], which efficiently performs parallel CN computations for q ≥ 16 and was initially considered for implementing a GF(256) CN processor with a CN degree d_c = 4 [22]. However, its complexity is dominated by the number of computed syndromes, which increases quadratically with d_c. This limits its interest for high coding rates (i.e., high d_c values). A solution was then proposed based on sorting the input vectors according to a reliability criterion [23,24], which significantly reduces the CN hardware complexity without affecting performance. This so-called presorting technique was applied to the syndrome-based architecture in [23] and to the FB architecture in [24]. A hybridization of these two architectures was presented in [20] for high q and d_c values.
In this paper, we extend the work presented in [25] to design an ultra-high-throughput decoder that reaches an average decoding throughput of 14 Gbit/s at high Signal-to-Noise Ratio (SNR). In addition, more algorithmic and hardware details of the decoding blocks are given, along with a global timing diagram of the decoder. Additional state-of-the-art comparisons of simulation and complexity results are presented, together with a hardware simulation curve of the proposed decoder, and the flexibility of the proposed design is highlighted. Moreover, a perhaps unexpected result is obtained: the proposed architecture almost halves the memory bandwidth compared to a classical binary LDPC code. The design includes the presorting technique and an innovative unit that processes both the CNs and the Variable Nodes (VNs) in a fully pipelined architecture [25]. This architecture is called "Full Parallel Hybrid Check Node" (FPHCN) in this paper. For the practical description of the decoder, we consider a specific code, but the proposed principles can easily be generalized to any NB-LDPC code, keeping in mind that their benefits are especially interesting for high rates.
The paper is organized as follows: Sect. 2 introduces notation, NB-LDPC codes, the EMS algorithm principles and the code structure considered in this work. Section 3 recalls the presorting technique and describes the decoding steps and the CN-VN merging technique. Section 4 is dedicated to the global decoder architecture. Simulation results for the proposed code and its binary LDPC counterpart are compared in Sect. 5. Implementation results, throughput analysis and a detailed comparison with the state of the art are presented in Sect. 6. Finally, conclusions and perspectives are discussed in Sect. 7.

Notation, NB-LDPC Codes and EMS Algorithm
This Section introduces NB-LDPC codes, describes the calculation of the intrinsic messages and the principles of the EMS algorithm. Table 1 lists the symbols considered throughout the paper, which include the characteristics of the code, the exchanged messages in the decoder and other terms for the description of the global decoder.

Intrinsic Messages
The exchanged messages in the EMS decoding algorithm are Log-Likelihood Ratio (LLR) values. The intrinsic LLR values are computed from the observation coming from the channel. Considering an Additive White Gaussian Noise (AWGN) channel and a Binary Phase-Shift Keying (BPSK) modulation, the GF symbol x is modulated over m BPSK channels with amplitude B(x_p) = (−1)^{x_p}, p = 0, …, m − 1. At the receiver side, the received samples r_p are expressed as

r_p = B(x_p) + w_p,

where w_p is a realization of a Gaussian noise of variance σ². Let Y = (y_0, y_1, …, y_{m−1}) be the intrinsic LLR vector associated to x. Each value y_p is defined as

y_p = (2/σ²) r_p, (2)

and the hard decisions x̄_p are defined such that x̄_p = 0 if sign(y_p) > 0, and x̄_p = 1 otherwise. Under the hypothesis that all the symbols of the GF(q) alphabet are equiprobable, the LLR I^+(x) of a symbol x knowing y is expressed as

I^+(x) = Σ_{p=0}^{m−1} Δ(x_p, x̄_p) |y_p|, (3)

where Δ(x_p, x̄_p) = 0 if x_p = x̄_p and Δ(x_p, x̄_p) = 1 otherwise. Note that, by definition, I^+(x̄) = 0 is the smallest LLR value. Let Π = {π(0), π(1), π(2)} be respectively the indices of the smallest, second smallest and third smallest magnitudes |y_p|, p = 0, 1, …, m − 1. Then, the second smallest LLR value I^+[1] is obtained by flipping the bit of x̄ of index π(0) to obtain I^⊕[1], i.e., I[1] = (I^+[1] = |y_{π(0)}|, I^⊕[1]). The third smallest value I^+[2] is obtained by flipping the bit of x̄ of index π(1) to obtain I^⊕[2]. Thus, I[2] = (I^+[2] = |y_{π(1)}|, I^⊕[2]). Finally, the fourth smallest LLR value I^+[3] is given by min(A, B) with A = |y_{π(0)}| + |y_{π(1)}| and B = |y_{π(2)}|. If min(A, B) = A, the associated GF value I^⊕[3] is obtained by flipping the bits of indices π(0) and π(1) of x̄; otherwise, if min(A, B) = B, I^⊕[3] is obtained by flipping the bit of index π(2) of x̄. This method, and its generalization to compute in parallel the first n_m^in terms of the intrinsic vector I, is described in detail in [26].
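The bit-flipping construction of the first four intrinsic couples can be sketched as follows. This is an illustrative Python model of the method described above (not the hardware generator of [26]); GF symbols are represented as m-bit integers.

```python
def intrinsic_first_four(y):
    """Return the 4 smallest (LLR, GF symbol) couples I[0..3] for LLR vector y."""
    m = len(y)
    # Hard decision: bit p of x_bar is 0 if y_p > 0, else 1.
    x_bar = sum((0 if yp > 0 else 1) << p for p, yp in enumerate(y))
    # pi(0), pi(1), pi(2): indices of the three smallest magnitudes |y_p|.
    order = sorted(range(m), key=lambda p: abs(y[p]))
    p0, p1, p2 = order[0], order[1], order[2]
    I = [(0.0, x_bar)]                              # I[0]: LLR 0 by definition
    I.append((abs(y[p0]), x_bar ^ (1 << p0)))       # I[1]: flip bit pi(0)
    I.append((abs(y[p1]), x_bar ^ (1 << p1)))       # I[2]: flip bit pi(1)
    A = abs(y[p0]) + abs(y[p1])                     # flip pi(0) and pi(1)
    B = abs(y[p2])                                  # flip pi(2)
    if A <= B:
        I.append((A, x_bar ^ (1 << p0) ^ (1 << p1)))  # I[3] = min(A, B)
    else:
        I.append((B, x_bar ^ (1 << p2)))
    return I
```

For instance, with y = (2.0, −0.5, 3.0, −1.0, 4.0, 5.0), the hard decision is x̄ = 10 (bits 1 and 3 set) and the three least reliable bit positions are 1, 3 and 0.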
Finally, for hardware design, the LLR values need to be quantized with fixed-point precision. To do so, we use the expression

y_p = sat(⌊α r_p + 0.5⌋, Q),

where sat(·, Q) saturates its argument to [−Q, Q], ⌊x⌋ denotes the floor function, and Q = 2^{b−1} − 1 is the saturation value expressed as a function of b, the number of quantization bits (for b = 6, Q = 31). The fixed scaling factor α encompasses the factor 2/σ² found in the LLR expression of (2). Its value is set empirically to α = 1.2 in order to optimize the decoding performance. In the hardware architecture, we consider n_m^in = 4. The intrinsic vector I associated to a given received symbol Y is thus composed as I = (I[0], I[1], I[2], I[3]). Note that, with b = 6 and m = 6, the size n_I of I is given by n_I = 3b + 4m = 42 bits (I^+[0] = 0 needs not be stored, hence 3 non-zero LLR values on b bits and 4 GF symbols on m bits). Associated to the symbol Y, the pre-processed vector Ĩ is defined as Ĩ = (|y_0|, …, |y_{m−1}|, π(0), π(1), π(2), x̄); it allows I to be reconstructed easily with few hardware resources, and I^+(x) to be computed for any x ∈ GF(64) using (3). The size ñ_I for b = 6 and m = 6 is ñ_I = 6 × 5 + 3 × 3 + 6 = 45 bits (b − 1 = 5 bits to encode each magnitude |y_p| and 3 bits to encode each index of Π).
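The fixed-point quantization above can be written as a one-line helper. This is a sketch of the stated formula, with α = 1.2 and b = 6 as given in the text.

```python
import math

def quantize(r, alpha=1.2, b=6):
    """Fixed-point LLR quantization: y = sat(floor(alpha * r + 0.5), Q)."""
    Q = 2 ** (b - 1) - 1                # saturation value, Q = 31 for b = 6
    v = math.floor(alpha * r + 0.5)     # scale and round to nearest integer
    return max(-Q, min(Q, v))           # saturate to [-Q, Q]
```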

Extended Min-Sum Algorithm for NB-LDPC Codes
A detailed description of the different steps and equations of the Min-Sum (MS) algorithm was presented in [4]. We consider in this paper the Extended Min-Sum (EMS) algorithm [14] with the following characteristics: VN degree d_v = 2, CN input messages truncated to a size n_m^in ≪ q and CN output messages truncated to a size n_m^out ≪ q. This leads to a reduction in computation and storage without necessarily incurring a performance loss [12,13].
For the EMS algorithm description, we define the following steps:

1. Message format: each U_i message is composed of n_m^in couples U_i[j] = (U_i^+[j], U_i^⊕[j]), j = 0, …, n_m^in − 1, where the LLR values U_i^+[j] are sorted in increasing order and U_i^⊕[j] is the associated GF symbol.

2. CN update: the LLR associated to an output GF symbol x is obtained as

V_i^+(x) = min { Σ_{i'≠i} U_{i'}^+(x_{i'}) : ⊕_{i'≠i} x_{i'} = x }, (6)

where ⊕ refers to the GF addition (i.e., XOR gate). The final stage is to partially sort in increasing order the set of values V_i^+(x) indexed by x ∈ GF(q) to obtain an ordered set V_i^⊕ = {x_0, x_1, …, x_{n_m^out − 1}} that verifies V_i^+(x_0) ≤ V_i^+(x_1) ≤ … ≤ V_i^+(x_{n_m^out − 1}), with V_i^+(x_{n_m^out − 1}) ≤ V_i^+(x) for all x ∉ V_i^⊕. The i-th output message is thus given as V_i[j] = (V_i^+(x_j), x_j), j = 0, …, n_m^out − 1. In the state of the art, the GF values outside V_i are associated with a default LLR value D_i, with D_i = V_i^+[n_m^out − 1] + O, i.e., the default value is equal to the highest LLR value of the V message plus a positive offset O (see [13] for more details on the definition of the offset value). In Sect. 3.1, we propose a new method to determine the default value D_i that facilitates the hardware implementation of the full-parallel CN architecture.

3. VN update: after processing CN a, the required inputs of a VN are: the intrinsic vector I of size n_m^in; the m values of Y, to be able to compute I^+(x) for any x ∈ GF(q) using (3); the message V_a of size n_m^out received from CN a; the default value D_a associated to message V_a; and the message U_a sent to CN a (U_a encompasses both the I information and the updated message V_b coming from CN b; note that U_a is used only for the decision process). The two outputs of the VN are the current message U_b of size n_m^in to be sent to CN b and the VN decision x̃ obtained by combining V_a and U_a. The first step of the VN processing is the addition of the intrinsic LLR values to the incoming V_a message to generate the message V̄_a defined as V̄_a[j] = (V_a^+[j] + I^+(V_a^⊕[j]), V_a^⊕[j]). Since V̄_a associates LLR values to only a subset of GF values, a second message Ī is generated in parallel as Ī[j] = (I^+[j] + D_a, I^⊕[j]). Then, the n_m^in smallest values of the set V̄_a ∪ Ī in terms of LLR are extracted, along with their associated GF symbols, to generate the vector message Ū_b. Note that, by construction, V̄_a^⊕ ∩ Ī^⊕ may not be empty. In that case, the corresponding LLR element in Ī is saturated so that Ū_b contains the n_m^in smallest LLR values with distinct GF values. The last step to generate the final message is the normalization process that sets the first LLR of the message to zero, i.e.,

U_b^+[j] = Ū_b^+[j] − Ū_b^+[0]. (9)

4. Decision making: in the MS algorithm, the decision is made by adding all the incoming information. In the EMS algorithm [4], this process is simplified first by considering that U_a already contains the summation V_b + I, then by pruning the n_m^in elements of vector U_a to its first 3 elements. The merging of the n_m^out elements of V̄_a and the first three elements of U_a gives V_T, and the decision x̃ is the GF symbol of minimum LLR in V_T. Note that U_a^+[0] = 0 and that, compared to [4], the new version omits the case where V̄_a^⊕[j] = U_a^⊕[2].
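The VN update of step 3 can be sketched as follows. This is an illustrative model under simplifying assumptions: messages are lists of (LLR, GF symbol) couples sorted by LLR, I_plus is a function returning the intrinsic LLR of any GF symbol (Eq. (3)), and all names are ours, not the hardware signal names.

```python
def vn_update(V_a, I, D_a, I_plus, n_m_in=4):
    """One VN update: merge the CN message V_a with the intrinsic vector I."""
    # Step 1: add the intrinsic LLR of each GF symbol carried by V_a.
    Vbar = [(llr + I_plus(x), x) for (llr, x) in V_a]
    # Step 2: candidates from the intrinsic vector, penalized by the
    # default value D_a of the incoming message.
    Ibar = [(llr + D_a, x) for (llr, x) in I]
    # Step 3: keep the n_m_in smallest couples with distinct GF symbols
    # (a GF value already present in Vbar discards its duplicate in Ibar).
    seen, Ub = set(), []
    for llr, x in sorted(Vbar + Ibar):
        if x not in seen:
            seen.add(x)
            Ub.append((llr, x))
        if len(Ub) == n_m_in:
            break
    # Step 4: normalization so that the first LLR is zero (Eq. (9)).
    base = Ub[0][0]
    return [(llr - base, x) for (llr, x) in Ub]
```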
Before describing the flooding algorithm, we present the NB-LDPC code implemented in the paper.

Code Structure
The code considered in this work is a (N, K) = (144, 120) NB-LDPC code defined over GF(64) with d_v = 2, d_c = 12 and code rate r = 5/6. This code is a Quasi-Cyclic LDPC code constructed from a complete 2 × 12 base matrix H with an expansion factor of 12. During the lifting process, every element H(i, j), i = 0, 1 and j = 0, 1, …, 11, is replaced by the 12 × 12 identity matrix with a right-shift rotation equal to H(i, j). Based on this definition, the resulting parity check matrix is of size (M, N) = (24, 144) in GF(64). The equivalent binary size is thus (6M, 6N) = (144, 864). We restrict our architectural study to the case d_v = 2, since it is shown in [27] and [28] that this low degree achieves very good performance for NB-LDPC codes over high-cardinality fields GF(q), typically q ≥ 64.
Let us define layer one (L_1) as the set of CNs with indices 0 to 11 and layer two (L_2) as the set of CNs with indices 12 to 23. Then, any variable node is connected to exactly one CN in L_1 and one CN in L_2. According to the parity check equation given in (1), the 12 indices k(j, i) of the j-th parity check are given by k(j, i) = j + 12i for j = 0, 1, …, 11 and k(j, i) = mod(j + i, 12) + 12i for j = 12, 13, …, 23.
The GF coefficients {h_{j,k(j,i)}}, i = 0, 1, …, 11, of the first layer L_1 and of the second layer L_2 complete the code definition.
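The connectivity rule above can be checked with a short script: each of the 144 VNs must appear exactly once in each layer.

```python
def vn_indices(j):
    """Indices k(j, i) of the 12 VNs connected to CN j (0 <= j <= 23)."""
    if j < 12:                                       # layer L1
        return [j + 12 * i for i in range(12)]
    return [((j + i) % 12) + 12 * i for i in range(12)]  # layer L2
```

Flattening each layer and sorting should recover 0, 1, …, 143 exactly once, confirming that every VN has degree d_v = 2 with one CN per layer.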

Flooding Scheduling
The decoding process iterates until a maximum number of iterations (n_max,it) is reached or the M parity equations are satisfied. In each iteration, M CN updates and M × d_c VN updates are performed. At the end of every iteration, a decision is taken on the N VNs. Let l = 0, …, n_max,it − 1 be the iteration number. In the following, every vector message processed at iteration l carries the superscript (l). The decoding process is described in Algorithm 1.

Pipelined CN-VN Unit
This section presents a pipelined architecture able to process a CN of degree d_c = 12 and its 12 associated VNs every CC. To build this parallel architecture, we first recall the principle of the hybrid CN architecture [20]. Based on this formalism, we then describe the result of the optimization process of the CN architecture for the rate-5/6, N = 144 code considered in this paper. Finally, an innovative method to merge VN and CN processing is presented.

Principle of the Hybrid CN Architecture
In [20], the authors present the principle of the hybrid CN architecture. The three main functions performed by this CN are recalled: the presorting, the ECN processing and the decorrelation using the Valid Syndrome Vector (VSV).

Presorting
The preliminary step of CN processing is the presorting block, which leads to a significant reduction of the CN computations. In [24], the authors proposed sorting the CN input vectors based on the LLR value of their second GF element. This sorting polarizes the reliability of the input vectors and classifies them into two sets: high reliability and low reliability. As described in Sect. 2.2, the first element (i.e., the most reliable symbol) always has a zero LLR value. The sorting criterion is the following: the higher the LLR value of the second element, the higher the reliability of the vector. In other words, a large difference between the first and second LLR values indicates that the competition between the GF symbols in the vector will clearly favor the first one, and the others will rarely contribute to the final decision on the codeword. Discarding them leads to a computational reduction without any performance loss. As a consequence, presorting helps the CN concentrate its processing effort on the low-reliability vector messages. In the example of Fig. 1, the second LLR values of each message (U_0 to U_3) are considered as inputs to a sorter block. Then, the d_c input messages are switched based on the sorted indices. The high-reliability messages are concentrated in one region and the dashed elements are discarded prior to the CN processing. More details on the presorting technique are presented in [23,24] and [29].
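A minimal model of the presorting step follows. It assumes each input message U_i is a list of (LLR, GF symbol) couples with U_i[0] having LLR 0; messages are reordered so that the least reliable vectors (smallest second LLR) come first. The ordering convention is illustrative; the hardware switch network is described in Fig. 4.

```python
def presort(messages):
    """Reorder CN input messages by the LLR of their second element.

    Returns the permutation (sorted indices) and the permuted messages,
    least reliable (smallest second LLR) first.
    """
    order = sorted(range(len(messages)), key=lambda i: messages[i][1][0])
    return order, [messages[i] for i in order]
```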

ECN
For the implementation of the CN processing, Eq. (6) is implemented in a simplified way using the hybrid CN architecture defined in [20]. The whole CN architecture is characterized graphically in Fig. 2a). Let us describe, from top to bottom, the graphical elements used in this design and the corresponding processing.
First, the number of elements of U ′ i , i = 0, … , 11 , that enter the CN is indicated by the number of circles below it.
The outputs of the multipliers enter a datapath composed of a network of ECNs. Each ECN performs the bubble-check algorithm. Let us give the key to understanding the processing performed by the generic ECN of Fig. 2b). An ECN receives two input vectors A and B of sizes n_a and n_b, given by the number of circles (or bubbles) in the first column and in the first row, respectively (n_a = 4 and n_b = 3 in Fig. 2b)). It generates an output message C of size n_c. Note that in Fig. 2a), the output size is implicitly defined as the number of inputs (i.e., the number of vertical bubbles) of the next ECN. A circle in position (t_0, t_1), t_0 = 0, …, n_a − 1 and t_1 = 0, …, n_b − 1, represents a candidate of LLR A^+[t_0] + B^+[t_1] associated to the GF value A^⊕[t_0] ⊕ B^⊕[t_1]. The n_c bubbles of minimum LLR, sorted in increasing order, constitute the output vector of the ECN. Each output element is appended with a Boolean value of the VSV that indicates whether U_c[t_2], t_2 = 0, …, n_c − 1, can contribute to a valid syndrome; bubbles in dark color append a false Boolean value to the corresponding position in the VSV vector.
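The ECN function just described can be modeled in a few lines. This exhaustive version only defines what an ECN computes; a real bubble-check implementation explores just a subset of the (t_0, t_1) grid, and the VSV bookkeeping is omitted here.

```python
def ecn(A, B, n_c):
    """Reference model of an ECN.

    A, B: input vectors as lists of (LLR, GF symbol) couples sorted by LLR.
    Returns the n_c candidates of minimum LLR, sorted in increasing order,
    where a candidate (t0, t1) has LLR A[t0] + B[t1] and GF value
    A_gf[t0] XOR B_gf[t1].
    """
    bubbles = [(la + lb, ga ^ gb) for (la, ga) in A for (lb, gb) in B]
    bubbles.sort(key=lambda c: c[0])
    return bubbles[:n_c]
```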
The dashed line labeled T+6 at the output of ECN9 and ECN10 indicates that three pipeline stages are inserted in these blocks. The pipeline labeling starts at T+0, where the first pipeline stage, made of the input registers storing the inputs of the presorting architecture, is inserted; it is followed by three pipeline stages inserted in the presorting architecture (see Fig. 4). The labeling T+i continues through the CN-VN architecture, indicating the position at which each pipeline stage is inserted. The three last ECNs (ECN11, ECN12 and ECN13) are slightly simplified compared to the other ECNs because all the bubbles are output without any sorting. In fact, one of the main ideas of the architecture is to save hardware complexity by postponing the sorting operation to the VN processing. Since no sorting is performed, the default value D of the check-to-variable message cannot be determined. Thus, we propose to empirically set it to the LLR of a fixed bubble position, indicated by D_9, D_10 and D_11 in ECN11, ECN12 and ECN13, respectively. Note that the size of the output message S of ECN11 is n_S = 20, while the size of the messages of ECN12 and ECN13 is n_FB = 16. Any element of S is given by the summation of all the incoming messages; a decorrelation process is thus required [20]. It suppresses the contribution of the GF symbol U'_i^⊕, thanks to the VSV vector that is checked by the DB. The final multiplication is applied to the GF value to compute the output. These operations are performed in one CC, for a total latency of three CCs for this part of the CN. The architecture of the parallel pipelined presorting block is shown in Fig. 4; it is inspired from [30]. Every comparator-swap receives two inputs and orders them according to their second LLR value. After the CN processing, the VN and DM blocks operate in parallel to perform the VN update and the decision on every input. There are 12 VN and 12 DM blocks associated to a CN.
The VN and DM processing were described in Sect. 2.3. Figure 5 illustrates the parallel architecture of the VN. The eLLR block generates the intrinsic LLR value I^+(x) for any GF symbol x using (3). Finally, the normalization of Ū_b^+ (9) is performed. Eight pipeline stages are inserted in the VN architecture. Figure 6 shows the parallel architecture of the DM block. It contains comparators operating in parallel to determine the GF symbol of minimum combined LLR, i.e., the decision.

Proposed Parallel and Pipelined Decoder
This section describes the global architecture of the decoder as well as the inputs/outputs of each block. The global decoder is based on the CN-VN unit described in Sect. 3, which has been customized to offer the best performance-complexity trade-off for the considered code. This CN-VN unit can be modified to meet the specifications of any other NB-LDPC code and thus to design the associated decoder.

Architecture Overview
The architecture of the global decoder is shown in Fig. 7. The 144 symbols of a received frame are input in 18 CCs by groups of 8 symbols. The input order is given by the layer L_1 order of the parity check matrix when it is read line by line in blocks of 8 symbols (see Fig. 8). Thus, when the input start is set to one to indicate the arrival of a new frame, the first input Y is equal to {Y_0, Y_12, Y_24, Y_36, Y_48, Y_60, Y_72, Y_84} (all of them belong to CN_0). At the second CC, Y is equal to {Y_96, Y_108, Y_120, Y_132, Y_1, Y_13, Y_25, Y_37} (the first 4 symbols belong to CN_0 and the last 4 symbols belong to CN_1), and so on. The size of the input Y is 288 bits (8 symbols, each composed of m = 6 LLR values quantized on b = 6 bits). The outputs of the global architecture are the signal Ready, which indicates both the end of the decoding process (output Ĉ valid) and the availability of the decoder to receive a new frame, the decoded frame Ĉ, composed of the K = 120 GF(64) information symbols of the transmitted message (a total of 120 × m = 720 bits), and the signal decod_ok, which is set to one when the decoding process has succeeded, i.e., when all parity checks are satisfied.
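The input ordering described above can be generated programmatically: the L_1 rows (CN j connected to VNs j + 12i) are read line by line and grouped by 8 symbols per clock cycle.

```python
def input_schedule():
    """Symbol indices entering the decoder, one list of 8 per clock cycle."""
    # Layer L1 read line by line: row j lists the 12 VNs of CN j.
    sequence = [j + 12 * i for j in range(12) for i in range(12)]
    # 144 symbols grouped by 8 -> 18 clock cycles.
    return [sequence[c:c + 8] for c in range(0, 144, 8)]
```

The first two groups match the text: all 8 symbols of cycle 0 belong to CN_0, while cycle 1 mixes the last 4 symbols of CN_0 with the first 4 of CN_1.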
The internal structure of the decoder is composed of 6 blocks. The LLR block performs the processing described in Sect. 2.2. From the 8 input symbols, the LLR block generates the 8 associated intrinsic vectors I (total size 8n_I = 336 bits), which are directly stored in RAM U, and the 8 associated pre-processed intrinsic vectors Ĩ (total size 8ñ_I = 360 bits), which are directly stored in RAM Y. The LLR block is composed of 8 parallel symbol LLR generators, each receiving the 6 binary LLRs of a symbol and generating the vectors I and Ĩ. The reader can refer to [26] for more details about the LLR generator architecture. From RAM Y, RAM U and ROM H (ROM H contains only the GF coefficients of the parity checks), the 12 information vectors related to a given CN are sent to the CN-VN component. After processing, the vector U_b is stored back into RAM U while the decision vector is sent to the Parity Test block. This block verifies whether the current decoded frame is a codeword or not, based on (1). Finally, the control unit synchronises the components and generates the read/write instructions of the memory blocks.
Let us describe in more detail the internal structure of the memory blocks (RAM U, RAM Y and ROM H), the Parity Test block and the overall control block, along with the timing diagram. Note that the CN-VN unit has already been detailed in Sect. 3.

Memory Blocks
Recalling Sect. 2.4, after the expansion of the prototype matrix H (see (12)), the obtained PCM is of size (M, N) = (24, 144) with two layers L_1 and L_2. There are three types of memories in the decoder: the extrinsic RAM (RAM U), the intrinsic RAM (RAM Y) and the ROM (ROM H) that stores the h coefficients of the PCM. Figure 8 shows the organization of the extrinsic RAMs. After the processing of CN_0, the 12 output messages {U_b_0, …, U_b_11} are written into the cells associated to their VNs: U_b_0 is associated to VN_0 in the second layer and hence it is stored in RAM_0[12]; U_b_1 is associated to VN_12 and hence it is stored in RAM_1[23], ...; U_b_11 is associated to VN_132 and it is stored in RAM_11[13]. Therefore, each RAM_i requires its own write address A_w_i, i = 0, …, 11. Every cell of a RAM stores 42 bits: 4 GF symbols (6 bits each) and 3 non-zero LLR values (6 bits each). Furthermore, since the latency of the CN is 16 CCs, some messages updated in L_1 are directly used in L_2. These messages are highlighted in grey in Fig. 8. In other words, the decoding process is not completely flooding (recall Sect. 2.5).
The intrinsic RAMs store the information related to the intrinsic LLR messages of the N = 144 VNs. These VNs are organized in RAM blocks similarly to RAM L 1 part shown in Fig. 8. For instance, the intrinsic messages of VN 0 are stored in the first cell of the first RAM block (RAM 0 [0] ), while the intrinsic messages of VN 50 are stored in the third cell of the fifth RAM block (RAM 4 [2] ), and so on. The required information is concatenated to be stored in each cell of length ñI = 45 bits. Every intrinsic RAM has its own read address and write address.
The non-zero elements of the PCM and their inverse are stored in a ROM block. Due to the specific code construction (Sect. 2.4), the ROM has only 2 words, one for each layer, where each word is of size equal to (6 × 2) × 12 = 144 bits since every non-zero GF value h i and its inverse h −1 i consists of 6-bit words, and i = 0, … , 11.
It is interesting to evaluate the memory bandwidth of the proposed architecture per iteration and per symbol, and then per bit. In one iteration, a VN is involved in two CNs. For each CN, it reads Ĩ from the intrinsic RAM (ñ_I = 45 bits), reads the U_a message from the extrinsic RAM (42 bits, from 4 GF symbols and 3 non-zero LLRs) and writes back the U_b message (42 bits) to the extrinsic RAM. Thus, the total number of read/write bits to process a symbol during an iteration is 2(45 + 2 × 42) = 258 bits. Since a symbol carries 6 bits of information, this gives on average 43 bits of read/write memory access per codeword bit per decoding iteration. This number should be compared to a binary LDPC decoder. Assuming d_v = 3 and a soft-output-based CN architecture [31], with the soft output coded on 8 bits and the extrinsic on 6 bits, each iteration requires d_v × 2(8 + 6) = 84 bits of read/write memory access per message bit per decoding iteration. The conclusion, which may go against common belief, is that an NB-LDPC code can decrease the memory bandwidth by almost 50% compared to a binary LDPC code. The memory size is also reduced, from (8 + 3 × 6) = 26 bits per message bit for the binary LDPC code down to (45 + 2 × 42)/6 = 21.5 bits on average per message bit for the NB-LDPC code.
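The bandwidth accounting above, reproduced as arithmetic (all sizes in bits):

```python
def nb_ldpc_bandwidth(n_I_tilde=45, n_U=42, m=6):
    """Read/write bits per codeword bit per iteration for the NB-LDPC decoder."""
    # Per iteration a VN takes part in 2 CNs; for each it reads I~ and U_a
    # and writes back U_b.
    per_symbol = 2 * (n_I_tilde + 2 * n_U)   # 258 bits per GF(64) symbol
    return per_symbol / m                    # a symbol carries m = 6 bits

def binary_ldpc_bandwidth(d_v=3, soft=8, ext=6):
    """Same figure for a soft-output binary LDPC decoder, as assumed in the text."""
    return d_v * 2 * (soft + ext)
```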

Parity Test Block
The Parity Test block performs the test of all the M = 24 parity check equations based on Eq. (1). During the first 12 CCs, while the CN-VN unit outputs the decisions taken during the processing of layer L_1, the parity checks are tested on the fly and the 144 symbols of the decoded codeword Ĉ are stored in a register bank. During the next 6 CCs, the 12 parity checks of the second layer are tested thanks to two parity check units working in parallel. If the M = 24 equations are satisfied, the decoding process ends with the decoded codeword. The Boolean decod_ok is sent to the control unit to stop the decoding process of the current frame and start a new one. The output vector Ĉ is also available at the output of the block for the case where the maximum number of iterations is reached and the decoding process ends.

Control Unit and Decoding Scheduling
The control unit block controls the read/write operations from/to the RAM ROM Banks. A start signal indicates the arrival of the observed symbols and hence the control signals of the RAM ROM Banks are generated based on a counter in the Control Unit (CU).
The control of the decoder works with a periodicity of 2 × 24 = 48 CCs, as shown in Fig. 9. In this figure, four frames in different phases are presented: frame k − 2 (white) is decoded at cycle −1; frame k − 1 (blue) is still being processed; frame k (grey) is being received at cycle 0, just after the decoding of frame k − 2; and frame k + 1 (cyan) is being received after the decoding of frame k − 1. The N = 144 received symbols Y of frame k are received in 18 CCs, from cycle 0 up to cycle 17, by groups of 8 symbols, and sent directly to the LLR block. After two CCs of latency, the LLR block generates all the side information related to the received symbols ({I, Ĩ}). The data are stored in their appropriate locations in the intrinsic and extrinsic memory RAMs. At cycle 18, all the intrinsic information of the VNs connected to CN_0 is stored in memory. The processing of layers L_1 and L_2 for frame k starts at cycle 19 and takes 24 cycles, completing at cycle 19 + 24 − 1 = 42.
Then, 10 cycles after the beginning of the processing of the first CN of frame k, i.e., at CC number 18 + 9 = 27, the decisions on the VNs associated to the first CN are output (X). After 18 CCs (see the Parity Test block description), a codeword is declared decoded (decod_ok = 1) if all the decisions generated by layer L_1 verify the M parity checks (at CC number 47), just in time to start the loading of a new codeword at cycle 48. As seen in Fig. 9, the processing of a given frame occupies a given component for at most 24 CCs. Among the 48 cycles of processing of a given frame, there are at least 24 cycles during which some components of the CN-VN unit are idle and can be used to process another frame. Thus, two frames are always present in the CN-VN unit and processed in parallel. Note that, since the number of iterations may differ from one frame to another, the order of the decoded frames at the output of the decoder may differ from the order in which they entered it.
The number of CCs to decode a frame is thus 48 × n_it,f, where 0 < n_it,f ≤ n_max,it is the number of iterations needed to decode the frame. Since two frames are decoded in parallel, the average number of CCs to decode a codeword is 24 × n_av,it.
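This cycle count translates directly into a throughput model: K = 720 information bits are delivered every 24 × n_av,it CCs. The 900 MHz clock below is our assumption, chosen so that 30 iterations yield the 0.9 Gbps figure of the abstract (throughput counted in information bits); it is not a synthesis result.

```python
def throughput_gbps(n_av_it, f_clock_hz=0.9e9, K=720):
    """Average information throughput in Gbps for n_av_it average iterations.

    Two frames are decoded in parallel, so a codeword takes 24 * n_av_it CCs.
    """
    return K * f_clock_hz / (24 * n_av_it) / 1e9
```

With this model, the 14 Gbit/s average throughput at high SNR corresponds to roughly two average iterations per frame.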
This parallelism in the simultaneous processing of two consecutive frames requires the duplication of the intrinsic and extrinsic RAMs to store the data of two frames.
To summarize, looking at the global execution of the decoder, the 19 CCs of latency for preparing the data (shown in Fig. 9) and the 16 CCs of latency of the CN are not counted in the execution time of the decoder (which has a direct impact on the throughput). We also note that without the duplication of the RAMs, which enables the parallel processing of two consecutive frames, the 16 CCs of CN latency would have to be counted as part of the execution time at each iteration. This is because CN_23 and CN_0 share the same variable node VN_12, which prevents the start of the second iteration before the processing of CN_23 has ended. Therefore, M = 24 CCs is the latency of one iteration.

Simulation Results
We consider Monte Carlo simulations over the AWGN channel with BPSK modulation and LLR values quantized on b = 6 bits. Figure 10 shows simulation results for the BP decoder [5], the well-known FB-CN EMS decoder [32], the proposed decoder, the FB-CN Min-Max decoder [33], and the binary Sum-Product (SP)-based decoder. The BP, FB-CN EMS and FB-CN Min-Max decoders use the same NB-LDPC code: K = 120 GF(64) symbols, N = 144 GF(64) symbols and CR = 5/6 (equivalently, K = 720 bits and N = 864 bits). The SP-based binary LDPC code has length N = 864 bits, K = 720 bits and CR = 5/6, but is defined over GF(2). The BP, FB-CN EMS and FB-CN Min-Max decoders are simulated with layered scheduling, while the proposed decoder, in its hardware version, results in a "partially layered" scheduling. This partial layering comes from the fact that the new parallel decoder starts a new CN processing at each clock cycle, so the second layer of CNs is reached before the VNs have been fully updated with the check node messages of the first layer. Note that adding idle clock cycles could solve this issue but would reduce the decoding throughput. The CNs are directly computed without waiting for updated data. Thus, some entries of the 12 CNs of the second layer benefit from the updated data of the current iteration (highlighted in grey in Fig. 8), while the others use the updated data of the previous iteration, as in flooding scheduling. In summary, the convergence speed of the decoder lies between that of layered scheduling and that of flooding scheduling.
A performance loss of 0.48 dB is observed between the proposed decoder and the reference floating-point BP and fixed-point FB-CN decoding algorithms with 8 iterations, n_m = 16 and n_op = 18. This loss is reduced to 0.08 dB when the number of iterations is increased to 30. Although the proposed decoder is implemented with a maximum number of iterations equal to 30, it is the average number of iterations that is taken into consideration to determine the average decoding throughput. This is discussed in more detail in the next section. Compared with the FB-CN Min-Max layered decoding for n_max,it = 8, the proposed decoder shows slightly better performance than the Min-Max algorithm in the waterfall region (10^-1 ≤ FER ≤ 10^-6). When compared to its binary LDPC counterpart, the proposed decoder presents a gain of 0.3 dB at a FER of 10^-3. It is worth mentioning that the NB-LDPC code offers an important advantage in terms of spectral efficiency, since high-order modulations are well suited to NB-LDPC codes designed over GF(q > 2), where no iterative demodulation is needed. To evaluate performance in a short time, the complete digital communication chain is implemented on an FPGA device. The source, encoder, channel and decoder are implemented in VHDL. The source generates random bits that are encoded, BPSK modulated, affected by AWGN, then demodulated and decoded. A hardware discrete channel emulator is implemented to emulate the AWGN channel. We used the Xilinx KC705 FPGA DevKit containing a Kintex 7, on which the simulation and emulation results match.
As Fig. 10 shows, the error floor of the proposed decoder starts from E_b/N_0 = 5.25 dB, where FER = 10^-7. This is due to the significant simplifications made to the EMS algorithm in order to reduce its complexity, mainly: 1) the predefined offset value; 2) the new redundancy elimination process in the VN block, which is split into two phases (sorting, then suppression of redundant entries); 3) the reduction of the number of intrinsic symbols considered in the DM block from 3 down to 2; 4) the significant reduction of the number of bubbles in the ECN units.

Implementation Results
This section discusses the throughput calculation and the post-synthesis results on a 28-nm FDSOI technology. We recall that the structure of the NB-LDPC code considered in this work, along with all the parameters, has been described in Sect. 2.4. The maximum number of iterations allowed in the decoding process is set to 30. However, since the decoding throughput is mainly dominated by the average number of iterations n_av,it, we study its variation versus the SNR (i.e., E_b/N_0) and then evaluate the average decoding throughput accordingly. The average decoding throughput T, expressed in Giga bits per second (Gbps), is

T = (K × F_clk) / (n_av,it × M × L_CN),

where K = 720 is the number of information bits per frame, M = 24 the number of CNs processed per iteration, L_CN the latency of the CN-VN and F_clk the clock frequency of the design expressed in MHz.
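The throughput expression can be checked numerically. A minimal sketch (function name and unit handling are ours) reproduces the 0.9 Gbps figure quoted in the abstract for the FPHCN parameters:

```python
def avg_throughput_gbps(k_bits, f_clk_mhz, n_av_it, m, l_cn):
    # T = K * F_clk / (n_av_it * M * L_CN), converted from bits * Hz to Gbps
    return (k_bits * f_clk_mhz * 1e6) / (n_av_it * m * l_cn) / 1e9

# FPHCN: K = 720 bits, F_clk = 900 MHz, 30 iterations, M = 24, L_CN = 1
print(avg_throughput_gbps(720, 900, 30, 24, 1))  # 0.9
```

With the serial architecture's L_CN = 41, the same expression shows why its throughput is roughly an order of magnitude lower at equal iteration counts.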
For the FPHCN, the synthesis results give a maximum clock frequency F_clk = 900 MHz with a latency L_CN = 1. For the serial hybrid architecture, the maximum clock frequency was reported as 800 MHz in [20]. A new synthesis of the hybrid architecture performed by the authors increases the maximum clock frequency up to 1000 MHz; this updated value is considered in the comparison. The latency L_CN of the hybrid architecture remains unchanged at L_CN = 41. Table 2 compares the average decoding throughput of [20] and of the FPHCN at different E_b/N_0. Compared to [20], the FPHCN decoding throughput is increased by a factor ranging from 12.3 to 20.
A comparison of the FPHCN implementation with three state-of-the-art decoders [7,19,20] is presented in Table 3. In order to take into account the transistor size reduction between the original technology node and the 28-nm one, the clock frequency (and thus the decoding throughput) is scaled by the ratio of the two nodes. Note that, in practice, the maximum frequency of an ASIC design is limited. For example, a maximum clock frequency of 1000 MHz is considered in the European Project H2020 EPIC [34] to compare error correcting decoder architectures. Nevertheless, this limitation is not considered in this work. The hardware complexity C is expressed in millions of NAND gates, the output decoding throughput T in Gbits/s (Gbps), and the hardware efficiency E is defined as the ratio E = T/C, in Gbps per million NAND gates.
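The two comparison metrics are simple ratios; the following sketch makes them explicit (function names are ours; the linear node scaling mirrors the paper's stated methodology, and the 0.789 MNAND / 0.9 Gbps pair comes from the abstract):

```python
def scaled_throughput(t_gbps, source_node_nm, target_node_nm=28):
    # Linear frequency/throughput scaling by the technology-node ratio,
    # used to bring older-node results to the 28-nm comparison point.
    return t_gbps * source_node_nm / target_node_nm

def hw_efficiency(t_gbps, c_mega_nand):
    # E = T / C, in Gbps per million NAND gates
    return t_gbps / c_mega_nand

# e.g., the proposed decoder at 30 iterations: T = 0.9 Gbps, C = 0.789 MNAND
print(round(hw_efficiency(0.9, 0.789), 2))  # 1.14
```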
Let us first compare the FPHCN to the architecture proposed in [7]. The 1.22 Gbps throughput reported in [7] is obtained at E_b/N_0 = 5.0 dB, where n_av,it = 11.71. The FPHCN architecture provides a higher throughput starting from E_b/N_0 > 3.3 dB and shows a better hardware efficiency, by a factor ranging from 1.05 to 16.9. Compared to the architecture proposed in [19], the FPHCN architecture with 30 iterations is outperformed in hardware efficiency by a factor of 2. However, [19] considers a higher code rate than the proposed architecture (rate 7/8 against rate 5/6). Moreover, when the average number of iterations is considered, the two solutions become similar in terms of hardware efficiency as the SNR increases. In fact, assuming a single iteration for [19], the efficiency increases from 2.15 (8 decoding iterations) up to 17.2 Gbits/s (single decoding iteration). This result is similar to the 17.7 Gbits/s obtained for the proposed architecture at an SNR of 5 dB (see Table 4).
Finally, comparing the FPHCN with its serial counterpart presented in [20], the FPHCN provides a much higher throughput, with a gain factor varying from 12.3 up to 20 thanks to the high order of parallelism. This significant throughput improvement is reflected in the hardware efficiency, in favor of the FPHCN approach, as shown in Table 3. Even though the area consumption of the serial CN-VN is about 10 times lower than that of the parallel CN-VN, the hardware efficiency of the parallel CN-VN is higher than that of the serial approach presented in [20], as shown in Table 4. It is worth mentioning that the CN-VN of the FPHCN constitutes 48.1% of the total complexity of the decoder. The remaining 51.9% corresponds to the LLR generator blocks, the DM blocks, the parity test blocks, the RAM/ROM blocks and the control unit.

Conclusion and Perspectives
This paper was dedicated to an ultra-high-throughput EMS NB-LDPC decoder implementation based on a Fully Parallel Hybrid Check Node architecture. We particularly focused on a GF(64) (144, 120) code with high rate (CR = 5/6). A number of architectural strategies made the 10 Gbps throughput at high SNR possible, leading to a hardware efficiency gain factor of up to 5 compared to the serial architecture proposed in [20]. Besides the careful optimization of the number of bubbles in each ECN, several original ideas have been presented to optimize the prior hybrid architecture. The main one is to merge the CN and VN processing and to suppress the sorting operation after the CN processing, thanks to the use of a predefined bubble position to obtain the default check-to-variable LLR value for the VN processing. This leads to both a reduced hardware complexity and a reduced memory bandwidth (almost a 50% reduction compared to a binary LDPC code). The two-step generation of the variable-to-check messages, i.e., the selection of the n_m + n messages with smallest LLR, then the extraction of the n_m smallest LLRs with distinct GF values, is also a new contribution.
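The two-step variable-to-check message generation mentioned above can be sketched behaviorally (this is a software model, not the hardware sorter; the function name, the `n_extra` parameter and the toy candidate list are ours):

```python
import heapq

def two_step_vc_messages(candidates, n_m, n_extra):
    """candidates: list of (llr, gf_symbol) pairs, smaller LLR = more reliable.
    Step 1: keep the n_m + n_extra candidates with smallest LLR.
    Step 2: keep the n_m smallest among them with distinct GF values."""
    shortlist = heapq.nsmallest(n_m + n_extra, candidates, key=lambda c: c[0])
    out, seen = [], set()
    for llr, gf in shortlist:
        if gf not in seen:          # suppress redundant GF symbols
            seen.add(gf)
            out.append((llr, gf))
        if len(out) == n_m:
            break
    return out

# The duplicate GF symbol 3 is suppressed, keeping the two most reliable ones:
print(two_step_vc_messages([(0.1, 3), (0.2, 3), (0.3, 5), (0.4, 7)], 2, 2))
# [(0.1, 3), (0.3, 5)]
```

Splitting the selection this way bounds the size of the duplicate-suppression stage to n_m + n_extra entries instead of the full candidate set.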
As a proof of concept, the full design of a small code has been performed. Two codewords are decoded in parallel to avoid idle cycles in the hardware. Simulation results showed that the proposed decoder (partially layered scheduling and n_max,it = 30) outperforms the (864, 720) SP-based binary LDPC decoder used in the 5G standard by 0.3 dB at a FER of 10^-3. Emulation results on FPGA show that the proposed decoder introduces only a 0.08 dB penalty in the waterfall region compared to the reference floating-point BP layered decoder with n_max,it = 8. A drawback of the proposed architecture is the appearance of an error floor around a FER of 10^-7.
There are many possible extensions of this work. The first one is to find a way to mitigate the error floor. A second is to determine the optimal sets of parameters of the hybrid CN-VN architecture in the general case (different code lengths, code rates and Galois Field orders). From such a study, it would be possible to design a flexible parallel hardware architecture able to decode a set of codes with different coding rates and lengths. In terms of hardware, the advantages of the proposed CN-VN unit should be even greater when the code length is high enough to fully perform the layered decoding algorithm.
Funding Open access funding provided by EPFL Lausanne.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.