A Split-Decoding Message Passing Algorithm for Low Density Parity Check Decoders
Abstract
A Split decoding algorithm is proposed which divides each row of the parity check matrix into two or multiple nearly-independent simplified partitions. The proposed method significantly reduces the wire interconnect and decoder complexity and therefore results in fast, small, and highly energy-efficient circuits. Three full-parallel decoder chips for a (2,048, 1,723) LDPC code compliant with the 10GBASE-T standard, using the MinSum normalized, MinSum Split-2, and MinSum Split-4 methods, are designed in 65 nm, seven-metal-layer CMOS. The Split-4 decoder occupies 6.1 mm^{2}, operates at 146 MHz, and delivers 19.9 Gbps throughput with 15 decoding iterations. At 0.79 V, it operates at 47 MHz, delivers 6.4 Gbps, and dissipates 226 mW. Compared to MinSum normalized, the Split-4 decoder chip is 3.3 times smaller, has a clock rate and throughput 2.5 times higher, is 2.5 times more energy efficient, and has an error performance degradation of 0.55 dB with 15 iterations.
Keywords
Low-density parity check (LDPC) · Iterative decoder · Split-Row · CMOS · 65 nm · 10GBASE-T · VLSI
1 Introduction
Low-density parity check (LDPC) codes, first introduced by Gallager [1], have recently received significant attention due to their error correction performance near the Shannon limit and their inherently parallelizable decoder architectures. Many recent communication standards, such as 10 Gigabit Ethernet (10GBASE-T) [2], digital video broadcasting (DVB-S2) [3], and WiMAX (IEEE 802.16e) [4], have adopted LDPC codes. Implementing high throughput and energy efficient LDPC decoders remains a challenge, largely due to the high interconnect complexity and high memory bandwidth requirements of existing decoding algorithms, which stem from the irregular and global communication inherent in the codes.
This paper overviews Split-Row and the more general Multi-Split, two reduced-complexity decoding methods which partition each row of the parity check matrix into two or more nearly-independent simplified partitions. These methods reduce the wire interconnect complexity between row and column processors and simplify the row processors, leading to an overall smaller, faster, and more energy efficient decoder. Full-parallel decoders, which are expensive to build due to their high routing congestion and large circuit area, benefit the most from the Split decoding method. In this paper, we present the first complete overview of the Split decoding algorithm, architecture, and VLSI implementation.
The paper is organized as follows: Section 2 reviews LDPC codes and the message passing algorithm. Section 3 describes LDPC decoder architectures. Sections 4 and 5 introduce the Split-Row and Multi-Split decoding methods, respectively, for regular permutation-based LDPC codes. Error performance comparisons for different codes with the multiple splitting method are shown in Section 6. The mapping architecture of the multiple splitting method is presented in Section 7. Section 8 presents and compares the results of three full-parallel decoders implemented with the proposed and standard decoding techniques.
2 LDPC Codes and Message Passing Decoding Algorithm
LDPC codes are defined by an M×N binary matrix called the parity check matrix H. The number of columns, N, defines the code length. The number of rows, M, defines the number of parity check constraints of the code. The information length K is K = N − M for full-rank matrices; otherwise K = N − rank(H). The column weight W_{c} is the number of ones per column and the row weight W_{r} is the number of ones per row.
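As a quick consistency check on these definitions, the following applies them to the (2,048, 1,723) 10GBASE-T code implemented later in this paper; all numbers are taken from the text, and the rank is implied by K rather than stated directly.

```python
# Basic parameters of the (2,048, 1,723) 10GBASE-T LDPC code used in this
# paper. Note K exceeds N - M, so H must have linearly dependent rows.
N = 2048             # code length (columns of H)
M = 384              # parity check constraints (rows of H)
K = 1723             # information length, as given for this code
rank_H = N - K       # implied rank of H, i.e. 59 of the 384 rows are dependent
W_c, W_r = 6, 32     # column and row weights of this regular code

# Each of the M rows contains W_r ones and each of the N columns contains
# W_c ones, so both counts of the total ones (Tanner graph edges) must agree.
assert M * W_r == N * W_c == 12288
print(rank_H)        # -> 325
```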
LDPC codes are commonly decoded by an iterative message passing algorithm which consists of two sequential operations: row processing or check node update and column processing or variable node update. In row processing, all check nodes receive messages from neighboring variable nodes, perform parity check operations and send the results back to neighboring variable nodes. The variable nodes update soft information associated with the decoded bits using information from check nodes, then send the updates back to the check nodes, and this process continues iteratively.
Sum-Product (SPA) [6] and MinSum (MS) [7] are widely used decoding algorithms, which we refer to as standard decoders in this paper. The following subsections describe these two algorithms in detail.
2.1 Sum Product Algorithm (SPA)
We assume a binary code word (x _{1},x _{2},...,x _{ N }) is transmitted using binary phase-shift keying (BPSK) modulation over an additive white Gaussian noise (AWGN) channel, and the received symbol sequence is (y _{1},y _{2},...,y _{ N }).
λ _{ i } is defined as the information derived from the log-likelihood ratio of the received symbol y _{ i }:$$ \lambda_{i}=\ln\left(\frac{P\big(x_{i}=0\mid y_{i}\big)}{P\big(x_{i}=1\mid y_{i}\big)}\right) $$(1)
α _{ ij } is the message from check node i to variable node j; this is the row processing output.
β _{ ij } is the message from variable node j to check node i; this is the column processing output.
1) Initialization: For each i and j, initialize β_{ ij } to the log-likelihood ratio of the received symbol y_{ j }, which is λ_{ j }. During each iteration, α and β messages are computed and exchanged between variable nodes and check nodes through the graph edges according to the following steps, numbered 2–4.
2) Row processing or check node update: Compute α_{ ij } messages using β messages from all other variable nodes connected to check node C_{ i }, excluding the β information from V_{ j }:$$ \alpha_{ij,SPA} = \prod\limits_{j'\in V(i\,)\backslash j} sign\big(\beta_{ij'}\big) \times \phi\left(\sum\limits_{j'\in V(i\,)\backslash j}\phi\big(\big|\beta_{ij'}\big|\big)\right) \label{eqn:sparow} $$(2)where the nonlinear function is \(\phi(x)=-\log\left(\tanh\frac{x}{2}\right)\). The first product term in Eq. 2 is the parity (sign) bit update and the second term is the reliability (magnitude) update.
 3)Column processing or variable node update: Compute β_{ ij } messages using channel information (λ_{ j }) and incoming α messages from all other check nodes connected to variable node V_{ j }, excluding check node C_{ i }.$$\beta_{ij} = \lambda_j+\!\sum\limits_{i'\in C(j\,)\backslash i} \!\alpha_{i'j} \label{eqn:spacol} $$(3)
 4)Syndrome check and early termination: When column processing is finished, every bit in column j is updated by adding the channel information (λ_{ j }) and α messages from neighboring check nodes.From the updated vector, an estimated code vector \(\hat{X}=\{\hat{x_{1}},\hat{x_{2}},...,\hat{x_{N}}\}\) is calculated by:$$z_{j} = \lambda_{j}+\sum\limits_{i'\in C(j\,)}\alpha_{i'j} \label{eqn:z} $$(4)$$\hat{x_{i}} = \begin{cases} 1, & \mbox{if }z_{i}\le 0 \\ 0, & \mbox{if }z_{i} >0 \end{cases} \label{eqn:decesion} $$(5)
If \(H \cdot \hat{X}^T=0\), then \(\hat{X}\) is a valid code word and therefore the iterative process has converged and decoding stops. Otherwise the decoding repeats from step 2 until a valid code word is obtained or the number of iterations reaches a maximum number, \(\mathit{Imax}\), which terminates the decoding process.
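The four steps above can be sketched as a small algorithm-level model. The following is an illustrative Python implementation, not the paper's hardware architecture; the small (7,4) code in the usage lines is our own choice for demonstration.

```python
import numpy as np

def phi(x):
    # phi(x) = -log(tanh(x/2)); self-inverse for x > 0. Clip to avoid
    # log(0) as x -> 0 and precision loss for very large x.
    x = np.clip(x, 1e-12, 30.0)
    return -np.log(np.tanh(x / 2.0))

def spa_decode(H, llr, max_iter=15):
    """Sum-Product decoding of LLR vector `llr` (lambda) for parity check
    matrix H (M x N, ints). Returns (decoded bits, converged flag)."""
    M, N = H.shape
    rows, cols = np.nonzero(H)            # one entry per Tanner graph edge
    beta = llr[cols].astype(float)        # step 1: beta_ij = lambda_j
    for _ in range(max_iter):
        alpha = np.zeros_like(beta)
        for i in range(M):                # step 2: check node update, Eq. 2
            e = np.where(rows == i)[0]
            s = np.sign(beta[e]); s[s == 0] = 1.0
            mag = phi(np.abs(beta[e]))
            for k, ek in enumerate(e):    # exclude beta_ij itself
                others = np.delete(np.arange(len(e)), k)
                alpha[ek] = np.prod(s[others]) * phi(np.sum(mag[others]))
        z = llr.astype(float).copy()      # step 4: per-bit totals, Eq. 4
        for ek in range(len(cols)):
            z[cols[ek]] += alpha[ek]
        for ek in range(len(cols)):       # step 3: variable node update, Eq. 3
            beta[ek] = z[cols[ek]] - alpha[ek]
        x_hat = (z <= 0).astype(int)      # hard decision, Eq. 5
        if not np.any(H @ x_hat % 2):     # syndrome check: H * x^T = 0
            return x_hat, True
    return x_hat, False

# Usage: a (7,4) code, all-zero codeword with one weakly received bit.
H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])
llr = np.array([2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0])  # bit 2 leans toward '1'
x_hat, converged = spa_decode(H, llr)
print(x_hat, converged)   # -> [0 0 0 0 0 0 0] True
```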
2.2 MinSum Algorithm (MS)
3 LDPC Decoding Architectures
The message passing algorithm is inherently parallel because row processing operations are fully independent with respect to each other, and the same is true for column processing operations.
3.1 Serial Decoders
Serial decoders process one word at a time using one row processor and one column processor. Although they have minimal hardware requirements, they also have large decoding latency and low throughput. A 4,096-bit serial LDPC convolutional decoder [11], implemented on an Altera Stratix FPGA with \(\emph{pfraction}=\big(\tfrac{3}{\text{4,096}+\text{2,048}}\big)=0.00049\), utilizes only 4K logic elements and 776 Kbit of memory, runs at 180 MHz, and delivers 9 Mbps throughput.
3.2 Partial-Parallel Decoders
Partial-parallel decoders [12, 13, 14, 15, 16, 17, 18] contain multiple processing units and shared memories. A major challenge is efficiently handling simultaneous accesses to the shared memories. The following are details of ten partial-parallel decoders containing 3–2,112 processors, with pfraction ranging from 0.001 to 0.87.
Two 2,048-bit partial-parallel decoders compliant with the 10GBASE-T standard are designed with high parallelism. The first is a 47 Gbps decoder chip with 2,048 column processors and 64 row processors; it has a \(\emph{pfraction}=\big(\tfrac{\text{2,048}+64}{\text{2,048}+384}\big)=0.87\) and occupies 5.35 mm^{2} in 65 nm CMOS. The second is designed using a reduced routing complexity decoding method called Sliced Message-Passing; it utilizes 512 column processors and 384 row processors, has a \(\emph{pfraction}=\big(\tfrac{512+384}{\text{2,048}+384}\big)=0.37\), occupies 14.5 mm^{2}, and delivers 5.3 Gbps in 90 nm.
A multi-rate 2,048-bit programmable partial-parallel decoder chip [19] has a \(\emph{pfraction}=\big(\tfrac{64}{\text{2,048}+\text{1,024}}\big)=0.02\), utilizes about 50 Kbit of memory, occupies 14.3 mm^{2}, and delivers 640 Mbps in 0.18 \(\upmu\)m technology. An FPGA implementation of an 8,176-bit decoder [20] has a \(\emph{pfraction}=\big(\tfrac{36}{\text{8,176}+\text{1,024}}\big)=0.004\) and achieves source decoding at 172 Mbps. A 1,536-bit memory-bank-based decoder [13] utilizes about 540 Kbit of memory and has \(\emph{pfraction}=\big(\tfrac{3}{\text{1,536}+768}\big)=0.001\); a Virtex-II FPGA implementation of this decoder runs at 125 MHz and delivers 98.7 Mbps. A 64,800-bit DVB-S2 decoder chip in 65 nm CMOS utilizes 180 processors and 3.1 Mb of memory, attains a throughput of 135 Mbps [21], occupies 6.07 mm^{2}, and handles 21 different codes, so its \(\emph{pfraction}\) ranges from 0.01 to 0.001. A 600-bit LDPC-COFDM chip [22] employs 50 row processors and 150 column processors, has \(\emph{pfraction}=\big(\tfrac{200}{600+450}\big)=0.19\), delivers 480 Mbps, and occupies 21.45 mm^{2} in 0.18 \(\upmu\)m CMOS. A 6,912-bit decoder [23] implemented on a Virtex-4 FPGA utilizes 64 processors with 46 BlockRAMs, runs at 181 MHz, achieves 3.86–4.31 Gbps throughput, and has a \(\emph{pfraction}\) from 0.005 to 0.007. A 32-processor, 32-memory decoder [24] supports both IEEE 802.11n and IEEE 802.16e codes, occupies 3.88 mm^{2}, delivers 31.2–64.4 Mbps in 0.13 \(\upmu\)m technology, and has a \(\emph{pfraction}\) of 0.003–0.1. A multi-rate, multi-length decoder [25] has 18 processors and a \(\emph{pfraction}\) of 0.005–0.01; it runs at 100 MHz and delivers 60 Mbps on a Virtex-II FPGA.
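From the examples above, the \(\emph{pfraction}\) metric used throughout this section is the number of processing units divided by the total number of Tanner graph nodes, N + M. A minimal sketch reproducing two of the quoted values:

```python
# pfraction: processing units divided by total Tanner graph nodes (N + M),
# as implied by the worked examples in this section.
def pfraction(num_processors, n, m):
    return num_processors / (n + m)

# Serial convolutional decoder [11]: 3 processors, N = 4096, M = 2048.
serial = pfraction(3, 4096, 2048)
# 47 Gbps 10GBASE-T decoder: 2048 column + 64 row processors, N = 2048, M = 384.
high_par = pfraction(2048 + 64, 2048, 384)
print(round(serial, 5), round(high_par, 2))   # -> 0.00049 0.87
```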
3.3 Full-Parallel Decoders
Full-parallel decoders directly map each node of the Tanner graph to a different processing unit, so \(\mathit{pfraction}=1\). They provide the highest throughputs and require no memory to store intermediate messages. The greatest challenges in their implementation are large circuit area and routing congestion, caused by the large number of processing units and the very large number of wires between them.
A 1,024-bit full-parallel decoder chip [26] occupies 52.5 mm^{2}, runs at 64 MHz, and delivers 1 Gbps throughput in 0.16 \(\upmu\)m technology. Two full-parallel decoders designed for 1,536-bit and 2,048-bit LDPC codes [27, 28] occupy 16.8 and 43.9 mm^{2} and deliver 5.4 and 7.1 Gbps, respectively, in 0.18 \(\upmu\)m technology. A 660-bit decoder chip [29] occupies 9 mm^{2}, runs at 300 MHz, and attains 3.3 Gbps throughput in 0.13 \(\upmu\)m technology. A full-parallel decoder designed for a family of codes with different rates and code lengths up to 1,024 bits attains 2.4 Gbps decoding throughput [30].
Previous studies aimed at reducing wire interconnect complexity are based on reformulating the message passing algorithm. The SPA decoder can be reformulated so that, instead of sending distinct α values, each check node sends only the summation value in Eq. 2 to its variable nodes; the α messages are then recovered by post-processing in the variable nodes. This results in a 26% reduction in total global wire length [31]. A further reformulation has both check nodes and variable nodes send the summation values of Eqs. 2 and 3, respectively, to each other [32]. MinSum has been reformulated so that each check node sends only the minimum values to its variable nodes, which yields a 90% reduction in outgoing wires from check nodes [33]. These architectures, however, require more processing in the row and column processors along with additional storage units to recover the α and β messages, and therefore unfortunately result in larger decoder areas.
4 Proposed Split-Row Decoding Method
The Split-Row decoding method is proposed to facilitate hardware implementations with high throughput, high hardware efficiency, and high energy efficiency.
This architecture has two major benefits: 1) it decreases the number of inputs and outputs per row processor, resulting in many fewer wires between row and column processors, and 2) it makes each row processor much simpler because its outputs are functions of fewer inputs. These two factors make the decoder smaller, faster, and more energy efficient. In the following subsections, we show that Split-Row introduces some error into the magnitude calculation of the row processing outputs, and that this error can be largely compensated with a correction factor.
4.1 SPA Split
1. α _{ ij,SPA-Split } and α _{ ij,SPA } have the same sign, and
2. |α _{ ij,SPA-Split }| ≥ |α _{ ij,SPA }|.
4.2 MinSum Split
5 Multi-Split Decoding Method
6 Correction Factor and Error Performance Simulation Results
6.1 Split-Row Correction Factors
Finding the correction factor for the Split-Row algorithm that gives the best error performance requires complex analysis such as density evolution [34]. For simplicity, and to account for realistic hardware effects, the correction factors presented in this paper are determined empirically from bit error rate (BER) results over various SNR values and numbers of decoding iterations.
As the number of partitions increases, a smaller correction factor should be used to normalize the error magnitude of the row processing outputs in each partition. For SPA Multi-Split, as the number of partitions increases, the summation on the left side of Eq. 10 decreases in each partition; since ϕ(x) is a decreasing function, the summation on the left side of Eq. 11 becomes larger, which results in larger-magnitude row processing outputs in each partition. For MS Multi-Split, except for the partition that contains the global minimum, the difference between the local minima in most other partitions and the global minimum grows as the number of partitions increases. Thus, the average row processor output magnitude grows with the number of partitions, and a smaller correction factor is required to normalize the row processing outputs in each partition.
Achieving the absolute minimum error performance would require a different correction factor for each row processor output, but this is impractical because it would require unavailable information such as the row processor inputs in other partitions. Since significant benefit comes from minimizing communication between partitions, we assume a constant correction factor for all row processing outputs. This is the primary cause of the error performance loss and slower convergence rate of Split-Row.
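Since the detailed Multi-Split equations of Sections 4 and 5 are not reproduced in this excerpt, the following is only a sketch of a MinSum Multi-Split check node (row) update as described above: each partition computes magnitudes from its own local minimums, only the sign product crosses partition boundaries, and a single correction factor S scales every output. The function name and interface are our own.

```python
import numpy as np

def minsum_split_row(beta, spn, s):
    """Illustrative Multi-Split MinSum update for one check node (row).
    beta: incoming messages for one row (length W_r); spn: number of
    partitions (Spn); s: correction factor S. Only the overall sign
    product is shared between partitions (the sign wires); magnitudes
    come from each partition's local minimum, hence the need for S."""
    w = len(beta)
    assert w % spn == 0 and w // spn >= 2
    part = w // spn
    sgn = np.sign(beta)
    sgn[sgn == 0] = 1.0
    total_sign = sgn.prod()               # shared across partitions
    alpha = np.empty(w)
    for p in range(spn):
        local = np.abs(beta[p*part:(p+1)*part])
        for k in range(part):
            j = p*part + k
            others = np.delete(local, k)
            # Sign excludes beta_ij (multiplying by sgn[j] removes it since
            # sgn[j]^2 = 1); magnitude is the minimum within this partition.
            alpha[j] = s * total_sign * sgn[j] * others.min()
    return alpha

# Usage: a toy row of weight 4 split into 2 partitions, with S = 0.5.
print(minsum_split_row(np.array([0.9, -0.2, 0.4, 0.3]), spn=2, s=0.5))
# -> [-0.1   0.45 -0.15 -0.2 ]
```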
Since the error performance improvement from using multiple correction factors for different SNR values is small (≤0.07 dB), we use the average value as the correction factor for the error performance simulations in this paper.
Average optimal correction factor S for different constructed regular codes.
(N,K)  (W _{ c },W _{ r })  Average optimal correction factor S  

SP2  SP4  SP6  SP8  SP12  
(1536, 770)  (3, 6)  0.45  +  −  +  + 
(1008, 507)  (4, 8)  0.35  −  +  −  + 
(1536, 1155)  (4, 16)  0.4  0.25  +  −  + 
(8088, 6743)  (4, 24)  0.4  0.27  0.22  −  − 
(2048, 1723)  (6, 32)  0.3  0.19  +  0.15  + 
(16352, 14329)  (6, 32)  0.4  0.25  +  0.17  + 
(8176, 7156)  (4, 32)  0.4  0.24  +  0.17  + 
(5248, 4842)  (5, 64)  0.35  0.25  +  0.2  + 
(5256, 4823)  (6, 72)  0.35  0.2  0.18  0.15  0.14 
6.2 Error Performance Results
All simulations assume an additive white Gaussian noise channel with BPSK modulation. The BER results presented here were generated from simulation runs with more than 100 error blocks each, with a maximum of 15 iterations (\(\mathit{Imax}=15\)); decoding terminated early when a zero syndrome was detected for the decoded codeword.
7 Full-Parallel MinSum Multi-Split Decoders
8 Decoder Implementation Example and Results
To precisely quantify the benefits of the Split-Row and Multi-Split algorithms when built into hardware, we have implemented three MinSum full-parallel decoders for the (2,048, 1,723) 10GBASE-T code using the MinSum normalized, Split-2, and Split-4 methods. The decoders were developed in Verilog, synthesized with Synopsys Design Compiler, and placed and routed with Cadence SOC Encounter. All designs were created in ST Microelectronics' 65 nm, 1.3 V, low-leakage, seven-metal-layer CMOS process.
Summary of the key parameters of the implemented (6,32) (2,048, 1,723) 10GBASE-T LDPC code.
Code length, No. of columns (N)  2,048 
Information length (K)  1,723 
Parity check equations, No. of rows (M)  384 
Row weight (W _{ r })  32 
Column weight (W _{ c })  6 
Size of permutations  64 
8.1 Effects of FixedPoint Number Representation
Although there have been several studies of quantization effects in LDPC decoders [40, 10], as a baseline overview of the effect of word length on a decoder's datapath we uniformly change the word widths of the λ, α, and β messages. For a fixed-point datapath width of q bits, the majority of the decoder's hardware complexity can be roughly estimated from the wires going to and from the column and row processors. The M row processors require M×W _{ r } word busses to pass α messages, while the N column processors require N×W _{ c } busses to pass β messages. Therefore, the total number of global communication wires is q×(M×W _{ r } + N×W _{ c }). Increasing the word width of the datapath from a 5-bit to a 6-bit fixed-point representation (4.1 and 4.2 formats, respectively) increases the number of global wires by M×W _{ r } + N×W _{ c }. However, the complexity cost of additional wires is not simply linear: when designed in a chip, every additional wire results in a superlinear increase in circuit area and delay [26].
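The wire-count arithmetic above, evaluated for the (2,048, 1,723) code:

```python
# Global wire-count estimate from the text: q * (M*W_r + N*W_c),
# evaluated for the (2,048, 1,723) code.
M, W_r, N, W_c = 384, 32, 2048, 6
edges = M * W_r + N * W_c      # alpha word busses plus beta word busses
print(edges)                   # -> 24576
print(5 * edges)               # -> 122880 global wires at 5 bits (4.1 format)
print(6 * edges - 5 * edges)   # -> 24576 extra wires for one additional bit
```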
On the other hand, using wider fixed-point words improves the error performance. BER simulations show an approximate 0.07–0.09 dB improvement in all three decoders when using 6-bit words (4.2) instead of 5-bit words (4.1). To achieve this improvement with one additional bit, the number of wires in MinSum normalized increases by M ×W _{ r } + N ×W _{ c }, but for Multi-Split the increase is only \(M \times W_r + ( N/\mathit{Spn} ) \times W_c\) per block. Synthesis results for a 6-bit implementation of Split-2 and Split-4 show that the row and column processors have 12% and 8% area increases, respectively, with no reduction in clock rate, compared to a 5-bit implementation under the same constraints. Thus, the error performance loss of the Split-2 and Split-4 decoders can be reduced by using a larger fixed-point word at a small area penalty.
8.2 Area, Throughput and Power Comparison
Comparison of the three full-parallel decoders implemented in 65 nm CMOS for a (6, 32) (2,048, 1,723) code.
MinSum normalized  Split-2 MinSum  Split-4 MinSum

CMOS fabrication process  65 nm CMOS, 1.3 V  
Area utilization (%)  38%  50%  85% 
Average wire length \(({\upmu}{m})\)  175.2  115.5  73.8 
Area per subblock (mm^{2})  20  6.9  1.5 
Total layout area (mm^{2})  20  13.8  6.1 
% area for row processors  13.2%  19.2%  41.3% 
% area for column processors  8.0%  11.6%  26.0% 
% area for registers and clock tree  16.8%  19.2%  17.7% 
% area without standard cells  62.0%  50.0%  15.0% 
Maximum clock rate (MHz)  59  110  146 
Power dissipation (mW)  1,941  2,179  1,889 
Throughput @\(\mathit{Imax}=15\) (Gbps)  8.1  15.0  19.9 
Energy per bit @\(\mathit{Imax}=15\) (pJ/bit)  241  145  95 
Average iterations @ BER = 3×10^{ − 5}, \(\mathit{Imax}=15~(\mathit{Iavg}\))  3.8  4.8  4.9 
Throughput @\(\mathit{Iavg}\) (Gbps)  31.8  46.9  61.0 
Energy per bit @\(\mathit{Iavg}\) (pJ/bit)  61  46  31 
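The throughput and energy-per-bit rows of the table follow from simple arithmetic, assuming one decoding iteration per clock cycle (an assumption on our part that is consistent with the reported numbers), shown here for the Split-4 column:

```python
# Reproducing the Split-4 column of the table, assuming one clock cycle
# per decoding iteration.
N = 2048                  # bits per decoded frame
f = 146e6                 # clock rate (Hz)
P = 1.889                 # power dissipation (W)
tput_worst = N * f / 15   # Imax = 15 iterations
tput_avg = N * f / 4.9    # Iavg = 4.9 at BER = 3e-5
print(round(tput_worst / 1e9, 1))    # -> 19.9  (Gbps)
print(round(tput_avg / 1e9, 1))      # -> 61.0  (Gbps)
print(round(P / tput_worst * 1e12))  # -> 95    (pJ/bit)
```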
To achieve a fair comparison between all three architectures, a common CAD tool design flow was adopted. The synthesis, floorplan, and place and route stages of the layout were automated with minimal designer intervention.
Since Split-Row reduces row processor area and eliminates significant communication between row and column processors (allowing them to operate as smaller, nearly-independent groups), the layout becomes much more compact and automatic place and route tools can converge to a better solution in a much shorter time.
As shown in Table 3, Split-4 achieves high area utilization (the ratio of standard cell area to total chip area) and a short average wire length compared to the MinSum normalized decoder, whose many global row and column processor interconnections force the place and route tool to spread standard cells apart to provide sufficient space for routing.
As an additional illustration, Table 3 provides a breakdown of the basic contributors of layout area, which shows the dramatic decrease in % area without standard cells (i.e., chip area with only wires) with an increased level of splitting.
The critical path delay in Split-4 is about 2.3 times shorter than that of MinSum normalized. Place and route timing analysis and extracted delay/parasitic annotation files (i.e., SDF) show that the critical path is composed primarily of a long series of buffers and wire segments; some buffers have long RC delays due to the large fanouts of their outputs. For the MinSum decoder, the sum of the interconnect delays caused by buffers and wires (intrinsic gate delay and RC delay) is 13.1 ns. In Split-2 and Split-4, the total interconnect delays are 5.1 ns and 6.2 ns, respectively, which are 2.6 and 2.1 times smaller than that of MinSum. Thus, Split-4's speedup over MinSum normalized is due in part to its simplified row processing, but the major contributor is the significant reduction in column/row processor interconnect delay.
To summarize Split-Row's benefits: the Split-4 decoder occupies 6.1 mm^{2}, which is 3.3 times smaller than MinSum normalized; it runs at 146 MHz and, with 15 iterations, attains 19.9 Gbps decoding throughput, which is 2.5 times higher; and it dissipates 95 pJ/bit, a factor of 2.5 lower than MinSum normalized.
Although it is not possible to exactly quantify the benefit of chip area reductions, chip silicon area is a critical parameter in determining chip costs. For example, reducing die area by a factor of 2 results in a die cost reduction of more than two times when considering the cost of the wafer and die yield [41]. Other chip production costs such as packaging and testing are also significantly reduced with smaller chip area.
At a supply voltage of 0.79 V, the Split-4 decoder runs at 47 MHz and achieves the minimum 6.4 Gbps throughput required by the 10GBASE-T standard [2]. Power dissipation is 226 mW at this operating point. These estimates are based on measured data from a chip that was recently fabricated in the same process and operates correctly down to 0.675 V [42].
8.3 Wire Statistics
The total number of sign-passing wires between sub-blocks in the Multi-Split methods is 2(Spn − 1)M. For these decoders, where M = 384, the sign wires in Split-2 are only 0.12% of the total number of wires, and in Split-4 they are only 0.30% of the total.
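The sign-wire count formula, evaluated for these decoders:

```python
# Sign-passing wires between sub-blocks: 2 * (Spn - 1) * M, i.e. sign
# information crossing each of the Spn - 1 partition boundaries of every
# row, in both directions.
M = 384
for spn in (2, 4):
    print(spn, 2 * (spn - 1) * M)
# -> 2 768
# -> 4 2304
```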
The source of Multi-Split's benefits is now clear: the method breaks row processors into multiple blocks whose internal wires are all relatively short. These blocks are interconnected by a small number of sign wires. The result is denser, faster, and more energy efficient circuits.
8.4 Analysis of Maximum and Average Numbers of Decoding Iterations
The maximum number of decoding iterations strongly affects the best case error performance, the maximum achievable decoder throughput, and the worst case energy consumption. Fortunately, the majority of frames require only a few decoding iterations to converge (especially at high SNRs). By detecting decoder convergence early, throughput and energy can improve significantly while maintaining the same error performance. Early convergence detection is done by a syndrome check circuit [14, 43] which checks the decoded bits every cycle (see Fig. 11) and terminates the decoding process when convergence is detected. Decoding of a new frame can then begin if one is available.
Postlayout results show that the syndrome check block for a (2,048, 1,723) code occupies only approximately 0.1 mm^{2} and its maximum delay is 2 ns. By adding a pipeline stage for the syndrome check, the block’s delay does not add at all to the critical path delay of the decoder.
It is interesting to compare decoders at the same BER. From Fig. 18, Split-2 at \(\mathit{Imax}=20\) and MinSum normalized at \(\mathit{Imax}=5\) have nearly the same BER, but the Split-2 implementation has 1.2 to 1.3 times higher throughput while consuming 1.1 times less energy for SNR values above 4.1 dB. Similarly, Split-4 at \(\mathit{Imax}=15\) and MinSum normalized at \(\mathit{Imax}=3\) have nearly equal BER, but Split-4 has 1.1 to 1.3 times greater throughput and 1.1 to 1.4 times lower energy dissipation for SNR values above 4.1 dB.
In summary, with the same maximum number of decoding iterations (\(\mathit{Imax}\)) and at the same BER, the average numbers of decoding iterations (\(\mathit{Iavg}\)) of Split-2 and Split-4 are larger than that of MinSum normalized, but they still achieve higher throughput and energy efficiency at high SNR values. The maximum number of decoding iterations for MinSum normalized can be lowered until it matches the BER of Split-2 and Split-4; even then, Split-2 and Split-4 have higher throughput and energy efficiency for most SNR values. In addition, Split-2 and Split-4 require 1.4 times and 3.3 times smaller circuit area, respectively, than the MinSum normalized decoder.
8.5 Comparison with Other Chips
Comparison of the Split-4 decoder with published full-parallel LDPC decoder chips.
The (3.25, 6.5) (1,024, 512) full-parallel decoder by Blanksby [26] (average row and column weights are given) uses a hierarchy-based, hand-placed design flow for routing and timing optimization. The Bit-Serial (4, 15) (660, 480) full-parallel decoder by Darabiha [29] uses serial transfer of messages between the processing units to reduce routing congestion.
Although we compare only full-parallel decoders with each other, it is still challenging to compare these decoders fairly, since they implement different LDPC codes (code length, row weight, and column weight), different rates, and different CMOS technologies. Basic metrics such as throughput, energy, and circuit area are unfortunately complex functions of these parameters.
Table 4 gives the number of edges in the LDPC code for each decoder. This value is a fraction of the number of global wires in a full-parallel decoder and therefore gives a good first-order estimate of a full-parallel decoder's complexity. Using this metric, the estimated complexity of the Split-4 decoder's code is 4.6 and 5.8 times higher than those of the other decoders' codes.
A rough area comparison can be made by normalizing the total chip area of each decoder linearly by 61,440/(No. of edges in LDPC code) and quadratically with feature size, i.e., (65 nm/min. feature size)^{2}. Scaling results in 40 mm^{2} for the (1,024, 512) full-parallel decoder [26] and 13.1 mm^{2} for the (660, 480) full-parallel decoder [29]. It is important to note that scaling linearly with the total row/column processor input bits favors simpler codes, since decoder circuit area grows faster than this factor due to the limited routing resources in VLSI implementations. Nevertheless, the Split-4 decoder is 6.2 and 2 times smaller than the scaled (1,024, 512) and (660, 480) full-parallel decoders, respectively.
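The scaling arithmetic can be reproduced as follows. The 13,312 and 10,560 input-bit values for the two comparison chips are assumptions on our part (each code's edge count times a 4-bit message width), chosen to be consistent with the 4.6 and 5.8 complexity ratios quoted above; 61,440 is this paper's 12,288 edges times 5 bits.

```python
# Normalized area comparison sketch: scale linearly by code complexity
# (total row/col processor input bits) and quadratically by feature size.
# The 13,312 and 10,560 input-bit values are assumed, not stated in the text.
def scaled_area(area_mm2, input_bits, feature_nm):
    return area_mm2 * (61440 / input_bits) * (65 / feature_nm) ** 2

print(round(scaled_area(52.5, 13312, 160), 1))  # (1,024, 512) chip [26] -> 40.0
print(round(scaled_area(9.0, 10560, 130), 1))   # (660, 480) chip [29]   -> 13.1
```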
For energy comparisons, all decoders are scaled to 65 nm and a 1.3 V supply voltage. Scaling linearly with feature size and quadratically with supply voltage gives 210.5 pJ/bit for the (1,024, 512) full-parallel decoder [26] and 213.5 pJ/bit for the (660, 480) full-parallel decoder [29]. The Split-4 decoder, with its more complex code, operates with an energy per bit that is 2.2 times lower, at a cost of 0.55 dB error performance degradation compared to the other two decoders.
In addition, with early termination enabled, Split-4 delivers 61 Gbps throughput and dissipates 31 pJ/bit at SNR = 4.4 dB (see Table 3). Compared to the state-of-the-art 47.7 Gbps, 58.7 pJ/bit partial-parallel 10GBASE-T decoder [17], built in 65 nm at 1.2 V, Split-4 has 1.3 times higher throughput, is 1.9 times more energy efficient, and is 1.1 times larger, with an error performance degradation of 0.60 dB.
9 Conclusion
The proposed Split-Row and Multi-Split algorithms are viable approaches for high throughput, small area, and low power LDPC decoders, with a small error performance degradation that is acceptable for many applications, especially in mobile designs that typically have severe power and cost constraints. The method is especially well suited to long-length regular codes and codes with high row weights. Compared to standard (MinSum and SPA) decoding, the error performance loss of the method is about 0.35–0.65 dB for the implemented (2,048, 1,723) code, depending on the level of splitting.
The proposed algorithm and architecture break row processors into multiple blocks whose internal wires are all relatively short. These blocks are interconnected by a small number of sign wires whose lengths are almost zero. The result is decoders with denser, faster, and more energy efficient circuits.
We have demonstrated the significant benefits of the splitting methods by implementing three decoders using MinSum normalized, MinSum Split-2, and MinSum Split-4 for the 2,048-bit code used in the 10GBASE-T 10 Gigabit Ethernet standard. Post-layout simulation results show that the Split-4 decoder is 3.3 times smaller, attains 2.5 times higher throughput, and dissipates 2.5 times less energy per bit than a MinSum normalized decoder, while performing 0.55 dB away from MinSum normalized at BER = 5×10^{ − 8} with 15 decoding iterations.
Using early termination circuits, the average number of decoding iterations in the Split-4 decoder is about 1.3 times larger than that of the MinSum normalized decoder. With early termination enabled, the Split-4 decoder's throughput is 1.9 times higher and its energy dissipation per bit is 2.0 times lower than the MinSum decoder's at BER = 3×10^{ − 5}.
Increasing the number of decoding iterations and widening the fixed-point words reduce the error performance loss of the Split-2 and Split-4 decoders. With a maximum of 20 decoding iterations, the error performance loss of the Split-2 decoder is reduced to 0.25 dB compared to MinSum normalized, while it still achieves higher throughput and occupies smaller circuit area.
Notes
Acknowledgements
The authors gratefully acknowledge support from ST Microelectronics, Intel, UC Micro, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grant 1598 and CSR Grant 1659, Intellasys, Texas Instruments, IBM, SEM, and a UCD Faculty Research Grant; LDPC codes and assistance from Shu Lin and Lan Lan; and thank Zhengya Zhang, Dean Truong, Aaron Stillmaker, Lucas Stillmaker, Jean-Pierre Schoellkopf, Patrick Cogez, and Pascal Urard.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
 1. Gallager, R. G. (1962). Low-density parity check codes. IRE Transactions on Information Theory, IT-8, 21–28.
 2. IEEE P802.3an, 10GBASE-T task force. http://www.ieee802.org/3/an.
 3. T.T.S.I. digital video broadcasting (DVB) second generation framing structure for broadband satellite applications. http://www.dvb.org.
 4. IEEE 802.16e (2005). Air interface for fixed and mobile broadband wireless access systems. IEEE P802.16e/D12 draft.
 5. Tanner, R. M. (1981). A recursive approach to low complexity codes. IEEE Transactions on Information Theory, 27, 533–547.
 6. MacKay, D. J. (1999). Good error correcting codes based on very sparse matrices. IEEE Transactions on Information Theory, 45, 399–431.
 7. Fossorier, M., Mihaljevic, M., & Imai, H. (1999). Reduced complexity iterative decoding of low-density parity check codes based on belief propagation. IEEE Transactions on Communications, 47, 673–680.
 8. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of block and convolutional codes. IEEE Transactions on Information Theory, 42, 429–445.
 9. Chen, J., & Fossorier, M. (2002). Near optimum universal belief propagation based decoding of low-density parity check codes. IEEE Transactions on Communications, 50, 406–414.
10. Chen, J., Dholakia, A., Eleftheriou, E., & Fossorier, M. (2005). Reduced-complexity decoding of LDPC codes. IEEE Transactions on Communications, 53, 1288–1299.
11. Bates, S., Chen, Z., et al. (2008). A low-cost serial decoder architecture for low-density parity-check convolutional codes. IEEE Transactions on Circuits and Systems I, 55, 1967–1976.
12. Yang, L., Liu, H., & Shi, R. (2006). Code construction and FPGA implementation of a low-error-floor multi-rate low-density parity-check decoder. IEEE Transactions on Circuits and Systems I, 53, 892.
13. Dai, Y., Chen, N., & Yan, Z. (2008). Memory efficient decoder architectures for quasi-cyclic LDPC codes. IEEE Transactions on Circuits and Systems I, 55, 2898–2911.
14. Shih, X., Zhan, C., Lin, C., & Wu, A. (2008). An 8.29 mm^{2} 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 μm CMOS process. JSSC, 43, 672–683.
15. Liu, C. H., et al. (2008). An LDPC decoder chip based on self-routing network for IEEE 802.16e applications. JSSC, 43, 684–694.
16. Liu, L., & Shi, R. (2008). Sliced message passing: High throughput overlapped decoding of high-rate low-density parity-check codes. IEEE Transactions on Circuits and Systems I, 55, 3697–3710.
17. Zhang, Z., Dolecek, L., et al. (2009). A 47 Gb/s LDPC decoder with improved low error rate performance. In Symposium on VLSI Circuits (pp. 22–23).
18. Wang, Z., Li, L., et al. (2009). Efficient shuffle network architecture and application for WiMAX LDPC decoders. IEEE Transactions on Circuits and Systems II: Express Briefs, 56, 215–219.
19. Mansour, M., & Shanbhag, N. R. (2006). A 640-Mb/s 2048-bit programmable LDPC decoder chip. JSSC, 41, 684–698.
20. Wang, Z., & Cui, Z. (2007). Low-complexity high-speed decoder design for quasi-cyclic LDPC codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15, 104–114.
21. Urard, P., Paumier, L., et al. (2008). A 360mW 105Mb/s DVB-S2 compliant codec based on 64800b LDPC and BCH codes enabling satellite-transmission portable devices. In ISSCC (pp. 310–311).
22. Liu, H., Lin, C., et al. (2005). A 480 Mb/s LDPC-COFDM-based UWB baseband transceiver. In ISSCC (Vol. 1, pp. 444–445).
23. Fewer, C., Flanagan, F., & Fagan, A. (2007). A versatile variable rate LDPC codec architecture. IEEE Transactions on Circuits and Systems I, 54, 2240–2251.
24. Masera, G., Quaglio, F., & Vacca, F. (2007). Implementation of a flexible LDPC decoder. IEEE Transactions on Circuits and Systems I, 54, 542–546.
25. Zhang, H., Zhu, J., Shi, H., & Wang, D. (2008). Layered approx-regular LDPC code construction and encoder/decoder design. IEEE Transactions on Circuits and Systems I, 55, 572–585.
26. Blanksby, A., & Howland, C. J. (2002). A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. JSSC, 37(3), 404–412.
27. Mohsenin, T., & Baas, B. (2006). Split-Row: A reduced complexity, high throughput LDPC decoder architecture. In ICCD (pp. 13–16).
28. Mohsenin, T., & Baas, B. (2007). High-throughput LDPC decoders using a multiple Split-Row method. In ICASSP (Vol. 2, pp. 13–16).
29. Darabiha, A., Carusone, A. C., & Kschischang, F. R. (2007). A 3.3-Gbps bit-serial block-interlaced Min-Sum LDPC decoder in 0.13-μm CMOS. In IEEE Custom Integrated Circuits Conference (pp. 459–462).
30. Kim, E., Jayakumar, N., Bhagwat, P., & Khatri, S. P. (2006). A high-speed fully-programmable VLSI decoder for regular LDPC codes. In International Conference on Acoustics, Speech, and Signal Processing (Vol. 3, pp. 972–975).
31. Darabiha, A., Carusone, A. C., & Kschischang, F. R. (2008). Block-interlaced LDPC decoders with reduced interconnect complexity. IEEE Transactions on Circuits and Systems II: Express Briefs, 55, 74–78.
32. Kang, S., & Park, I. (2006). Loosely coupled memory-based decoding architecture for low density parity check codes. IEEE Transactions on Circuits and Systems I, 53, 1045–1056.
33. Cui, Z., & Wang, Z. (2007). Efficient message passing architecture for high throughput LDPC decoder. In ISCAS (pp. 917–920).
34. Richardson, T., & Urbanke, R. (2001). The capacity of low-density parity check codes under message-passing decoding. IEEE Transactions on Information Theory, 47, 599–618.
35. Djurdjevic, I., Xu, J., Abdel-Ghaffar, K., & Lin, S. (2003). A class of low-density parity-check codes constructed based on Reed–Solomon codes with two information symbols. IEEE Communications Letters, 7, 317–319.
36. Chen, L., Xu, J., Djurdjevic, I., & Lin, S. (2004). Near-Shannon-limit quasi-cyclic low-density parity-check codes. IEEE Transactions on Communications, 52, 1038–1042.
37. Kou, Y., Lin, S., & Fossorier, M. P. C. (2001). Low-density parity-check codes based on finite geometries: A rediscovery and new results. IEEE Transactions on Information Theory, 47(7), 2711–2736.
38. Zhang, J., & Fossorier, M. P. C. (2004). A modified weighted bit-flipping decoding of low-density parity-check codes. IEEE Communications Letters, 8, 165–167.
39. Gunnam, K. K., et al. (2006). Decoding of quasi-cyclic LDPC codes using an on-the-fly computation. In 40th Asilomar Conference on Signals, Systems and Computers (pp. 1192–1199).
40. Zhang, Z., Venkat, A., et al. (2007). Quantization effects in low-density parity-check decoders. In ICC (pp. 6231–6237).
41. Rabaey, J., Chandrakasan, A., & Nikolic, B. (2003). Digital integrated circuits (2nd ed.). Upper Saddle River: Prentice Hall.
42. Truong, D. N., Cheng, W. H., et al. (2009). A 167-processor computational platform in 65 nm CMOS. IEEE Journal of Solid-State Circuits (JSSC), 44(4), 1130–1144.
43. Darabiha, A., Carusone, A. C., & Kschischang, F. R. (2008). Power reduction techniques for LDPC decoders. JSSC, 43, 1835–1845.