A Split-Decoding Message Passing Algorithm for Low Density Parity Check Decoders
Abstract
A Split decoding algorithm is proposed which divides each row of the parity check matrix into two or more nearly-independent simplified partitions. The proposed method significantly reduces wire interconnect and decoder complexity, and therefore results in fast, small, and highly energy-efficient circuits. Three full-parallel decoder chips for a (2,048, 1,723) LDPC code compliant with the 10GBASE-T standard, using MinSum normalized, MinSum Split-2, and MinSum Split-4 methods, are designed in 65 nm, seven-metal-layer CMOS. The Split-4 decoder occupies 6.1 mm^{2}, operates at 146 MHz, and delivers 19.9 Gbps throughput with 15 decoding iterations. At 0.79 V, it operates at 47 MHz, delivers 6.4 Gbps, and dissipates 226 mW. Compared to MinSum normalized, the Split-4 decoder chip is 3.3 times smaller, has a 2.5 times higher clock rate and throughput, is 2.5 times more energy efficient, and has an error performance degradation of 0.55 dB with 15 iterations.
Keywords
Low density parity check (LDPC), Iterative decoder, Split-Row, CMOS, 65 nm, 10GBASE-T, VLSI

1 Introduction
Low density parity check (LDPC) codes first introduced by Gallager [1] have recently received significant attention due to their error correction performance near the Shannon limit and their inherently parallelizable decoder architectures. Many recent communication standards such as 10 Gigabit Ethernet (10GBASE-T) [2], digital video broadcasting (DVB-S2) [3], and WiMAX (IEEE 802.16e) [4] have adopted LDPC codes. Implementing high throughput and energy efficient LDPC decoders remains a challenge largely due to the high interconnect complexity and high memory bandwidth requirements of existing decoding algorithms stemming from the irregular and global communication inherent in the codes.
This paper overviews Split-Row and the more general Multi-Split, two reduced-complexity decoding methods which partition each row of the parity check matrix into two or more nearly-independent simplified partitions. These methods reduce the wire interconnect complexity between row and column processors and simplify the row processors, leading to an overall smaller, faster, and more energy-efficient decoder. Full-parallel decoders, which are otherwise inefficient to build due to their high routing congestion and large circuit area, benefit the most from the Split decoding method. In this paper, we present the first complete overview of the Split decoding algorithm, architecture, and VLSI implementation.
The paper is organized as follows: Section 2 reviews LDPC codes and the message passing algorithm. In Section 3, LDPC decoder architectures are explained. Sections 4 and 5 introduce Split-Row and Multi-Split decoding methods, respectively, for regular permutation-based LDPC codes. The error performance comparisons for different codes with the multiple splitting method are shown in Section 6. The mapping architecture of the multiple splitting method is presented in Section 7. In Section 8 the results of three full-parallel decoders implemented with the proposed and standard decoding techniques are presented and compared.
2 LDPC Codes and Message Passing Decoding Algorithm
LDPC codes are defined by an M×N binary matrix called the parity check matrix H. The number of columns, N, defines the code length. The number of rows in H, M, defines the number of parity check constraints for the code. The information length K is K = N − M for full-rank matrices, otherwise K = N − rank. Column weight W_{c} is the number of ones per column and row weight W_{r} is the number of ones per row.
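As a concrete illustration of these definitions (a toy example, not from the paper), the following sketch computes N, M, and K = N − rank(H) for a small (2, 4) regular matrix. Gaussian elimination over GF(2) is used because the ordinary rank over the reals can differ from the binary rank:

```python
import numpy as np

def gf2_rank(H):
    """Rank of a binary matrix over GF(2) via Gaussian elimination."""
    A = (np.array(H, dtype=np.uint8) % 2).copy()
    rank = 0
    rows, cols = A.shape
    for col in range(cols):
        # Find a pivot row with a 1 in this column.
        pivot = next((r for r in range(rank, rows) if A[r, col]), None)
        if pivot is None:
            continue
        A[[rank, pivot]] = A[[pivot, rank]]
        # Eliminate the other 1s in this column (row XOR over GF(2)).
        for r in range(rows):
            if r != rank and A[r, col]:
                A[r] ^= A[rank]
        rank += 1
    return rank

# Toy (Wc, Wr) = (2, 4) parity check matrix with N = 8, M = 4.
H = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1],
], dtype=np.uint8)

N, M = H.shape[1], H.shape[0]
K = N - gf2_rank(H)   # information length; equals N - M only for full-rank H
```

Here the four rows sum to zero over GF(2), so rank(H) = 3 and K = 5, illustrating the rank-deficient case where K ≠ N − M.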
LDPC codes are commonly decoded by an iterative message passing algorithm which consists of two sequential operations: row processing or check node update and column processing or variable node update. In row processing, all check nodes receive messages from neighboring variable nodes, perform parity check operations and send the results back to neighboring variable nodes. The variable nodes update soft information associated with the decoded bits using information from check nodes, then send the updates back to the check nodes, and this process continues iteratively.
Sum-Product (SPA) [6] and MinSum (MS) [7] are widely-used decoding algorithms which we refer to as standard decoders in this paper. The following subsections describe these two algorithms in detail.
2.1 Sum Product Algorithm (SPA)
We assume a binary code word (x_{1},x_{2},...,x_{N}) is transmitted using a binary phase-shift keying (BPSK) modulation. Then the sequence is transmitted over an additive white Gaussian noise (AWGN) channel and the received symbol is (y_{1},y_{2},...,y_{N}).
- λ_{i} is the information derived from the log-likelihood ratio of received symbol y_{i}:
$$ \lambda_{i}=\ln\left(\frac{P\big(x_{i}=0\big|y_{i}\big)}{P\big(x_{i}=1\big|y_{i}\big)}\right) $$(1)
- α_{ij} is the message from check node i to variable node j. This is the row processing output.
- β_{ij} is the message from variable node j to check node i. This is the column processing output.

- 1) Initialization: For each i and j, initialize β_{ij} to the value of the log-likelihood ratio of the received symbol y_{j}, which is λ_{j}. During each iteration, α and β messages are computed and exchanged between variable nodes and check nodes through the graph edges according to the following steps numbered 2–4.
- 2) Row processing or check node update: Compute α_{ij} messages using β messages from all other variable nodes connected to check node C_{i}, excluding the β information from V_{j}:
$$ \alpha_{ijSPA} = \prod\limits_{j'\in V(i)\backslash j} sign\big(\beta_{ij'}\big) \times \phi\left(\sum\limits_{j'\in V(i)\backslash j}\phi\big(\big|\beta_{ij'}\big|\big)\right) $$(2)
where the non-linear function \(\phi(x)=-\log\left(\tanh\frac{|x|}{2}\right)\). The first (product) term in Eq. 2 is the parity (sign) bit update and the second term is the reliability (magnitude) update.
- 3) Column processing or variable node update: Compute β_{ij} messages using channel information (λ_{j}) and incoming α messages from all other check nodes connected to variable node V_{j}, excluding check node C_{i}:
$$ \beta_{ij} = \lambda_j+\sum\limits_{i'\in C(j)\backslash i} \alpha_{i'j} $$(3)
- 4) Syndrome check and early termination: When column processing is finished, every bit in column j is updated by adding the channel information (λ_{j}) and the α messages from neighboring check nodes:
$$ z_{j} = \lambda_{j}+\sum\limits_{i'\in C(j)}\alpha_{i'j} $$(4)
From the updated vector, an estimated code vector \(\hat{X}=\{\hat{x}_{1},\hat{x}_{2},...,\hat{x}_{N}\}\) is calculated by:
$$ \hat{x}_{j} = \begin{cases} 1, & \mbox{if } z_{j}\le 0 \\ 0, & \mbox{if } z_{j} > 0 \end{cases} $$(5)
If \(H \cdot \hat{X}^T=0\), then \(\hat{X}\) is a valid code word and therefore the iterative process has converged and decoding stops. Otherwise the decoding repeats from step 2 until a valid code word is obtained or the number of iterations reaches a maximum number, \(\mathit{Imax}\), which terminates the decoding process.
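Steps 1–4 above can be sketched as a simple software reference model. This is a dictionary-based Python sketch, not the hardware-oriented formulation; the (7, 4) Hamming code and LLR values below are illustrative choices, not from the paper:

```python
import numpy as np

def phi(x):
    # phi(x) = -log(tanh(|x|/2)); clip to avoid log(0) and overflow
    x = np.clip(np.abs(x), 1e-9, 30.0)
    return -np.log(np.tanh(x / 2.0))

def spa_decode(H, llr, max_iter=15):
    """Sum-Product decoding per steps 1-4 (Eqs. 1-5) for a small code.

    H   : (M, N) binary parity check matrix
    llr : length-N channel LLRs (lambda)
    """
    M, N = H.shape
    rows, cols = np.nonzero(H)
    beta = {(i, j): llr[j] for i, j in zip(rows, cols)}   # step 1: init
    alpha = {}
    for _ in range(max_iter):
        # Step 2: row processing (check node update), Eq. 2.
        for i in range(M):
            Vi = np.nonzero(H[i])[0]
            for j in Vi:
                others = [beta[(i, jp)] for jp in Vi if jp != j]
                sign = np.prod(np.sign(others))
                alpha[(i, j)] = sign * phi(np.sum(phi(np.array(others))))
        # Step 3: column processing (variable node update), Eq. 3.
        for j in range(N):
            Cj = np.nonzero(H[:, j])[0]
            for i in Cj:
                beta[(i, j)] = llr[j] + sum(alpha[(ip, j)] for ip in Cj if ip != i)
        # Step 4: decision (Eqs. 4-5) and syndrome check.
        z = llr + np.array([sum(alpha[(i, j)] for i in np.nonzero(H[:, j])[0])
                            for j in range(N)])
        x_hat = (z <= 0).astype(np.uint8)
        if not np.any((H @ x_hat) % 2):      # H . x^T = 0 -> converged
            return x_hat
    return x_hat

# (7, 4) Hamming code; all-zero word sent, bit 1 received unreliably.
H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])
llr = np.array([2.0, -0.5, 3.0, 1.5, 2.5, 1.0, 2.0])
decoded = spa_decode(H, llr)
```

With these LLRs the single unreliable bit is corrected and the early-termination syndrome check fires on the first iteration.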
2.2 MinSum Algorithm (MS)
3 LDPC Decoding Architectures
The message passing algorithm is inherently parallel because row processing operations are fully independent with respect to each other, and the same is true for column processing operations.
3.1 Serial Decoders
Serial decoders process one word at a time by using one row and one column processor. Although they have minimal hardware requirements, they also have a large decoding latency and low throughput. A 4,096-bit serial LDPC convolutional decoder [11] is implemented on an Altera Stratix FPGA with \(\emph{pfraction}=\big(\tfrac{3}{\text{4,096}+\text{2,048}}\big)=0.00049\). The decoder utilizes only 4K logic elements and 776 Kbit memory, runs at 180 MHz, and delivers 9 Mbps throughput.
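The parallelism fraction used throughout this section is simply the number of processing units divided by the N + M Tanner-graph nodes of the code. A small helper (illustrative; the figures match those quoted in the text):

```python
def pfraction(num_processors, n, m):
    """Parallelism fraction: processing units relative to the N + M
    Tanner-graph nodes of the code (1.0 = full-parallel)."""
    return num_processors / (n + m)

# Serial convolutional decoder [11]: 3 processors, N = 4,096, M = 2,048.
serial = pfraction(3, 4096, 2048)
# 47 Gbps 10GBASE-T decoder: 2,048 + 64 processors, N = 2,048, M = 384.
parallel = pfraction(2048 + 64, 2048, 384)
```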
3.2 Partial-Parallel Decoders
Partial-parallel decoders [12, 13, 14, 15, 16, 17, 18] contain multiple processing units and shared memories. A major challenge is efficiently handling simultaneous memory accesses into the shared memories. Following are details of ten partial-parallel decoders containing 3–2,112 processors with pfraction from 0.001–0.87.
Two 2,048-bit partial-parallel decoders compliant with the 10GBASE-T standard are designed with high parallelism. The first is a 47 Gbps decoder chip designed with 2,048 column processors and 64 row processors. It has a \(\emph{pfraction}=\big(\tfrac{\text{2,048}+64}{\text{2,048}+384}\big)=0.87\) and occupies 5.35 mm^{2} in 65 nm CMOS. The second decoder is designed using a reduced routing complexity decoding method called Sliced Message Passing. It utilizes 512 column processors and 384 row processors, has a \(\emph{pfraction}=\big(\tfrac{512+384}{\text{2,048}+384}\big)=0.37\), occupies 14.5 mm^{2}, and delivers 5.3 Gbps in 90 nm.
A multi-rate 2,048-bit programmable partial-parallel decoder chip [19] has a \(\emph{pfraction}=\big(\tfrac{64}{\text{2,048}+\text{1,024}}\big)=0.02\), utilizes about 50 Kbit of memory, occupies 14.3 mm^{2}, and delivers 640 Mbps in 0.18 \(\upmu\)m technology. An FPGA implementation of an 8,176-bit decoder [20] has a \(\emph{pfraction}=\big(\tfrac{36}{\text{8,176}+\text{1,024}}\big)=0.004\) and achieves a decoding throughput of 172 Mbps. A 1,536-bit memory-bank based decoder [13] utilizes about 540 Kbit of memory and has \(\emph{pfraction}=\big(\tfrac{3}{\text{1,536}+768}\big)=0.001\); a Virtex-II FPGA implementation of the decoder runs at 125 MHz and delivers 98.7 Mbps. A 64,800-bit DVB-S2 decoder chip in 65 nm CMOS utilizes 180 processors and 3.1 Mb of memory, attains a throughput of 135 Mbps [21], occupies 6.07 mm^{2}, and handles 21 different codes, so its \(\emph{pfraction}\) ranges from 0.001–0.01. A 600-bit LDPC-COFDM chip [22] employs 50 row processors and 150 column processors, has \(\emph{pfraction}=\big(\tfrac{200}{600+450}\big)=0.19\), delivers 480 Mbps, and occupies 21.45 mm^{2} in 0.18 \(\upmu\)m CMOS. A 6,912-bit decoder [23] implemented on a Virtex-4 FPGA utilizes 64 processors with 46 BlockRAMs, runs at 181 MHz, has 3.86–4.31 Gbps throughput, and has \(\emph{pfraction}\) from 0.005–0.007. A 32-processor, 32-memory decoder [24] supports both IEEE 802.11n and IEEE 802.16e codes, occupies 3.88 mm^{2}, delivers 31.2–64.4 Mbps in 0.13 \(\upmu\)m technology, and has \(\emph{pfraction}\) of 0.003–0.1. A multi-rate, multi-length decoder [25] has 18 processors and \(\emph{pfraction}\) of 0.005–0.01; it runs at 100 MHz and delivers 60 Mbps on a Virtex-II FPGA.
3.3 Full-Parallel Decoders
Full-parallel decoders directly map each node in the Tanner graph to a different processing unit and thus \(\mathit{pfraction}=1\). They provide the highest throughputs and require no memory to store intermediate messages. The greatest challenges in their implementation are large circuit area and routing congestion which are caused by the large number of processing units and the very large number of wires between them.
A 1,024-bit full-parallel decoder chip [26] occupies 52.5 mm^{2}, runs at 64 MHz and delivers 1 Gbps throughput in 0.16 \(\upmu\)m technology. Two full-parallel decoders designed for 1,536-bit and 2,048-bit LDPC codes [27, 28] occupy 16.8 and 43.9 mm^{2} and deliver 5.4 and 7.1 Gbps, respectively in 0.18 \(\upmu\)m technology. A 660-bit decoder chip [29] occupies 9 mm^{2}, runs at 300 MHz and obtains 3.3 Gbps throughput in 0.13 \(\upmu\)m technology. A full-parallel decoder designed for a family of codes with different rates and code lengths up to 1,024 bits, attains 2.4 Gbps decoding throughput [30].
Previous studies on reducing wire interconnect complexity are based on reformulating the message passing algorithm. The SPA decoder can be reformulated so that, instead of sending distinct α values, each check node sends only the summation value in Eq. 2 to its variable nodes; the α messages are then recovered by post-processing in the variable nodes. This results in a 26% reduction in total global wire length [31]. A further reformulation has both check nodes and variable nodes send the summation values in Eqs. 2 and 3, respectively, to each other [32]. MinSum has been reformulated so that each check node sends only the minimum values to its variable nodes, which results in a 90% reduction in outgoing wires from check nodes [33]. These architectures require more processing in row and column processors, along with additional storage to recover the α and β messages, and therefore unfortunately result in larger decoder areas.
4 Proposed Split-Row Decoding Method
The Split-Row decoding method is proposed to facilitate hardware implementations with high throughput, high hardware efficiency, and high energy efficiency.
This architecture has two major benefits: 1) it decreases the number of inputs and outputs per row processor, resulting in many fewer wires between row and column processors, and 2) it makes each row processor much simpler because the outputs are a function of fewer inputs. These two factors make the decoder smaller, faster, and more energy efficient. In the following subsections, we show that Split-Row introduces some error into the magnitude calculation of the row processing outputs, and that the error can be largely compensated with a correction factor.
4.1 SPA Split
- 1. α_{ijSPASplit} and α_{ijSPA} have the same sign, and
- 2. |α_{ijSPASplit}| ≥ |α_{ijSPA}|.
4.2 MinSum Split
5 Multi-Split Decoding Method
6 Correction Factor and Error Performance Simulation Results
6.1 Split-Row Correction Factors
Finding the optimal correction factor for the Split-Row algorithm that results in the best error performance requires complex analysis such as density evolution [34]. For simplicity and to account for realistic hardware effects, the correction factors presented in this paper are determined empirically based on bit error rate (BER) results for various SNR values and numbers of decoding iterations.
As the number of partitions increases, a smaller correction factor should be used to normalize the error magnitude of the row processing outputs in each partition. For SPA Multi-Split, as the number of partitions increases, the summation on the left side of Eq. 10 decreases in each partition; since ϕ(x) is a decreasing function, the summation on the left side of Eq. 11 becomes larger, which results in larger-magnitude row processing outputs in each partition. For MS Multi-Split, in every partition except the one containing the global minimum, the difference between the local minimum and the global minimum grows as the number of partitions increases. Thus, the average row processor output magnitude grows with the number of partitions, and a smaller correction factor is required to normalize the row processing outputs in each partition.
Achieving an absolute minimum error performance would require a different correction factor for each row processor output—but this is impractical because it would require knowledge of unavailable information such as row processor inputs in other partitions. Since significant benefit comes from the minimization of communication between partitions, we assume a constant correction factor for all row processing outputs. This is the primary cause of the error performance loss and slower convergence rate of Split-Row.
Since the error performance improvement is small (≤0.07 dB) when a decoder uses multiple correction factors for different SNR values, we use the average value as the correction factor for the error performance simulations in this paper.
Average optimal correction factor S for different constructed regular codes.

| (N, K) | (W_{c}, W_{r}) | SP-2 | SP-4 | SP-6 | SP-8 | SP-12 |
|---|---|---|---|---|---|---|
| (1536, 770) | (3, 6) | 0.45 | + | − | + | + |
| (1008, 507) | (4, 8) | 0.35 | − | + | − | + |
| (1536, 1155) | (4, 16) | 0.4 | 0.25 | + | − | + |
| (8088, 6743) | (4, 24) | 0.4 | 0.27 | 0.22 | − | − |
| (2048, 1723) | (6, 32) | 0.3 | 0.19 | + | 0.15 | + |
| (16352, 14329) | (6, 32) | 0.4 | 0.25 | + | 0.17 | + |
| (8176, 7156) | (4, 32) | 0.4 | 0.24 | + | 0.17 | + |
| (5248, 4842) | (5, 64) | 0.35 | 0.25 | + | 0.2 | + |
| (5256, 4823) | (6, 72) | 0.35 | 0.2 | 0.18 | 0.15 | 0.14 |
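The MinSum Multi-Split row processing can be sketched as follows. This is a simplified model based on the description in this paper (a local minimum per partition, only sign information crossing partition boundaries, and a constant correction factor S); the exact partition wiring is the subject of Sections 4 and 5, and the row weight and S value below are illustrative:

```python
import numpy as np

def minsum_split_row(betas, num_splits, s):
    """One check-node (row) update under MinSum Multi-Split: each
    partition finds its own local min(|beta|), only sign information
    crosses partition boundaries, and the constant correction factor s
    normalizes the over-estimated magnitudes (simplified sketch)."""
    betas = np.asarray(betas, dtype=float)
    global_sign = np.prod(np.sign(betas))        # parity crosses partitions
    alphas = np.empty_like(betas)
    for part in np.array_split(np.arange(len(betas)), num_splits):
        mags = np.abs(betas[part])
        for k, j in enumerate(part):
            # Local min excluding beta_ij itself, scaled by s.
            others = np.delete(mags, k)
            # Product of all signs except beta_ij's own (signs are +/-1).
            sign_others = global_sign * np.sign(betas[j])
            alphas[j] = sign_others * s * np.min(others)
    return alphas

# Row weight 8, Split-2, correction factor S = 0.35 (cf. the table above).
b = [1.0, -2.0, 0.5, 3.0, 4.0, -1.5, 2.5, 6.0]
a = minsum_split_row(b, num_splits=2, s=0.35)
```

Because each partition sees only its own inputs, the output magnitudes in partitions without the global minimum are over-estimated, which is exactly what the correction factor S compensates for.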
6.2 Error Performance Results
All simulations assume an additive white Gaussian noise channel with BPSK modulation. The BER results presented here were obtained from simulation runs with more than 100 error blocks each, with a maximum of 15 iterations (\(\mathit{Imax}=15\)); runs were terminated early when a zero syndrome was detected for the decoded codeword.
7 Full-Parallel MinSum Multi-Split Decoders
8 Decoder Implementation Example and Results
To precisely quantify the benefits of the Split-Row and Multi-Split algorithms when built into hardware, we have implemented three MinSum full-parallel decoders for the (2,048, 1,723) 10GBASE-T code using MinSum normalized, Split-2 and Split-4 methods. The decoders were developed using Verilog to describe the architecture and hardware, synthesized with Synopsys Design Compiler, and placed and routed using Cadence SOC Encounter. All designs were created in ST Microelectronics’ 65 nm, 1.3 V low-leakage, seven-metal layer CMOS.
Summary of the key parameters of the implemented (6,32) (2,048, 1,723) 10GBASE-T LDPC code.
Code length, No. of columns (N) | 2,048 |
Information length (K) | 1,723 |
Parity check equations, No. of rows (M) | 384 |
Row weight (W_{r}) | 32 |
Column weight (W_{c}) | 6 |
Size of permutations | 64 |
8.1 Effects of Fixed-Point Number Representation
Although there have been several studies on quantization effects in LDPC decoders [40, 10], as a baseline study of word-length effects in a decoder's datapath we uniformly vary the word widths of the λ, α, and β messages. For a fixed-point datapath width of q bits, the majority of the decoder's hardware complexity can be roughly estimated from the wires going to and from the column and row processors. For M row processors, the total number of word busses that pass α messages is M×W_{r}, while the N column processors that pass β messages require N×W_{c} busses. Therefore, the total number of global communication wires is q×(M×W_{r} + N×W_{c}). Increasing the datapath word width from a 5-bit to a 6-bit fixed-point representation (4.1 and 4.2 formats, respectively) increases the number of global wires by M×W_{r} + N×W_{c}. However, the complexity caused by additional wires is not a simple linear relationship: when designed in a chip, every additional wire results in a super-linear increase in circuit area and delay [26].
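Plugging the (6, 32) (2,048, 1,723) code parameters into the wire count expression above:

```python
# Global wire count q * (M*Wr + N*Wc) for the (6, 32) (2,048, 1,723) code.
M, N, Wr, Wc = 384, 2048, 32, 6

def global_wires(q):
    return q * (M * Wr + N * Wc)

wires_5bit = global_wires(5)     # 5-bit (4.1) datapath
wires_6bit = global_wires(6)     # 6-bit (4.2) datapath
extra = wires_6bit - wires_5bit  # per-added-bit increase: M*Wr + N*Wc
```

For this code M×W_{r} = N×W_{c} = 12,288, so each added datapath bit costs 24,576 additional global wires.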
On the other hand, using wider fixed-point words improves the error performance. BER simulations show an approximate 0.07–0.09 dB improvement in all three decoders when using 6-bit words (4.2) instead of 5-bit words (4.1). To achieve this improved performance for MinSum normalized with one additional bit, the number of wires increases by M ×W_{r} + N ×W_{c}, but for Multi-Split the increase is only \(M \times W_r + ( N/\mathit{Spn} ) \times W_c\) per block. Synthesis results for a 6-bit implementation of Split-2 and Split-4 show that the row and column processors have a 12% and 8% area increase respectively, without any reduction in clock rate, compared to a 5-bit implementation using the same constraints. Thus, the error performance loss of the Split-2 and Split-4 decoders can be reduced by using a larger fixed-point word with a small area penalty.
8.2 Area, Throughput and Power Comparison
Comparison of the three full-parallel decoders implemented in 65 nm CMOS for a (6, 32) (2048, 1723) code.
| | MinSum normalized | Split-2 MinSum | Split-4 MinSum |
|---|---|---|---|
| CMOS fabrication process | 65 nm CMOS, 1.3 V | 65 nm CMOS, 1.3 V | 65 nm CMOS, 1.3 V |
| Area utilization (%) | 38% | 50% | 85% |
| Average wire length \(({\upmu}{m})\) | 175.2 | 115.5 | 73.8 |
| Area per sub-block (mm^{2}) | 20 | 6.9 | 1.5 |
| Total layout area (mm^{2}) | 20 | 13.8 | 6.1 |
| % area for row processors | 13.2% | 19.2% | 41.3% |
| % area for column processors | 8.0% | 11.6% | 26.0% |
| % area for registers and clock tree | 16.8% | 19.2% | 17.7% |
| % area without standard cells | 62.0% | 50.0% | 15.0% |
| Maximum clock rate (MHz) | 59 | 110 | 146 |
| Power dissipation (mW) | 1,941 | 2,179 | 1,889 |
| Throughput @ \(\mathit{Imax}=15\) (Gbps) | 8.1 | 15.0 | 19.9 |
| Energy per bit @ \(\mathit{Imax}=15\) (pJ/bit) | 241 | 145 | 95 |
| Average iterations (\(\mathit{Iavg}\)) @ BER = 3×10^{ − 5}, \(\mathit{Imax}=15\) | 3.8 | 4.8 | 4.9 |
| Throughput @ \(\mathit{Iavg}\) (Gbps) | 31.8 | 46.9 | 61.0 |
| Energy per bit @ \(\mathit{Iavg}\) (pJ/bit) | 61 | 46 | 31 |
To achieve a fair comparison between all three architectures, a common CAD tool design flow was adopted. The synthesis, floorplan, and place and route stages of the layout were automated with minimal designer intervention.
Since Split-Row reduces row processor area and eliminates significant communication between row and column processors (causing them to operate as smaller nearly-independent groups), layout becomes much more compact and automatic place and route tools can converge towards a better solution in a much shorter period of time.
As shown in Table 3, Split-4 achieves a high area utilization (the ratio of standard cell area to total chip area) and a short average wire length compared to the MinSum normalized decoder whose many global row and column processor interconnections force the place and route tool to spread standard cells apart to provide sufficient space for routing.
As an additional illustration, Table 3 provides a breakdown of the basic contributors to layout area, which shows the dramatic decrease in % area without standard cells (i.e., chip area occupied only by wires) as the level of splitting increases.
The critical path delay in Split-4 is about 2.3 times shorter than that of MinSum normalized. Place and route timing analysis and extracted delay/parasitic annotation files (i.e., SDF) show that the critical path delay is composed primarily of a long series of buffers and wire segments. Some buffers have long RC delays due to the large fanouts of their outputs. For the MinSum decoder, the sum of interconnect delays caused by buffers and wires (intrinsic gate delay and RC delay) is 13.1 ns. In Split-2 and Split-4, the total interconnect delays are 5.1 ns and 2.2 ns, respectively, which are 2.6 and six times smaller than that of MinSum. Thus, Split-4's speedup over MinSum normalized is due in part to its simplified row processing, but the major contributor is the significant reduction in column/row processor interconnect delay.
To summarize Split-Row’s benefits, the Split-4 decoder occupies 6.1 mm^{2}, which is 3.3 times smaller than MinSum normalized. It runs at 146 MHz and with 15 iterations it attains 19.9 Gbps decoding throughput which is 2.5 times higher, while dissipating 95 pJ/bit—a factor of 2.5 times lower than MinSum normalized.
Although it is not possible to exactly quantify the benefit of chip area reductions, chip silicon area is a critical parameter in determining chip costs. For example, reducing die area by a factor of 2 results in a die cost reduction of more than two times when considering the cost of the wafer and die yield [41]. Other chip production costs such as packaging and testing are also significantly reduced with smaller chip area.
At a supply voltage of 0.79 V, the Split-4 decoder runs at 47 MHz and achieves the minimum 6.4 Gbps throughput required by the 10GBASE-T standard [2]. Power dissipation is 226 mW at this operating point. These estimates are based on measured data from a chip that was recently fabricated on the exact same process and operates correctly down to 0.675 V [42].
8.3 Wire Statistics
The total number of sign-passing wires between sub-blocks in the Multi-Split methods is 2(Spn − 1)M. For these decoders where M = 384, the sign wires in Split-2 are only 0.12% of the total number of wires and in Split-4 they are only 0.30% of the total.
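The sign-wire counts follow directly from the expression above (the quoted 0.12% and 0.30% fractions are relative to each decoder's total wire count as reported by the layout tools, which is not recomputable from this formula alone):

```python
# Sign-passing wires between sub-blocks: 2 * (Spn - 1) * M.
M = 384  # number of rows in the (2,048, 1,723) 10GBASE-T code

def sign_wires(spn):
    return 2 * (spn - 1) * M

split2 = sign_wires(2)   # Split-2
split4 = sign_wires(4)   # Split-4
```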
The source of Multi-Split's benefits is now clear: the method breaks row processors into multiple blocks whose internal wires are all relatively short. These blocks are interconnected by a small number of sign wires. The result is denser, faster, and more energy-efficient circuits.
8.4 Analysis of Maximum and Average Numbers of Decoding Iterations
The maximum number of decoding iterations strongly affects the best-case error performance, the maximum achievable decoder throughput, and the worst-case energy consumption. Fortunately, the majority of frames require only a few decoding iterations to converge (especially at high SNRs). By detecting early decoder convergence, throughput and energy can potentially improve significantly while maintaining the same error performance. Early convergence detection is done by a syndrome check circuit [14, 43] which checks the decoded bits every cycle (see Fig. 11) and terminates the decoding process when convergence is detected. Decoding of a new frame can begin if one is available.
Post-layout results show that the syndrome check block for a (2,048, 1,723) code occupies only approximately 0.1 mm^{2} and its maximum delay is 2 ns. By adding a pipeline stage for the syndrome check, the block’s delay does not add at all to the critical path delay of the decoder.
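The syndrome check itself is just a sparse binary matrix-vector product; a minimal software model (the small matrix here is for illustration only):

```python
import numpy as np

def syndrome_ok(H, x_hat):
    """Early-termination test: stop decoding once H . x_hat^T = 0 (mod 2)."""
    return not np.any((H @ x_hat) % 2)

# Toy parity check matrix and candidate decoded vectors.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]], dtype=np.uint8)
valid = syndrome_ok(H, np.array([0, 0, 0, 0, 0, 0]))    # all-zero code word
flipped = syndrome_ok(H, np.array([1, 0, 0, 0, 0, 0]))  # single bit error
```

In hardware, each of the M parity checks reduces to an XOR tree over its W_{r} decoded bits, which is why the block is small and easily pipelined.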
It is interesting to compare decoders at the same BER. From Fig. 18, Split-2 at \(\mathit{Imax}=20\) and MinSum normalized at \(\mathit{Imax}=5\) both have nearly the same BER. But the Split-2 implementation has 1.2 to 1.3 times higher throughput while consuming 1.1 times lower energy for SNR values larger than 4.1 dB. Similarly, Split-4 at \(\mathit{Imax}=15\) and MinSum normalized at \(\mathit{Imax}=3\) have nearly equal BER, but Split-4 has 1.1 to 1.3 times greater throughput and 1.1 to 1.4 times lower energy dissipation for SNR values larger than 4.1 dB.
In summary, with the same maximum number of decoding iterations (\(\mathit{Imax}\)) and at the same BER, the average number of decoding iterations (\(\mathit{Iavg}\)) of Split-2 and Split-4 are larger than that of MinSum normalized, but they still have larger throughput and energy efficiency at high SNR values. The maximum number of decoding iterations for MinSum normalized can be lowered until it obtains the same BER as Split-2 and Split-4. Even when MinSum normalized operates with a much lower number of iterations, Split-2 and Split-4 have higher throughput and energy efficiencies for most SNR values. In addition, Split-2 and Split-4 require 1.4 times and 3.3 times smaller circuit area, respectively, than the MinSum normalized decoder.
8.5 Comparison with Other Chips
Comparison of the Split-4 decoder with published full-parallel LDPC decoder Chips.
The (3.25, 6.5) (1,024, 512) full-parallel decoder by Blanksby [26] (average row and column weights are given) uses a hierarchical, hand-placed design flow for routing and timing optimization. The bit-serial (4, 15) (660, 480) full-parallel decoder by Darabiha [29] transfers messages serially between the processing units to reduce routing congestion.
Although we are comparing only full-parallel decoders with each other, it is still challenging to fairly compare these decoders since they implement different LDPC codes (including code length, row weight, and column weight), different rates, and different CMOS technologies. Basic metrics such as throughput, energy, and circuit area are unfortunately complex functions of these parameters.
Table 4 gives the number of edges in the LDPC code for each of the decoders. The number of global wires in a full-parallel decoder is proportional to this value, so it gives a good first-order estimate of a full-parallel decoder's complexity. Using this metric, the Split-4 decoder's code is 4.6 and 5.8 times more complex than the other two decoders' codes.
A rough area comparison can be made by linearly normalizing the Total chip area of decoders to 61,440/No. of edges in LDPC code and normalizing quadratically with feature size, i.e., (65 nm/Min. feature size)^{2}. Scaling results in 40 mm^{2} for the (1,024, 512) full-parallel decoder [26] and 13.1 mm^{2} for the (660, 480) full-parallel decoder [29]. It is important to note that scaling linearly with the Total row/col processor input bits factor favors simpler codes since decoder circuit area grows faster than this factor due to the limited routing resources in VLSI implementations. Nevertheless, the Split-4 decoder is 6.2 and two times smaller than the scaled (1,024, 512) and (660, 480) full-parallel decoders respectively.
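The normalization above can be reproduced as follows. The per-chip input-bit counts are back-calculated here from the quoted 4.6 and 5.8 times complexity ratios (3,328 and 2,640 code edges at an assumed 4-bit word width); they are not taken from Table 4 directly:

```python
# Normalize the other full-parallel chips to this paper's 61,440
# row/col processor input bits and 65 nm node (Section 8.5).

def scaled_area(area_mm2, feature_nm, input_bits, ref_bits=61440, ref_nm=65):
    # Linear in input-bit count, quadratic in feature size.
    return area_mm2 * (ref_nm / feature_nm) ** 2 * (ref_bits / input_bits)

# (1,024, 512) decoder [26]: 52.5 mm^2 in 0.16 um; assumed 3,328 edges x 4 bits.
blanksby = scaled_area(52.5, 160, 3328 * 4)
# (660, 480) decoder [29]: 9 mm^2 in 0.13 um; assumed 2,640 edges x 4 bits.
darabiha = scaled_area(9.0, 130, 2640 * 4)
```

These assumptions reproduce the 40 mm^{2} and 13.1 mm^{2} scaled areas quoted in the text.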
For energy comparisons, all decoders are scaled to 65 nm operating on a 1.3 V supply voltage. Scaling linearly with feature size and quadratically with supply voltage gives energy per bit of 210.5 pJ/bit for the (1,024, 512) full-parallel decoder [26] and 213.5 pJ/bit for the (660, 480) full-parallel decoder [29]. The Split-4 decoder with its more complex code operates with an energy per bit that is 2.2 times lower and has 0.55 dB error performance degradation compared to the other two decoders.
In addition, with early termination enabled, Split-4 delivers 61 Gbps throughput and dissipates 31 pJ/bit at SNR = 4.4 dB (see Table 3). When compared to the state-of-the-art 47.7 Gbps, 58.7 pJ/bit partial-parallel 10GBASE-T decoder [17], which is built in 65 nm and operates at 1.2 V, Split-4 has 1.3 times higher throughput, is 1.9 times more energy efficient, and is 1.1 times larger, with an error performance degradation of 0.60 dB.
9 Conclusion
The proposed Split-Row and Multi-Split algorithms are viable approaches for high throughput, small area, and low power LDPC decoders, with a small error performance degradation that is acceptable for many applications—especially in mobile designs that typically have severe power and cost constraints. The method is especially well suited for long-length regular codes and codes with high row weights. Compared to standard (MinSum and SPA) decoding, the error performance loss of the method is about 0.35–0.65 dB for the implemented (2,048, 1,723) code, depending on the level of splitting.
The proposed algorithm and architecture break row processors into multiple blocks whose internal wires are all relatively short. These blocks are interconnected by a small number of sign wires whose lengths are almost zero. The result is decoders with denser, faster and more energy efficient circuits.
We have demonstrated the significant benefits of the splitting methods by implementing three decoders using MinSum normalized, MinSum Split-2, and MinSum Split-4 for the 2,048-bit code used in the 10GBASE-T 10 Gigabit ethernet standard. Post-layout simulation results show that the Split-4 decoder is 3.3 times smaller, attains 2.5 times higher throughput, and dissipates 2.5 times less energy per bit compared to a MinSum normalized decoder while performing 0.55 dB away from MinSum normalized at BER = 5 ×10^{ − 8} with 15 decoding iterations.
Using early termination circuits, the average number of decoding iterations in the Split-4 decoder is about 1.3 times larger than that of the MinSum normalized decoder. With early termination enabled, the Split-4 decoder’s throughput is 1.9 times higher and its energy dissipation per bit is 2.0 times lower compared to the MinSum decoder at BER = 3×10^{ − 5}.
Increasing the number of decoding iterations and increasing the fixed-point word width reduce the error performance loss in the Split-2 and Split-4 decoders. With a maximum of 20 decoding iterations, the error performance loss of the Split-2 decoder is reduced to 0.25 dB compared to MinSum normalized, while it still achieves higher throughput and occupies a smaller circuit area.
Notes
Acknowledgements
The authors gratefully acknowledge support from ST Microelectronics, Intel, UC Micro, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grant 1598 and CSR Grant 1659, Intellasys, Texas Instruments, IBM, SEM, and a UCD Faculty Research Grant; LDPC codes and assistance from Shu Lin and Lan Lan; and thank Zhengya Zhang, Dean Truong, Aaron Stillmaker, Lucas Stillmaker, Jean-Pierre Schoellkopf, Patrick Cogez, and Pascal Urard.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
- 1. Gallager, R. G. (1962). Low-density parity check codes. IRE Transactions on Information Theory, IT-8, 21–28.
- 2. IEEE P802.3an, 10GBASE-T task force. http://www.ieee802.org/3/an.
- 3. T.T.S.I. digital video broadcasting (DVB) second generation framing structure for broadband satellite applications. http://www.dvb.org.
- 4. IEEE 802.16e (2005). Air interface for fixed and mobile broadband wireless access systems. IEEE P802.16e/D12 draft.
- 5. Tanner, R. M. (1981). A recursive approach to low complexity codes. IEEE Transactions on Information Theory, 27, 533–547.
- 6. MacKay, D. J. (1999). Good error correcting codes based on very sparse matrices. IEEE Transactions on Information Theory, 45, 399–431.
- 7. Fossorier, M., Mihaljevic, M., & Imai, H. (1999). Reduced complexity iterative decoding of low-density parity check codes based on belief propagation. IEEE Transactions on Communications, 47, 673–680.
- 8. Hagenauer, J., Offer, E., & Papke, L. (1996). Iterative decoding of block and convolutional codes. IEEE Transactions on Information Theory, 42, 429–445.
- 9. Chen, J., & Fossorier, M. (2002). Near optimum universal belief propagation based decoding of low-density parity check codes. IEEE Transactions on Communications, 50, 406–414.
- 10. Chen, J., Dholakia, A., Eleftheriou, E., & Fossorier, M. (2005). Reduced-complexity decoding of LDPC codes. IEEE Transactions on Communications, 53, 1288–1299.
- 11. Bates, S., Chen, Z., et al. (2008). A low-cost serial decoder architecture for low-density parity-check convolutional codes. IEEE Transactions on Circuits and Systems I, 55, 1967–1976.
- 12. Yang, L., Liu, H., & Shi, R. (2006). Code construction and FPGA implementation of a low-error-floor multi-rate low-density parity-check decoder. IEEE Transactions on Circuits and Systems I, 53, 892.
- 13. Dai, Y., Chen, N., & Yan, Z. (2008). Memory efficient decoder architectures for quasi-cyclic LDPC codes. IEEE Transactions on Circuits and Systems I, 55, 2898–2911.
- 14. Shih, X., Zhan, C., Lin, C., & Wu, A. (2008). An 8.29 mm^{2} 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 µm CMOS process. JSSC, 43, 672–683.
- 15. Liu, C. H., et al. (2008). An LDPC decoder chip based on self-routing network for IEEE 802.16e applications. JSSC, 43, 684–694.
- 16. Liu, L., & Shi, R. (2008). Sliced message passing: High throughput overlapped decoding of high-rate low density parity-check codes. IEEE Transactions on Circuits and Systems I, 55, 3697–3710.
- 17. Zhang, Z., Dolecek, L., et al. (2009). A 47 Gb/s LDPC decoder with improved low error rate performance. In Symposium on VLSI circuits (pp. 22–23).
- 18. Wang, Z., Li, L., et al. (2009). Efficient shuffle network architecture and application for WiMAX LDPC decoders. IEEE Transactions on Circuits and Systems II: Express Briefs, 56, 215–219.
- 19. Mansour, M., & Shanbhag, N. R. (2006). A 640-Mb/s 2048-bit programmable LDPC decoder chip. JSSC, 41, 684–698.
- 20. Wang, Z., & Cui, Z. (2007). Low-complexity high-speed decoder design for quasi-cyclic LDPC codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15, 104–114.
- 21. Urard, P., Paumier, L., et al. (2008). A 360 mW 105 Mb/s DVB-S2 compliant codec based on 64800b LDPC and BCH codes enabling satellite-transmission portable devices. In ISSCC (pp. 310–311).
- 22. Liu, H., Lin, C., et al. (2005). A 480 Mb/s LDPC-COFDM-based UWB baseband transceiver. In ISSCC (Vol. 1, pp. 444–445).
- 23. Fewer, C., Flanagan, F., & Fagan, A. (2007). A versatile variable rate LDPC codec architecture. IEEE Transactions on Circuits and Systems I, 54, 2240–2251.
- 24. Masera, G., Quaglio, F., & Vacca, F. (2007). Implementation of a flexible LDPC decoder. IEEE Transactions on Circuits and Systems I, 54, 542–546.
- 25. Zhang, H., Zhu, J., Shi, H., & Wang, D. (2008). Layered approx-regular LDPC code construction and encoder/decoder design. IEEE Transactions on Circuits and Systems I, 55, 572–585.
- 26. Blanksby, A., & Howland, C. J. (2002). A 690-mW 1-Gb/s 1024-b, rate 1/2 low-density parity-check code decoder. JSSC, 37(3), 404–412.
- 27. Mohsenin, T., & Baas, B. (2006). Split-Row: A reduced complexity, high throughput LDPC decoder architecture. In ICCD (pp. 13–16).
- 28. Mohsenin, T., & Baas, B. (2007). High-throughput LDPC decoders using a multiple split-row method. In ICASSP (Vol. 2, pp. 13–16).
- 29. Darabiha, A., Carusone, A. C., & Kschischang, F. R. (2007). A 3.3-Gbps bit-serial block-interlaced Min-Sum LDPC decoder in 0.13-µm CMOS. In IEEE custom integrated circuits conference (pp. 459–462).
- 30. Kim, E., Jayakumar, N., Bhagwat, P., & Khatri, S. P. (2006). A high-speed fully-programmable VLSI decoder for regular LDPC codes. In International conference on acoustics, speech, and signal processing (Vol. 3, pp. 972–975).
- 31. Darabiha, A., Carusone, A. C., & Kschischang, F. R. (2008). Block-interlaced LDPC decoders with reduced interconnect complexity. IEEE Transactions on Circuits and Systems Part II: Express Briefs, 55, 74–78.
- 32. Kang, S., & Park, I. (2006). Loosely coupled memory-based decoding architecture for low density parity check codes. IEEE Transactions on Circuits and Systems I, 53, 1045–1056.
- 33. Cui, Z., & Wang, Z. (2007). Efficient message passing architecture for high throughput LDPC decoder. In ISCAS (pp. 917–920).
- 34. Richardson, T., & Urbanke, R. (2001). The capacity of low-density parity check codes under message-passing decoding. IEEE Transactions on Information Theory, 47, 599–618.
- 35. Djurdjevic, I., Xu, J., Abdel-Ghaffar, K., & Lin, S. (2003). A class of low-density parity-check codes constructed based on Reed–Solomon codes with two information symbols. IEEE Communications Letters, 7, 317–319.
- 36. Chen, L., Xu, J., Djurdjevic, I., & Lin, S. (2004). Near-Shannon-limit quasi-cyclic low-density parity-check codes. IEEE Transactions on Communications, 52, 1038–1042.
- 37. Kou, Y., Lin, S., & Fossorier, M. P. C. (2001). Low-density parity-check codes based on finite geometries: A rediscovery and new results. IEEE Transactions on Information Theory, 47(7), 2711–2736.
- 38. Zhang, J., & Fossorier, M. P. C. (2004). A modified weighted bit-flipping decoding of low-density parity-check codes. IEEE Communications Letters, 8, 165–167.
- 39. Gunnam, K. K., et al. (2006). Decoding of quasi-cyclic LDPC codes using an on-the-fly computation. In 40th Asilomar conference on signals, systems and computers (pp. 1192–1199).
- 40. Zhang, Z., Venkat, A., et al. (2007). Quantization effects in low-density parity-check decoders. In ICC (pp. 6231–6237).
- 41. Rabaey, J., Chandrakasan, A., & Nikolic, B. (2003). Digital integrated circuits (2nd ed.). Upper Saddle River: Prentice Hall.
- 42. Truong, D. N., Cheng, W. H., et al. (2009). A 167-processor computational platform in 65 nm CMOS. IEEE Journal of Solid-State Circuits (JSSC), 44(4), 1130–1144.
- 43. Darabiha, A., Carusone, A. C., & Kschischang, F. R. (2008). Power reduction techniques for LDPC decoders. JSSC, 43, 1835–1845.