Journal of Signal Processing Systems, Volume 89, Issue 2, pp 209–224

Computational Complexity Analysis of FEC Decoding on SDR Platforms

Open Access Article

Abstract

Computational complexity evaluation is necessary for software defined Forward Error Correction (FEC) decoders; however, little literature addresses FEC complexity evaluation using analytical methods. In this paper, three highly efficient coding schemes, Turbo, QC-LDPC and Convolutional Codes (CC), are investigated. Hardware-friendly decoding pseudo-codes are provided with explicit parallel execution and memory access procedures. For each step of the pseudo-codes, the parallelism and the operations in each processing element are given, from which the total number of operations is derived. The decoding complexity of these FEC algorithms is compared, and the share of each computation step is illustrated. The requirements for attaining the evaluated results and reference hardware platforms are provided, and benchmarks of state-of-the-art SDR platforms are compared with the proposed evaluations. The analytical FEC complexity results are beneficial for the design and optimization of high-throughput software defined FEC decoding platforms.

Keywords

SDR FEC GOPS Computational complexity Convolutional code Turbo LDPC Viterbi Layered decoding BCJR 

1 Introduction

High-throughput Software Defined Radio (SDR) platforms face extremely high computational complexity in FEC decoding. C.H. Kees [1] presented a rough estimate of the FEC workload for current standards, showing graphically that approximately 100-3000 Operations per Bit (OP/bit) are needed, which occupies roughly 40% of the baseband workload. Attainable throughputs for LDPC have been reported on GPU [2, 3], the IBM Cell processor [4] and DSP [5]. Turbo decoding throughputs are provided for GPU platforms [6, 7] and a DSP platform [8]. Cycle counts for Viterbi decoding are provided for general purpose processors [9, 10], the SORA platform [9], the SODA platform [10] and a DVB-T receiver [11]. The arithmetic operations for Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) and Turbo codes are provided and compared in [12], which shows that QC-LDPC codes have significantly lower complexity than Turbo codes. However, these results are provided without sufficient proof, especially for arbitrary configurations. Apart from SDR evaluations, hardware-based FEC complexity is given by [13] and [14], although individual hardware results may be inaccurate for SDR platforms.

Among these proposals, little theoretical analysis of complexity is provided, so the relationship between the complexity and the algorithm remains unclear. Moreover, because most cases are implementations with dedicated configurations, a general complexity expression for arbitrary configurations (such as code length and parallelism) has not been exposed clearly, which hinders the design and optimization of SDR platforms. For algorithm researchers, the complexity results are also necessary for designing highly efficient low-complexity codes.

Based on these requirements, this work aims to obtain a likely Giga Operations per Second (GOPS) or OP/bit lower bound using analytical methods. The 'likely lower bound' refers to an approximate lower bound that includes the most probable memory access and parallel execution procedures, which is more valuable for platform designers. The work consists of the following aspects.

Firstly, most papers addressing FEC decoding describe only the algorithm level, which lacks the accuracy needed for complexity evaluation. Addressing, memory access and the shuffling network must be considered in addition to the arithmetic computation, and the parallel execution procedure for high-throughput decoding needs to be taken into account as well. For these reasons, hardware-friendly, highly accurate pseudo-codes are required and provided in this work.

Secondly, there are many alternative decoding algorithms, and it is hard to cover them all. In this paper, the mainstream kernel functions and alternative decoding algorithms are listed and compared, and the ones with the minimum number of operations are selected. Unlike papers that provide experimental results based on hardware architecture details, this work tries to minimize hardware specifics and provides a platform-independent result. The evaluation is performed at the basic operation level (the basic operations are listed in Table 3) instead of the instruction level or hardware circuit level.

Thirdly, after the complexity of each algorithm is derived, the different coding schemes are compared using parameters from wireless standards, which offers a general overview of decoding efficiency. In addition, the operations of each decoding step are quantified.

The paper is organized as follows. Section 2 lists the pseudo-codes and describes the decoding procedures. Section 3 provides the operation analysis for each algorithm. Section 4 compares the algorithms and illustrates the percentage of operations spent on each decoding step. Section 5 describes the reference prototypes and the assumptions for attaining the proposed results. Section 6 compares the proposed results with state-of-the-art SDR platforms.

2 Pseudo-Codes Analysis

To calculate the operations of each decoding step, hardware-friendly pseudo-codes are provided first. Code sections marked 'for x do in parallel' can be executed by independent Processing Elements (PEs) in parallel. Variables stored in memory are denoted with a bar over the variable name; they require load and store procedures to access. Variables without a bar are register variables, which are only valid in the current parallel block and are usually small. On a multi-PE platform, all memory variables except the shared variable \(\overline {L}(i, n)\), which is accessed by all PEs, can be stored locally in each PE.

Let \(\mathbf{x}=[x_{0},x_{1},\cdots,x_{N-1}]\) be the transmitted information codeword and \(\mathbf{y}=[y_{0},y_{1},\cdots,y_{N-1}]\) be the received codeword. Let \(LLR(n)=\log(P(x_{n}=1\mid y_{n})/P(x_{n}=0\mid y_{n}))\) be the log-likelihood ratio (LLR) of each received message, which is the input of the algorithm. The decoded bits, the output of the algorithm, can be viewed as an estimate of x and are denoted \(\hat {\mathbf {x}}=[\hat {x}_{0},\hat {x}_{1},\cdots ,\hat x_{N-1}]\).
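As a concrete illustration of the input LLR, assume BPSK mapping (bit 0 → −1, bit 1 → +1) over an AWGN channel with noise variance σ² and equal priors (a channel model chosen here for illustration, not fixed by the text). The LLR then reduces to a scaling of the received sample, which the sketch below cross-checks against the Gaussian likelihoods:

```python
import math

def llr_bpsk_awgn(y, sigma2):
    """Closed form: LLR(n) = log(P(x=1|y)/P(x=0|y)) = 2y/sigma^2
    for BPSK (0 -> -1, 1 -> +1) over AWGN with equal priors."""
    return 2.0 * y / sigma2

def llr_direct(y, sigma2):
    # Direct evaluation from the two Gaussian likelihoods, for cross-checking.
    p1 = math.exp(-(y - 1.0) ** 2 / (2.0 * sigma2))
    p0 = math.exp(-(y + 1.0) ** 2 / (2.0 * sigma2))
    return math.log(p1 / p0)
```

Both forms agree to floating-point precision, so a decoder front end can use the cheap scaling directly.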

2.1 LDPC Decoding

Many high-throughput wireless standards such as IEEE 802.16e and IEEE 802.11n apply Quasi-Cyclic LDPC (QC-LDPC) codes because of their intrinsic suitability for highly parallel execution. Hence we confine our research mainly to QC-LDPC. The efficient algorithms for QC-LDPC decoding are Belief-Propagation (also called Message Passing) and layered decoding [15]. Layered decoding updates the variable nodes as soon as the check-node updates for the current row are calculated, instead of waiting for all equations, saving about half of the iterations. Hence layered decoding is adopted for the following analysis, shown in Algorithm 1.

In Algorithm 1, a QC-LDPC code QC(J,N) is processed, which has N decoded bits, J variable nodes (channel messages) and M check nodes (M=J−N). It can be described by a parity-check matrix \(\mathbf{H}_{M\times J}\). The matrix H can be divided into L×C sub-matrices of degree Z, in which L=M/Z and C=J/Z. \(\mathbf{HB}_{L\times C}\) represents the base matrix, whose elements are either zero or equal to the cyclic-shift value of the identity matrix. Let \(\mathbf{S}_{r,k}\) represent the kth non-zero element (NZE) in row r of HB; the corresponding shift value is \(\mathbf{HB}_{r,\mathbf{S}_{r,k}}\), abbreviated \(\mathbf{R}_{r,k}\). \(S_{r}\) denotes the number of NZEs in row r. A p-layer consists of Z layers which can be processed in parallel.

The decoding is performed iteratively, with a maximum of \(it_{max}\) iterations. In each iteration, L p-layers are processed one by one, where r denotes the index of the p-layer currently being processed. Within each p-layer, the NZEs are processed sequentially, as shown in Fig. 1.
Figure 1

QC-LDPC parity check matrix and layered decoding routine.

The check-node update can be realized by Forward-Backward Recursion (FBR), and the principle can be briefly summarized as follows. The LLR of the modulo-2 addition of two LLR values is given by \(LLR(x\oplus y)=f(x,y)=\log(1+e^{x+y})-\log(e^{x}+e^{y})\) [16]. The LLR for multiple inputs can be calculated with the 2-input f(x,y) function as \(f(x_{1},x_{2},\cdots,x_{n})=LLR(x_{1}\oplus x_{2}\oplus\cdots\oplus x_{n})=LLR(((x_{1}\oplus x_{2})\oplus x_{3})\oplus\cdots\oplus x_{n})=f(f(f(x_{1},x_{2}),x_{3}),\cdots,x_{n})\). The check-node update can be represented by \({\Lambda }_{r,k}^{z} = LLR({\gamma _{1}^{z}} \oplus {\gamma _{2}^{z}} \oplus \cdots \oplus \gamma _{k-1}^{z} \oplus \gamma _{k+1}^{z} \oplus {\cdots } \oplus \gamma _{S_{r}-1}^{z}) = f({\gamma _{1}^{z}}, {\gamma _{2}^{z}}, \cdots , \gamma _{k-1}^{z}, \gamma _{k+1}^{z}, \cdots , \gamma _{S_{r}-1}^{z})\). The combined LLR of the first k messages is \({\alpha _{k}^{z}}=f(f(f({\gamma _{1}^{z}}, {\gamma _{2}^{z}}), {\gamma _{3}^{z}}),\cdots , {\gamma _{k}^{z}})\), which can be computed by Forward Recursion (FR) using \({\alpha _{k}^{z}}= f(\alpha _{k-1}^{z}, {\gamma _{k}^{z}})\). The combined LLR of the last messages, originally \({\beta _{k}^{z}}=f(f(f(\gamma _{S_{r}-1}^{z}, \gamma _{S_{r}-2}^{z}), \gamma _{S_{r}-3}^{z})\cdots, {\gamma _{k}^{z}})\), can be obtained by Backward Recursion (BR) following \({\beta _{k}^{z}}= f(\beta _{k+1}^{z}, {\gamma _{k}^{z}})\). Finally the check-node update is performed by \({\Lambda }_{r,k}^{z} = f({\alpha _{k}^{z}}, \beta _{k+1}^{z})\). Further information on FBR can be found in [17] and [16].
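The FBR principle can be sketched in Python. The model below is illustrative (it is not the paper's Algorithm 1 and ignores the memory/permutation steps): it computes every all-but-one combination \({\Lambda}_k\) with O(n) evaluations of the exact 2-input f(x,y) instead of O(n²):

```python
import math
from functools import reduce

def f(x, y):
    # Exact 2-input LLR "box-plus": LLR(x XOR y) [16].
    # Valid for moderate magnitudes (no overflow guard).
    return math.log(1.0 + math.exp(x + y)) - math.log(math.exp(x) + math.exp(y))

def check_node_fbr(gamma):
    """Forward/backward recursion over one check-node row:
    Lambda[k] combines every gamma[j] with j != k."""
    n = len(gamma)
    alpha = [gamma[0]] * n                 # alpha[k] = f(gamma_0,...,gamma_k)
    for k in range(1, n):
        alpha[k] = f(alpha[k - 1], gamma[k])
    beta = [gamma[-1]] * n                 # beta[k] = f(gamma_k,...,gamma_{n-1})
    for k in range(n - 2, -1, -1):
        beta[k] = f(beta[k + 1], gamma[k])
    lam = [0.0] * n
    lam[0] = beta[1]                       # edge cases: only one side exists
    lam[-1] = alpha[n - 2]
    for k in range(1, n - 1):
        lam[k] = f(alpha[k - 1], beta[k + 1])
    return lam
```

The recursion reuses partial combinations exactly as the FR/BR steps in the text describe, which is why each extrinsic message costs only one extra f evaluation.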

In the FR part, Lines 7 and 8 can instead be executed by a block data read and a block data permutation, where Z data are fetched together in natural order. The data are then shifted by a permutation network (cyclic shifters), given by \(I_{org}^{z} \leftarrow \overline {L}(z, \mathbf {S}_{r,k}), z\in [0, Z-1]\), and \(I^{z} \leftarrow {\Phi }(I_{org}^{z}, \mathbf {R}_{r, k})\). Line 24 (along with Line 17) can likewise be executed as a block transmission, given by \(O_{org}^{z} \leftarrow {\Phi }^{-1}(O^{z}, \mathbf {R}_{r,k})\), and \(\overline {L}(z,\mathbf {S}_{r,k}) \leftarrow O_{org}^{z}\).

In Algorithm 1 and the following algorithms in this article, note that:
  1. The notation \({\leftarrow }\) represents 'assigned to': the calculation result on the right side updates the variable on the left side.
  2. The register update procedure and the memory access procedure are represented separately. For example, the forward path update \(\overline {\alpha }_{k}^{z}\leftarrow f(\overline {\alpha }_{k}^{z},\gamma ^{z})\) is split into \(\overline {\alpha }_{k}^{z} \leftarrow \alpha ^{z}\) and \(\alpha ^{z} \leftarrow f(\alpha ^{z},\gamma ^{z})\), where \(\overline {\alpha }_{k}^{z}\) is a memory variable with k as the address. If there is no need to store to memory (as for β), no memory access procedure is needed.
  3. The subscript k may be omitted for variables within the kth recursion body when no confusion arises.
There is an alternative decoding approach, the improved Min-Sum algorithm, in which the iteration process of Algorithm 1 is substituted by Algorithm 2. In the forward recursion, only five values are recorded: the maximum and second maximum a-priori messages, the index of the maximum value, the sign of each message, and the product of all signs. The backward recursion part is redesigned in a non-recursive way: each a-posteriori message is assigned the maximum or second maximum value, together with the product of the signs of all other messages. It is a reduced-complexity method and is beneficial for hardware design. However, on a programmable platform the branch and sign operations may consume more resources, so Algorithm 2 introduces more workload than Algorithm 1. Hence Algorithm 1 is applied for the complexity evaluation.
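The two-extremes idea behind Algorithm 2 can be sketched as follows. This is an illustrative version in the conventional min-sum form (tracking the two smallest magnitudes and the sign product; in the text's formulation the tracked extremes are phrased as maxima), not the paper's pseudo-code:

```python
def min_sum_row(gamma):
    """Min-sum check-node update with the two-extremes trick:
    one pass records min1, min2, argmin and the overall sign product,
    then every extrinsic Lambda_k is assigned without a second recursion.
    Assumes no exact zeros in gamma."""
    sign_prod = 1
    min1 = min2 = float("inf")
    idx1 = -1
    for i, g in enumerate(gamma):
        if g < 0:
            sign_prod = -sign_prod
        m = abs(g)
        if m < min1:
            min2, min1, idx1 = min1, m, i
        elif m < min2:
            min2 = m
    lam = []
    for i, g in enumerate(gamma):
        mag = min2 if i == idx1 else min1       # exclude own magnitude
        s = sign_prod * (1 if g >= 0 else -1)   # product of all other signs
        lam.append(s * mag)
    return lam
```

The branch in the assignment loop is exactly the kind of per-element conditional that, as the text notes, can cost more on a programmable platform than the plain recursions of Algorithm 1.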

2.2 Turbo Decoding

Turbo codes are an efficient coding technique approaching the Shannon limit. Turbo meets the needs of high-throughput wireless applications through its parallel decoding ability, and has been widely adopted in standards such as 3GPP-LTE(A), HSPA, CDMA2000 and IEEE 802.16e. In this work, only the 8-state Parallel Concatenated Convolutional Code (PCCC) Turbo is considered, because most of the widely adopted standards listed above are based on it.

In the following, the BCJR (Bahl-Cocke-Jelinek-Raviv) algorithm with parallel processing is investigated, shown in Algorithm 3. Let N denote the information message length, which is encoded into 3N transmitted bits. Let \(\overline {L}(i, n),i=0,1,2\) denote the 3nth to (3n+2)th data in LLR(n). Because each iteration consists of two constituent maximum a-posteriori (MAP) decodings, we introduce a variable r=0,1 to distinguish the first half iteration (MAP1) from the second (MAP2). The first MAP decoding uses \(\overline {L}(0, k), \overline {L}(1, k)\) and the a-posteriori messages from MAP2, \(\overline {L}(i_{0}, k)\), as input, and \(\overline {L}(i_{1}, k)\) and \(\overline {L}(i_{L}, k)\) as output, where k is the current FBR step. The second MAP decoding uses \(\overline {L}(0, k_{inv}), \overline {L}(2, k)\) and \(\overline {L}(i_{1}, k_{inv})\) as input, and \(\overline {L}(i_{0}, k)\) and \(\overline {L}(i_{L}, k)\) as output. \(i_{0}\) and \(i_{1}\) are arbitrary constants used as bank labels, such as \(i_{0}=3\), \(i_{1}=4\). \(\overline {L}(i_{L}, k)\) is only required in the last MAP2 procedure; in that case Line 37 is applied instead of Line 36 in Algorithm 3, otherwise Line 36 is skipped.

In parallel decoding, the received codewords are split into P slices, each of length L. P a-posteriori messages are read from a group of memory banks with both sequential and interleaved addresses. The data are arranged in a sequential pattern: \(\overline {L}(i, n)\) is stored in the ⌊n/L⌋th bank at address mod(n,L). If accessed with an interleaved address, the output should be reordered by a permuter. The interleaved address may also incur a memory access conflict when more than one datum being fetched is located in the same memory bank. A contention-free interleaver, such as the Quadratic Permutation Polynomial (QPP) interleaver applied in 3GPP-LTE(A), avoids such conflicts when N is divisible by the parallelism P [18, 19]. In this case, the interleaved address of each memory datum is \({\prod }_{r}(n)=mod(Q_{r}(n), L)\), and the permutation route of each datum is \({\coprod }_{r}(n)= Q_{r}(n)/L\), where \(Q_{1}(n)=n\) and \(Q_{2}(n)\) is the QPP interleaving function. For other interleavers, the data can be rearranged to achieve conflict freedom using a graph coloring algorithm [20] or an annealing procedure [21], from which the addresses \({\prod }_{r}(n)\) and permutation routes \({\coprod }_{r}(n)\) are obtained as well. The permutation routes are needed for both MAP1 and MAP2, and the memory access and permutation procedures remain the same as in the conflict-free case.
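The contention-free property can be checked with a minimal sketch. The parameters below assume the 3GPP-LTE QPP polynomial for the smallest block size, N=40 with (f1, f2)=(3, 10); the bank/offset mapping follows the sequential arrangement described above:

```python
def qpp(n, N, f1, f2):
    # QPP interleaving function: Q2(n) = (f1*n + f2*n^2) mod N
    return (f1 * n + f2 * n * n) % N

# Assumed 3GPP-LTE parameters for block length N = 40
N, f1, f2 = 40, 3, 10
P = 5              # parallelism: P slices of length L
L = N // P

def bank(addr):
    # Datum n lives in bank n // L at offset n % L (sequential arrangement)
    return addr // L

# Contention check: at every recursion step k, the P PEs fetch their
# interleaved data from P distinct banks, so no access conflict occurs.
for k in range(L):
    banks = {bank(qpp(k + p * L, N, f1, f2)) for p in range(P)}
    assert len(banks) == P
```

If the loop completes without an assertion failure, every window step is conflict-free, which is what allows the P permuted reads to be issued in a single cycle.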

The BCJR algorithm is applied in each MAP half iteration. It consists of three main steps: FR, BR and extrinsic a-posteriori calculation. In the forward recursion, the branch metrics are calculated and the previous α metrics are fetched. The α for each trellis step k is then calculated with the kernel function and stored. In addition, the input data for all k are stored locally so they can be supplied to the BR. BR begins at the end of FR, and β is calculated; the a-posteriori and extrinsic messages for the next half iteration are then obtained. In each step, S represents the current trellis state, and \(S^{\prime }_{i},i=0,1\) represent the previous states connected to the current state. The function \(Trellis(S^{\prime }_{i}, S)\) outputs the information bit u, the first output branch v 0 (v 0 equals u when the component code is a Systematic Convolutional Code) and the second output branch v 1. \((S,S^{\prime })\in \sum +\) denotes all branches which output bit 1, and \((S,S^{\prime })\in \sum -\) denotes all branches which output bit 0.

In parallel decoding, α and β are discontinuous when the codeword is separated into SISOs and decoded in parallel. Therefore the Next Iteration Initialization (NII) [7] method, also called State Metric Propagation (SMP) [19], is applied. The FR initial value \(\overline {\alpha }_{0}^{p}\) is the final FR value \(\overline {\alpha }_{L-1}^{p-1}\) of the previous iteration, and the BR initial value \(\overline {\beta }_{L-1}^{p}\) is the final BR value \(\overline {\beta }_{0}^{p+1}\) of the previous iteration, as depicted in Fig. 2. If there is no previous iteration, the initial values are set to zero. The messages may come from neighbouring PEs, which leads to inter-PE data transfers or global memory accesses.
Figure 2

Turbo decoding NII message passing among processing elements.

2.3 Convolutional Code Viterbi Decoding

Let CC(m,1,M) denote a CC with code rate 1/m, N information bits (decoded bits) and constraint length M (\(T_{s}=2^{M-1}\) states). The algorithm description is shown in Algorithm 4. The channel messages are split into several blocks with overlapping areas the size of the traceback (TB) length, and decoded by separate PEs. The overlapping area serves the traceback procedure of each PE.

Branch Metric Update (BMU) and Add-Compare-Select (ACS) are the two main kernels in FR. In each FR step k, the BMU obtains the Euclidean distance between the received messages and the local trellis outputs. These distances (branch metrics) are added to the previous path metrics. The two branches \(S_{0}^{\prime }\) and \(S_{1}^{\prime }\) connecting to the current state S are compared, and the one with the smaller metric is selected (the transition bit is recorded as sel). After k reaches \(L+L_{TB}\), the traceback procedure starts. The survival state calculated in the previous step addresses the survival path memory to fetch the transition bit β. The survival state is then updated using the previous state and the current β. This procedure is performed recursively. Finally, the Least Significant Bit (LSB) of \(S^{p}\) is stored back as the decoded bit.
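The BMU/ACS recursion and the traceback can be illustrated with a toy decoder. The sketch below uses the classic rate-1/2, M=3 "(7,5)" code (an example choice, not the paper's configuration) and hard-decision Hamming metrics in place of Euclidean distances:

```python
INF = float("inf")
G = (0b111, 0b101)       # generator polynomials of the (7,5), M = 3 code
M = 3
NSTATES = 1 << (M - 1)   # T_s = 2^(M-1) = 4 states

def parity(x):
    return bin(x).count("1") & 1

def transition(state, b):
    # state holds the last M-1 input bits (newest in the LSB)
    reg = (state << 1) | b
    out = tuple(parity(reg & g) for g in G)
    return out, reg & (NSTATES - 1)

def encode(bits):
    state, out = 0, []
    for b in bits:
        sym, state = transition(state, b)
        out.extend(sym)
    return out

def viterbi(symbols):
    """Hard-decision Viterbi: BMU + ACS forward recursion, then traceback
    from state 0 (the codeword is assumed terminated with M-1 zero bits)."""
    pm = [0.0] + [INF] * (NSTATES - 1)   # path metrics, start in state 0
    history = []
    for sym in symbols:
        new_pm, bp = [INF] * NSTATES, [None] * NSTATES
        for s in range(NSTATES):
            if pm[s] == INF:
                continue
            for b in (0, 1):
                out, ns = transition(s, b)
                # BMU (Hamming distance) + Add
                metric = pm[s] + sum(o != r for o, r in zip(out, sym))
                if metric < new_pm[ns]:          # Compare-Select
                    new_pm[ns], bp[ns] = metric, (s, b)
        pm = new_pm
        history.append(bp)                       # survival path memory
    s, bits = 0, []
    for bp in reversed(history):                 # traceback
        s, b = bp[s]
        bits.append(b)
    return bits[::-1]
```

A noiseless codeword decodes back to the input exactly, and the free distance of 5 lets the decoder absorb isolated channel errors.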

2.4 Low Complexity Kernel Functions

The original kernel function f(x,y) for LDPC and Turbo is hard to implement in fixed-point formats and is sensitive to quantization error. Therefore several frequently used low-complexity approximations are discussed here in detail. For LDPC, f(x,y) can be written as the sum of a linear part and a correction part, \(f(x,y)=f_{b}(x,y)+f_{c}(x,y)\). Following the Jacobian approach, \(f_{b}(x,y)=sign(x)\cdot sign(y)\cdot min(|x|,|y|)=max(x+y,0)-max(x,y)\), and \(f_{c}(x,y)=\log(1+e^{-|x+y|})-\log(1+e^{-|x-y|})\). \(f_{c}(x,y)\) can be implemented by a Look-Up Table (LUT) or polynomial approximation, shown as \(f_{c}^{L1}\) and \(f_{c}^{L2}\) in Table 2. The Turbo kernel function is part of the LDPC kernel, and the low-complexity methods applied to LDPC are also suitable for Turbo.
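This decomposition can be verified numerically. The short check below is an illustrative sketch (not part of the paper's evaluation) confirming that \(f_{b}+f_{c}\) reproduces the exact kernel and that the two expressions for the linear part agree:

```python
import math

def f_exact(x, y):
    # Exact 2-input LLR kernel: log(1+e^(x+y)) - log(e^x + e^y)
    return math.log(1.0 + math.exp(x + y)) - math.log(math.exp(x) + math.exp(y))

def f_b(x, y):
    # Linear (min-sum) part: max(x+y, 0) - max(x, y)
    return max(x + y, 0.0) - max(x, y)

def f_c(x, y):
    # Correction part: log(1+e^-|x+y|) - log(1+e^-|x-y|)
    return (math.log(1.0 + math.exp(-abs(x + y)))
            - math.log(1.0 + math.exp(-abs(x - y))))
```

Because the correction part is bounded and slowly varying, it is exactly the piece that Table 2 replaces with a LUT or polynomial.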

If only the linear part is used, with no correction part, the result is the min-sum algorithm. A refined version of min-sum is the offset-min-sum algorithm, defined as
$${\Lambda}_{k}=\prod\limits_{i \neq k, i\in 0:S_{r}-1}sign(\gamma_{i})\cdot \max\left\{\min\limits_{i\neq k,i\in 0:S_{r}-1}|\gamma_{i}|-\mu,0\right\}$$
where μ is a small positive constant. Similarly, the scaled-min-sum algorithm is an alternative approach, given by
$${\Lambda}_{k}=\prod\limits_{i \neq k, i\in 0:S_{r}-1}\mu \cdot sign(\gamma_{i})\cdot \min\limits_{i\neq k,i\in 0:S_{r}-1}|\gamma_{i}|$$
where μ is a scaling factor (less than 1). With these methods, the extrinsic messages Λ are updated with the offset-/scaled-min-sum function, whereas the FR messages α and BR messages β are updated with the original min-sum function.
In summary, the base part f b (x,y) for the three algorithms can be chosen from Table 1, and the correction part f c (x,y) can be chosen from Table 2.
Table 1 Base part alternatives in kernel function.

Algorithm | Name    | Function
Turbo     | f_b^T   | max(x,y)
LDPC      | f_b^L1  | max(x+y,0) − max(x,y) [15]
          | f_b^L2  | sign(x)·sign(y)·min(|x|,|y|) [22]
          | f_b^L3  | sign(x)·sign(y)·max(min(|x|,|y|)−μ, 0)
          | f_b^L4  | μ·sign(x)·sign(y)·min(|x|,|y|)
Viterbi   | f_b^V   | min(x,y)

Table 2 Correction part alternatives in kernel function.

Algorithm | Name    | Function
Turbo     | f_c^T1  | max(5/8 − |x+y|/4, 0)
          | f_c^T2  | log(1+e^(−|x−y|)) [23]
          | f_c^T3  | 0
LDPC      | f_c^L1  | max(5/8 − |x+y|/4, 0) − max(5/8 − |x−y|/4, 0) [15]
          | f_c^L2  | log(1+e^(−|x+y|)) − log(1+e^(−|x−y|)) [22]
          | f_c^L3  | 0
Viterbi   | f_c^V   | 0

3 Computational Analysis

3.1 Proposed Evaluation Method and Platform Independent Assumptions

The computational complexity is evaluated in Giga Operations per Second (GOPS) or Operations per decoded bit (OP/bit), where an 'operation' refers to one of the basic operations listed in Table 3. These are the hardware-architecture-independent basic computation units from which the pseudo-codes are constructed, and each of them costs one operation. The number of operations can therefore be derived from the pseudo-codes based on these unit operations. One operation may require several instructions, depending on the instruction-set specification. The 'Complex arithmetic' category in the table is a special class of computations requiring considerably more computational resources and floating-point support; it is used only for comparing kernel function alternatives.
Table 3 List of unit operations for the computational complexity evaluation.

Category               | Operation definition
Arithmetic computation | c=a+b, c=a−b, c=a·b, c=|a|
                       | {c,idx}=max(a,b), {c,idx}=min(a,b)
                       | c=−a, c=sign(a), c=a·const
                       | c=a/2, c=a/4
Complex arithmetic     | c=log(x), c=e^x
LUT access             | c=LUT(a,b), c=LUT(a)
Memory access          | load a datum, store a datum, permute a datum
Branch                 | if-else

Note: Although constant multiplication may require multiple pipeline stages, when data dependencies are properly scheduled, one equivalent clock cycle (one operation) is attainable.

There are several assumptions needed to achieve a platform-independent evaluation:
  1. To unify the evaluation of branch cost, all loops are assumed to be unrolled, so the loop branch overhead is zero;
  2. The memory capacity is assumed to be sufficient;
  3. A permutation can be realized by a load-like instruction with the permutation route as the offset address, which costs one operation. The permutation of a set of data may also be executed by dedicated hardware such as a crossbar network;
  4. The data are assumed to be in floating-point format, or fixed-point format with sufficient data width, hence no overflow protection is included.

In the following analysis, the number of operations of the kernel function f(x,y) alternatives is analysed first, then the number of operations of a single FR/BR step is evaluated. Combined with the number of FR/BR iterations, the total decoding complexity is finally derived.

3.2 LDPC Decoding Computational Complexity Analysis

(1) Kernel operation

The operations for the possible kernel functions are listed in Table 4. The minimum number of operations is 4, and every kernel can be calculated within 9 operations. The original function requires exponential and logarithm calculations, hence approximation kernels built from simple operations and LUTs are recommended. For calculating \(f_{c}^{L1}\) and \(f_{c}^{L2}\), LUTs for \(max(\frac {5}{8}-\frac {|x+y|}{4}, 0)\), \(max(\frac {5}{8}-\frac {|x-y|}{4}, 0)\), \(\log(1+e^{-|x+y|})\) and \(\log(1+e^{-|x-y|})\) are assumed to be available. The addition connecting \(f_{b}\) (linear part) and \(f_{c}\) (correction part) is counted in. In the following, the 'min-sum' solution with f=4 is chosen for the overall decoding complexity estimation.

(2) Recursion kernel

We divide the algorithm into an FR part and a BR part and estimate them separately. In Algorithm 1, Lines 7 to 15 form the FR part. Table 5 summarizes the operations of the FR part, where block transmission is applied. The total number of operations in the FR part is FR=2+(6+f)⋅Z.

Lines 16 to 26 of Algorithm 1 form the BR part. Table 6 summarizes these operations; the number of operations in the BR part is BR=(6+2f)⋅Z.

(3) Loop structure and total operations

There are up to \(it_{max}\) iterations. In each iteration, L p-layers (rows of HB) are processed sequentially, each row containing \(S_{r}\) NZEs. For an irregular code, \(S_{r}\) is aligned to the maximum number of NZEs among all p-layers. This approximation is beneficial for parallel FR/BR alignment, and the \(S_{r}\) difference among p-layers is at most one for the wireless standards IEEE 802.11n and IEEE 802.16e. Therefore the total number of decoding operations is
$$\begin{array}{@{}rcl@{}} OP_{LDPC}&=&it_{max}\cdot L \cdot S_{r} \cdot (FR+BR) \\ &=&it_{max}\cdot L \cdot S_{r} \cdot (2+(12+3f)\cdot Z) \end{array} $$
(1)
Table 4 Operations for calculating LDPC f(x,y) kernels.

Name            | Kernel            | Summary                                    | Total
Original        | f(x,y)            | add/sub=4, exp=3, log=2                    | 9
Min-sum         | f_b^L1 (+f_c^L3)  | max=2, add/sub=2                           | 4
LUT-min-sum     | f_b^L2 (+f_c^L3)  | abs=2, sign=2, min=1, mul=2                | 7
                | f_b^L1 + f_c^L1   | base=4, LUT=2, add/sub=2                   | 8
                | f_b^L1 + f_c^L2   | base=4, LUT=2, add/sub=2                   | 8
Offset-min-sum  | f_b^L3            | abs=2, sign=2, max=1, min=1, sub=1, mul=2  | 9
Scaled-min-sum  | f_b^L4            | abs=2, sign=2, min=1, mul=3                | 8
Table 5 Operations for LDPC forward recursion kernel.

Type               | Equation                                                                                          | Operations
Address LUT        | \(\mathbf{R}_{r,k}\), \(\mathbf{S}_{r,k}\)                                                        | 2
Address comp.      | -                                                                                                 | 0
Memory read        | \(I_{org}^{z} \leftarrow \overline {L}(z, \mathbf {S}_{r,k})\), \(\rho ^{z} \leftarrow \overline {\Lambda }_{r,k}^{z}\) | Z×2
Memory write       | \(\overline {\gamma }_{k}^{z} \leftarrow \gamma ^{z}\), \(\overline {\alpha }_{k}^{z} \leftarrow \alpha ^{z}\)          | Z×2
Permutation        | \(I^{z} \leftarrow {\Phi }(I_{org}^{z}, \mathbf {R}_{r,k})\)                                      | Z
Kernel             | \(\alpha ^{z} \leftarrow f(\alpha ^{z},\gamma ^{z})\)                                             | Z×f
Other arith. comp. | \(\gamma ^{z} \leftarrow I^{z} - \rho ^{z}\)                                                      | Z×1

Note: (1) In LDPC decoding, the minimum kernel complexity is f=4. (2) The operations are represented as the number of PEs times the operations in each PE.

Table 6 Operations for LDPC backward recursion kernel.

Type               | Equation                                                                                          | Operations
Address LUT        | -                                                                                                 | 0
Address comp.      | -                                                                                                 | 0
Memory read        | \(\gamma ^{z} \leftarrow \overline {\gamma }_{k}^{z}\), \(\alpha ^{z} \leftarrow \overline {\alpha }_{k}^{z}\)          | Z×2
Memory write       | \(\overline {\Lambda }_{r,k}^{z} \leftarrow {\Lambda }^{z}\), \(\overline {L}(z,\mathbf {S}_{r,k}) \leftarrow O_{org}^{z}\) | Z×2
Permutation        | \(O_{org}^{z} \leftarrow {\Phi }^{-1}(O^{z}, \mathbf {R}_{r,k})\)                                 | Z
f(x,y) Kernel      | \(\beta ^{z} \leftarrow f(\beta ^{z},\gamma ^{z})\), \({\Lambda }^{z} \leftarrow f(\alpha ^{z}, \beta ^{z})\)           | Z×2f
Other arith. comp. | \(O^{z} \leftarrow \gamma ^{z}+{\Lambda }^{z}\)                                                   | Z×1

Note that these statistics include the decoding kernel only; early termination and correctness checks are not counted. The data input and output procedures are assumed to be handled by DMA rather than by the processors, so the load/store cost at the data interface is not included. The derivation assumes Z PEs are available; however, as long as the number of PEs P is less than or equal to Z, the total number of operations is the same, although the Z layers are updated in a partially parallel fashion if PEs are not abundant.

For the maximum-throughput configuration of the IEEE 802.11n standard, L=4, \(S_{r}=20\) and Z=81. With the selected parameters f=4 and \(it_{max}=6\), 934,080 operations are needed for the 1620 decoded bits (1944 channel bits). Hence the computational complexity is 577 OP/bit. For a throughput of 450 Mb/s, 259 GOPS would be consumed.
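The numbers above follow directly from Eq. (1); the sketch below simply evaluates the formula with the paper's 802.11n parameters:

```python
def op_ldpc(it_max, L, S_r, Z, f):
    # Eq. (1): OP = it_max * L * S_r * (FR + BR),
    # with FR = 2 + (6+f)*Z and BR = (6+2f)*Z
    return it_max * L * S_r * (2 + (12 + 3 * f) * Z)

# IEEE 802.11n maximum-throughput configuration from the text
ops = op_ldpc(it_max=6, L=4, S_r=20, Z=81, f=4)
decoded_bits = 1620
op_per_bit = ops / decoded_bits            # about 577 OP/bit
gops_450mbps = op_per_bit * 450e6 / 1e9    # about 259 GOPS at 450 Mb/s
```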

In a similar way, the computational complexity of Algorithm 2 can be calculated. Algorithm 2 consumes 6Z more operations than Algorithm 1 in the check-node update (FBR) kernel. Therefore we recommend Algorithm 1 for software defined (SD) decoding.

3.3 Turbo Decoding Computational Complexity Analysis

Table 7 shows the alternatives for the Turbo kernel function f(x,y). In this table, we assume that LUTs are available for \(max(\frac {5}{8}-\frac {|x+y|}{4}, 0)\) and \(\log(1+e^{-|x-y|})\). Most Turbo implementations apply the max-log-MAP algorithm, which is therefore selected for the following decoding complexity evaluation.
Table 7 Operations for calculating turbo f(x,y) kernels.

Name         | Kernel            | Summary               | Total
log-MAP      | f(x,y)            | exp=2, add=1, log=1   | 4
Max-log-MAP  | f_b^T (+f_c^T3)   | max=1                 | 1
LUT-log-MAP  | f_b^T + f_c^T1/2  | base=1, LUT=1, add=1  | 3

In Algorithm 3, the forward recursion spans Lines 9 to 22. It contains the memory accesses with permutation, the branch metric update and the forward recursion itself. Table 8 summarizes the operations in detail.
Table 8 Operations for turbo forward recursion kernel.

Type          | Equation                                                                                                  | PNum. × OP/Par.
Address LUT   | \(k_{inv}^{p}\leftarrow {\prod }_{r} k^{p}\), \(k_{pmu}^{p}\leftarrow {\coprod }_{r} k^{p}\)              | P×2
Addr. comp.   | \(k^{p}\leftarrow k+pL\)                                                                                  | P×1
Memory read   | \({M_{1}^{p}} \leftarrow \overline {L}(i_{r}, k_{inv}^{p})\), \({M_{2}^{p}} \leftarrow \overline {L}(0, k_{inv}^{p})\), \({M_{3}^{p}} \leftarrow \overline {L}(1+r,k^{p})\) | P×3
Memory write  | \(\overline {\{L_{a}, r_{0}, r_{1}\}}_{k}^{p} \leftarrow \{L_{a}, r_{0}, r_{1}\}^{p}\)                    | P×3
              | \(\overline {\alpha }_{k}^{p}(S) \leftarrow \alpha ^{p}(S)\)                                              | PT_s×1
Permutation   | \(\{L_{a}, r_{0}, r_{1}\}^{p}\leftarrow \{\Phi ({M_{1}^{p}}, k_{pmu}^{p}), {\Phi }({M_{2}^{p}}, k_{pmu}^{p}), {M_{3}^{p}}\}\) | P
Kernel BM     | \(S^{\prime }_{0}\leftarrow S/2\), \(S^{\prime }_{1}\leftarrow S/2+T_{s}/2\)                              | 0
              | \(\{u,v_{0},v_{1}\}^{p}\leftarrow Trellis(S^{\prime }_{i},S)\)                                            | 0
              | \(\gamma ^{p}(S^{\prime }_{i},S)\leftarrow \frac {1}{2}\left (u^{p}\cdot {L_{a}^{p}}+{r_{0}^{p}}\cdot {v_{0}^{p}} + {r_{1}^{p}} \cdot {v_{1}^{p}}\right )\) | P×8
Kernel FR     | \(b^{p}(S_{i}^{\prime },S) \leftarrow \gamma ^{p}(S^{\prime }_{i},S)+\alpha ^{p}(S^{\prime }_{i})\), i=0,1 | 2PT_s×1
              | \(\alpha ^{p}(S)\leftarrow f\left (b^{p}(S_{0}^{\prime },S), b^{p}(S_{1}^{\prime },S)\right )\)           | PT_s×f

Note: (1) PNum. refers to the number of PEs; OP/Par. refers to the operations in each PE. (2) In Turbo decoding, the minimum kernel complexity is f=1.

When the constituent code is systematic, as in the 3GPP standard where u=v 0, a simplified solution for the BM calculation is as follows.

Step 1. Calculate the trellis constraints: \(t_{0}[0]=S[0]\oplus S[2]\), \(t_{0}[1]=S[0]\oplus S[1]\), \(t_{1}[0]=S[0]\oplus S[2]\oplus 1\), and \(t_{1}[1]=S[0]\oplus S[1]\oplus 1\), where ⊕ represents XOR.

Step 2. Obtain \(g_{0}\) and \(g_{1}\) by \(p=L_{a}+r_{0}\), \(g_{0}=(p+r_{1})/2\), and \(g_{1}=(-p+r_{1})/2\).

Step 3. Output the branch messages following Table 10.

Although this method has a similar number of operations to the original method, for all 16 branches (\(T_{s}=8\)) there are only four independent branch metric values associated with the possible \(t_{i}\), and two of them are the negatives of the other two. Therefore the BM calculation in FR/BR consumes 8 operations per step in each PE.
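The four-value claim in the steps above can be checked directly. The sketch below assumes antipodal branch labels u=v0, v1 ∈ {−1, +1} (an illustrative mapping) and verifies that every branch metric falls in {g0, −g0, g1, −g1}:

```python
def gamma_bm(u, v1, La, r0, r1):
    # Original branch metric with u = v0 (systematic constituent code)
    return 0.5 * (u * La + r0 * u + r1 * v1)

def four_metrics(La, r0, r1):
    # Step 2 of the simplified method: only g0 and g1 are computed;
    # the other two values are their negatives.
    p = La + r0
    g0 = (p + r1) / 2.0
    g1 = (-p + r1) / 2.0
    return g0, g1
```

Since only two additions and two halvings produce all four values, the per-step BM cost drops to the 8 operations stated in the text.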

It is derived from Tables 8 and 9 that the total number of FR operations is \((19+(3+f)\cdot T_{s})\cdot P\), and of BR operations \((17+(7+3f)\cdot T_{s})\cdot P\). With the number of half iterations (\(2it_{max}\)), the recursion window length L and the parallelism P (where LP=N), the total number of operations is
$$\begin{array}{@{}rcl@{}} OP_{Turbo}&=&2it_{max}\cdot L \cdot (FR+BR) \\ &=&2it_{max}\cdot N \cdot (36+(10+4f)\cdot T_{s}) \end{array} $$
(2)
Table 9

Operations for turbo backward recursion kernel.

| Type | Equation | PNum. × OP/Par. |
|---|---|---|
| Addr. comp. | \(k^{p}\leftarrow k+pL\) | \(P\times 1\) |
| Memory read | \(\{L_{a}, r_{0}, r_{1}\}^{p} \leftarrow \overline {\{L_{a}, r_{0}, r_{1}\}}_{k}^{p}\) | \(P\times 3\) |
|  | \(\alpha ^{p}(S) \leftarrow \overline {\alpha }_{k}^{p}(S)\) | \(PT_{s}\times 1\) |
| Memory write | \(\overline {L}(i_{1-r}, k_{inv}^{p}) \leftarrow {L_{w}^{p}}\), or \(\overline {L}(i_{L}, k_{inv}^{p}) \leftarrow {L_{w}^{p}}\) | \(P\times 1\) |
| Permutation | \({L_{w}^{p}} \leftarrow {\Phi }^{-1}({L_{e}^{p}}, k_{pmu}^{p})\), or \({L_{w}^{p}} \leftarrow {\Phi }^{-1}({L_{o}^{p}}, k_{pmu}^{p})\) | \(P\) |
| Kernel BM | \(S^{\prime }_{0}\leftarrow 2S\), \(S^{\prime }_{1}\leftarrow 2S+1\) | 0 |
|  | \(\{u,v_{0},v_{1}\}^{p} \leftarrow Trellis(S,S^{\prime }_{i})\) | 0 |
|  | \(\gamma ^{p}(S,S^{\prime }_{i})\leftarrow \frac {1}{2}\left (u^{p}\cdot {L_{a}^{p}}+{r_{0}^{p}}\cdot {v_{0}^{p}} + {r_{1}^{p}} \cdot {v_{1}^{p}}\right )\) | \(P\times 8\) |
| Kernel BR | \(b^{p}(S, S^{\prime }_{i}) \leftarrow \gamma ^{p}(S,S^{\prime }_{i})+\beta ^{p}(S^{\prime }_{i})\), \(i=0,1\) | \(2PT_{s}\times 1\) |
|  | \(\beta ^{p}(S)\leftarrow f\left (b^{p}(S, S^{\prime }_{0}), b^{p}(S, S^{\prime }_{1})\right )\) | \(PT_{s}\times f\) |
| Kernel LLR | \({L_{o}^{p}}\leftarrow f_{(S,S^{\prime })\in {\Sigma }^{+}}W^{p}(S^{\prime }, S) -f_{(S,S^{\prime })\in {\Sigma }^{-}}W^{p}(S^{\prime }, S)\) | \(2PT_{s}\times f\), \(P\times 1\) |
|  | \(W^{p}(S^{\prime }, S) \leftarrow \beta _{old}^{p}(S^{\prime })+\gamma ^{p}(S,S^{\prime })+\alpha ^{p}(S)\) | \(2PT_{s}\times 2\) |
|  | \({L_{e}^{p}} \leftarrow {L_{o}^{p}}-L_{c}{r_{0}^{p}}-{L_{a}^{p}}\), with \(L_{c}=1\) | \(P\times 2\) |

Take 3GPP-LTE(A) Turbo (\(T_{s}=8\)) with N=6144 decoded bits as an example. When the max-log-MAP kernel function (f=1) and \(it_{max}=6\) are selected, the total number of operations for the decoding is 10,911,744, i.e., a computational complexity of 1776 OP/bit (Table 10). For a decoding throughput of 150 Mbit/s, the required computation would be 266 GOPS. Note that in this evaluation, no pipeline stalls caused by data dependencies or memory access conflicts are counted. Inter-PE message passing only happens at the border of the recursion shown in Fig. 2; its complexity contribution is negligible and therefore not included.
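Eq. (2) and the LTE example above can be checked with a few lines; this is only a sketch, and the function name is ours.

```python
def turbo_ops(N, it_max, Ts=8, f=1):
    """Total Turbo decoding operations per Eq. (2):
    2 * it_max * N * (36 + (10 + 4f) * Ts)."""
    return 2 * it_max * N * (36 + (10 + 4 * f) * Ts)

# 3GPP-LTE Turbo, N = 6144, max-log-MAP (f = 1), 6 iterations
total = turbo_ops(N=6144, it_max=6)   # 10,911,744 operations
op_per_bit = total // 6144            # 1776 OP/bit, as stated in the text
```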
Table 10

Complexity-reduced turbo branch metric calculation.

| \(t_{i}[0]\) | \(t_{i}[1]\) | \(b(S^{\prime }_{i}, S)\) |
|---|---|---|
| 0 | 0 | \(g_{0}\) |
| 0 | 1 | \(g_{1}\) |
| 1 | 0 | \(g_{1}\) |
| 1 | 1 | \(g_{0}\) |

If the parallelism P increases, the decoding latency is reduced. In this case the increase in complexity is negligible, whereas the bit error rate performance may degrade due to the discontinuity of the recursion messages. More iterations can be applied to mitigate the degradation, which results in a linear increase of the total operations.

3.4 CC Decoding Computational Complexity Analysis

The distance function \(dist(r,v)\) is originally the Euclidean distance \(dist(r,v) = {\sum }_{i} (r_{i}-v_{i})^{2}, i=0,\cdots ,m-1\). Since \({\sum }_{i} {r_{i}^{2}}\) and \({\sum }_{i} {v_{i}^{2}}\) are the same for all transition branches, these terms can be dropped; the remaining part is \(-2{\sum }_{i} r_{i}\cdot v_{i}\), where the constant factor −2 can also be omitted without changing the relative ordering. In addition, \(v_{i}\) consists of the m local trellis transition bits with at most \(2^{m}\) combinations, half of which can be obtained from the other half by negation (consuming \(2^{m-1}\) operations in total). Therefore the \(2^{m-1}\) possible metrics are calculated in advance, each requiring m−1 additions/subtractions. The \(dist(r,v)\) calculation for all branches of a trellis step thus consumes \((m-1)\cdot 2^{m-1}+2^{m-1}=m\cdot 2^{m-1}\) operations in total. If the pre-calculation is not applied, m multiplications (or selections) and m−1 additions are needed for each branch metric calculation.
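The pre-calculation can be sketched as follows, assuming antipodal branch bits \(v_{i}\in \{+1,-1\}\); only the half of the patterns with a fixed first bit is summed directly, and the complementary half is obtained by negation. Function and variable names are ours.

```python
import itertools

def precompute_branch_metrics(r):
    """Correlation metrics sum(r_i * v_i) for all 2^m branch-bit patterns.

    Only the 2^(m-1) patterns whose first bit is +1 are summed
    (m-1 additions each); the mirrored half is obtained by negation."""
    m = len(r)
    metrics = {}
    for tail in itertools.product((+1, -1), repeat=m - 1):
        v = (+1,) + tail
        val = sum(ri * vi for ri, vi in zip(r, v))
        metrics[v] = val
        metrics[tuple(-b for b in v)] = -val  # negated mirror pattern
    return metrics
```

For m=2 this yields the four metrics of a rate-1/2 trellis step with two additions and two negations, matching the \(m\cdot 2^{m-1}\) count.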

From Tables 11 and 12, we can conclude that for each step there are \(P\cdot (1+m+m\cdot 2^{m-1}+(3+f)\cdot T_{s})\) operations in ACS and \(6.5P\) operations in TB. For a codeword with length N=LP, the decoding procedure contains \(L+L_{TB}\) ACS and TB steps; therefore the overall operations are
$$\begin{array}{@{}rcl@{}} OP_{CC}&=&(7.5+m+m\cdot 2^{m-1}+(3+f)\cdot T_{s}) \cdot P \cdot (L+L_{TB})\\ &=&(7.5+m+m\cdot 2^{m-1}+4 T_{s}) \cdot P \cdot (L+L_{TB}) \end{array} $$
(3)
Table 11

Operations for CC forward path.

| Type | Equation | PNum. × OP/Par. |
|---|---|---|
| Address comp. | \(k^{p}\leftarrow k+pL\) | \(P\times 1\) |
| Memory read | \({r_{i}^{p}} \leftarrow \overline {L}(i,k^{p})\), \(i=0,\cdots ,m-1\) | \(Pm\times 1\) |
| Memory write | \(\overline {TB}_{k}^{p}(S) \leftarrow sel^{p}(S)\) | \(PT_{s}\times 1\) |
| Kernel BMU | \(\mathbf {v}(S^{\prime }_{i}, S) \leftarrow Trellis(S^{\prime }_{i}, S)\) | 0 |
|  | \(\gamma ^{p}(S^{\prime }_{i}, S)\leftarrow dist(\mathbf {l}^{p},\mathbf {v}(S^{\prime }_{i}, S))\), \(i=0,1\) | \(P\times m\cdot 2^{m-1}\) |
| Kernel ACS | \(S^{\prime }_{i}\leftarrow S/2+iT_{s}/2\), \(i=0,1\) | 0 |
|  | \(b^{p}(S^{\prime }_{i},S) \leftarrow \gamma ^{p}(S^{\prime }_{i},S)+\alpha ^{p}(S^{\prime }_{i})\), \(i=0,1\) | \(P\cdot 2T_{s}\times 1\) |
|  | \(\{\alpha ^{p}(S), sel^{p}(S)\}\leftarrow f(b^{p}(S^{\prime }_{0},S), b^{p}(S^{\prime }_{1},S))\) | \(PT_{s}\times f\) |

Note: In CC decoding, the minimum f is 1.

Table 12

Operations for CC backward path.

| Type | Equation | PNum. × OP/Par. |
|---|---|---|
| Address comp. | \(k^{p}\leftarrow k+pL\) | \(P\times 1\) |
| Memory read | \(\beta ^{p} \leftarrow \overline {TB}_{k}^{p}(S^{p})\) | \(P\times 1\) |
| Memory write | \(\hat {x}(k^{p})\leftarrow LSB(S^{p})\) when k<L | \(P\times 1\) |
| Kernel TB | if \(\beta ^{p}=1\) then \(S^{p} \leftarrow S^{p}/2\), else \(S^{p} \leftarrow S^{p}/2+T_{s}/2\) | \(P\times 3.5\) |

For CC(2, 1, 7) (m=2, \(T_{s}=64\)) with N=2048 decoded bits, traceback length \(L_{TB}=35\) and parallelism P=8, the total operations are 627,396, equivalent to 306 OP/bit. For a code with a larger trellis constraint length such as CC(3, 1, 9) (m=3, \(T_{s}=256\)), with N=2048, \(L_{TB}=45\) and P=8, the total operations are 2,519,972, equivalent to 1230 OP/bit.

Because there is an overlapped traceback region in each PE that costs redundant computations, the parallelism P has a noticeable impact on the computational complexity. For the CC(3, 1, 9) example above, the decoding needs 15 % more operations than the single-PE (P=1) configuration.
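Eq. (3), the two examples, and the traceback-overlap overhead can all be reproduced with a short calculation; this is a sketch and the function name is ours.

```python
def cc_ops(N, m, Ts, L_TB, P, f=1):
    """Total Viterbi decoding operations per Eq. (3), with L = N / P."""
    per_step = 7.5 + m + m * 2 ** (m - 1) + (3 + f) * Ts
    return per_step * P * (N // P + L_TB)

# CC(2,1,7): 627,396 ops -> 306 OP/bit; CC(3,1,9): 2,519,972 ops -> 1230 OP/bit
ops_217 = cc_ops(2048, m=2, Ts=64, L_TB=35, P=8)
ops_319 = cc_ops(2048, m=3, Ts=256, L_TB=45, P=8)

# Redundant-overlap overhead of 8 PEs vs. a single PE for CC(3,1,9): ~15 %
overhead = ops_319 / cc_ops(2048, m=3, Ts=256, L_TB=45, P=1) - 1
```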

4 Complexity Comparison

To reveal the relationship of computational complexity among these algorithms, the number of operations is shown in Fig. 3, wherein all the LDPC configurations in the IEEE 802.11n and IEEE 802.16e standards (R=1/2 to R=5/6), the 8-state Turbo (3GPP-LTE Turbo, R=1/3) and CC are compared. Six iterations are applied for LDPC and Turbo decoding. Several conclusions can be drawn from the comparison: (1) for all modes, the decoding complexity is approximately proportional to the number of decoded bits; (2) at the same operation cost, LDPC may offer 2–3 times the throughput of Turbo; (3) the complexity of CC(3, 1, 9) is slightly higher than LDPC with 6 iterations; (4) IEEE 802.11n and IEEE 802.16e LDPC have similar complexity; (5) CC with a small constraint length has the minimum complexity, whereas Turbo decoding consumes far more operations than the other coding types. In Fig. 4, the relationship between throughput and computing cost in GOPS is provided for several typical codes. A 1 Gbps Turbo decoder consumes approximately 2000 GOPS; with other auxiliary workloads it would be 2–3 times more, which is currently difficult to perform on a single chip. In contrast, 1 Gbps LDPC requires less than 600 GOPS, which is easier to realize on SDR platforms. Fewer iterations reduce the complexity linearly, which can be achieved with early termination.
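The GOPS figures in Fig. 4 follow directly from the per-bit complexities; a minimal sketch (function name ours), using the 1776 OP/bit LTE Turbo figure derived earlier:

```python
def required_gops(op_per_bit, throughput_bps):
    """Computing cost (GOPS) needed to sustain a given decoded throughput."""
    return op_per_bit * throughput_bps / 1e9

# LTE Turbo at 1776 OP/bit (6 iterations):
g_1gbps = required_gops(1776, 1e9)     # 1776.0 -> "approximately 2000 GOPS"
g_150m = required_gops(1776, 150e6)    # 266.4 -> the 150 Mbit/s LTE example
```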
Figure 3

Computational complexity comparison among FEC algorithms with different decoded codeword length.

Figure 4

Evaluated total computing cost with respect to throughput.

The computation is composed of several kernel tasks, and the operations for each step are summarized in Table 13, with CC(2, 1, 7), 3GPP-LTE Turbo and Z=96 QC-LDPC codes shown as examples. For Turbo and CC, arithmetic computation occupies around 75 % of the workload, whereas memory access consumes around 25 %. For LDPC, 42 % of the workload belongs to memory access, and only 58 % to arithmetic computation. One reason is that LDPC outputs Z a-posteriori messages in a recursion step, whereas Viterbi or (binary) Turbo outputs only one bit. Considering that LDPC layered decoding consumes approximately 1/3 of the total computation of Turbo BCJR, layered decoding requires only 25 % of the message update computations per bit of BCJR (with the same number of iterations). The Viterbi algorithm consumes most of its computation on ACS because \(T_{s}\) states need to be processed one by one. For an even larger \(T_{s}\), such as in CC(2, 1, 9), the percentage of computation for ACS is even larger.
Table 13

Percentage of component computations.

| Kernel | Operations | Percentage |
|---|---|---|
| Convolutional code |  |  |
| Address transform | 2 | 0.7 % |
| Memory read | 3 | 1.1 % |
| Memory write | 65 | 24.1 % |
| Kernel BMU | 4 | 1.5 % |
| Kernel ACS | 192 | 71.2 % |
| Kernel TB | 3.5 | 1.3 % |
| Sum | 269.5 | 100.0 % |
| Turbo (half iteration) |  |  |
| Address LUT | 2 | 1.4 % |
| Address transform | 2 | 1.4 % |
| Memory read | 14 | 9.5 % |
| Memory write | 12 | 8.1 % |
| Permutation | 3 | 2.0 % |
| Kernel BM | 16 | 10.8 % |
| Kernel FR | 24 | 16.2 % |
| Kernel BR | 24 | 16.2 % |
| Kernel LLR | 51 | 34.5 % |
| Sum | 148 | 100.0 % |
| LDPC (one iteration) |  |  |
| Address LUT | 2 | 0.1 % |
| Memory read | 384 | 16.7 % |
| Memory write | 384 | 16.7 % |
| Permutation | 192 | 8.3 % |
| f(x,y) kernel | 1152 | 50.0 % |
| Arithmetic | 192 | 8.3 % |
| Sum | 2306 | 100.0 % |
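The percentages in Table 13 follow directly from the per-step operation counts. As an example, the CC(2, 1, 7) column (values taken from Tables 11 and 12, per PE per trellis step) can be reproduced as follows; the dictionary layout is ours.

```python
# Per-step operation counts per PE for CC(2,1,7): m = 2, Ts = 64, f = 1
cc = {
    "Address transform": 2,   # forward + backward address computation
    "Memory read": 3,         # m soft inputs plus one traceback read
    "Memory write": 65,       # Ts survivor bits plus one decoded bit
    "Kernel BMU": 4,          # m * 2^(m-1) distance operations
    "Kernel ACS": 192,        # 2*Ts additions + Ts compare-selects (f = 1)
    "Kernel TB": 3.5,         # average traceback step cost
}
total = sum(cc.values())      # 269.5, matching the Table 13 sum
shares = {k: round(100 * v / total, 1) for k, v in cc.items()}
# Kernel ACS dominates the per-step workload, as Table 13 shows
```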

It is revealed from Table 13 that several kernel functions consume a large portion of the computing resources, such as the f(x,y) function in LDPC layered decoding, the LLR calculation in Turbo decoding, and the ACS kernel in the Viterbi algorithm. The operation count would be reduced dramatically if these kernels were accelerated by hardware circuits and operation-fusion instructions.

5 Specific Platform Design with the Proposed Evaluations

For highly parallel platforms, platform related overhead must be taken into account, including inter-core communication, core workload balancing, and synchronization between cores. The decoding algorithms also require conflict-free or conflict-minimized memory access. It is advisable to construct a many-core platform based on ASIC FEC implementations. In ASIC (Application Specific Integrated Circuit) or ASIP (Application Specific Instruction-set Processor) implementations of the proposed pseudo-codes [19, 24, 25, 26], the inter-core data passing is operation-free, the core tasks are well balanced, and synchronization between cores is unnecessary. The memory banks are small on-chip scratch-pad modules, so no cache is needed. All memory access conflicts can be avoided and are therefore not considered. These optimizations for ASIC decoders can be applied to the design of an SD FEC platform.

We also constructed a tri-mode unified ASIP decoder [27] following these pseudo-codes (additional sliding windows are added to reduce the buffer size). Its interconnection network and memory subsystem can be borrowed to construct a fully programmable platform, with processor cores substituted for the arithmetic circuits of the ASIP prototype. A more flexible inter-core network with more redundancy can be introduced without performance degradation if the network and memory structure of the ASIP prototype are retained.

6 Comparison of the Evaluated Results with General Software Defined Decoding Platforms

Apart from the theoretical complexity results and the supporting hardware platforms that can reach these lower bounds, FEC benchmarks are provided to reveal the attainable complexity on feasible processors. Currently, high throughput decoding mainly relies on General Purpose GPU (GPGPU) platforms because of their highly parallel architecture. For such platforms, the peak floating point performance (GFLOPS) is available from the manufacturers. An alternative choice is general DSP platforms with a Very Long Instruction Word (VLIW) architecture, where several instructions may be processed simultaneously, such as the Texas Instruments (TI) TMS320Cx series DSPs; their peak MIPS can be derived from the device datasheets. Apart from the peak performance, the decoding throughput, iteration numbers, and code length are available in the reference papers, so the operations per bit per iteration can be obtained as \(OP/(N\cdot it_{max})\). Meanwhile, the evaluated results are shown with OP given by Eqs. 1, 2 and 3.

Figure 5 shows software defined (SD) Turbo decoding on general purpose platforms. The proposed guideline is approximately 312 OP/bit/iteration for all sizes of decoded bits. Most of the reference approaches run on GPU platforms. Wolf et al. [28] proposed a design space exploration method for SD Turbo, tested on four platforms; with a codeword size of 5000, the complexity is around 10–100 KOP/bit. Other proposals target the N=6144 3GPP-LTE Turbo code and reveal a complexity of 7–45 KOP/bit. Proposal [6] reported that the efficiency can be further improved with multi-codeword parallel decoding, which fully utilizes the GPU resources: the efficiency improves from 9 KOP/bit to 1.8 KOP/bit, and with a high degree of parallelism a throughput of 122 Mb/s is achieved. Apart from that, a TMS320C6201 DSP approach [8] reveals that approximately 6400 OP/bit is required for the HSPA N=5114 code, which is similar to the GPU approaches. For this TMS320C6201 DSP and the later TMS320C64x platform, the peak operations are evaluated as eight times the peak Million Instructions Per Second (MIPS) because the processor has eight processing units. The actual workload derived from these implementations is higher than the proposed results for several reasons: (1) an operation defined in Table 3 may map onto several processor operations (instructions); (2) in our evaluation, extra tasks such as data management, memory conflict management, control overhead, and thread synchronization are not taken into account; (3) the device's peak GFLOPS is assumed to be fully spent on the decoding procedure, whereas making full use of all computation resources on a chip is unrealistic. Nevertheless, the reference designs reveal the complexity attainable today, and an alternative hardware architecture may improve the decoding efficiency further.
Figure 5

Computational complexity of SD Turbo implementations. The solution references are M. Wolf [28], D. Yoge [29], M. Wu [30], Y. Song [8], X. Jiao [6], and G. Wang [7].

Depending on the code rate and the sparsity of the base matrix, the LDPC decoding complexity varies across configurations. However, most configurations fall within a 'band', shown in Fig. 6, ranging from 88 OP/bit to 162 OP/bit for all the configurations in IEEE 802.11n and IEEE 802.16e. Most of the proposed solutions achieve a complexity of 500–2300 OP/bit. Among them, Wen et al. [31] proposed a min-sum layered decoder reaching up to 507 Mbit/s (2 iterations) with early termination, the highest throughput among all SD LDPC decoders, at a complexity of 1062 OP/bit. G. Wang et al. proposed a 304 Mbps solution (10 iterations, 50 codewords on a GTX TITAN GPU) which reaches the lowest computational complexity (493 OP/bit). K.K. Abburi et al. [2] proposed another highly efficient solution with a complexity of 881 OP/bit. The TI TMS320C64x DSP solution [5] shows a complexity similar to the GPU solutions.

For Viterbi decoding, benchmarks for DSP, ARM and Intel processors are summarized. For the TMS320C62x series DSP, the number of instructions for GSM CC(2, 1, 5) decoding is given by (38⋅N+12+N/4)/N [36]. For large N, approximately 38 instructions per bit are required, equivalent to 306 operations/bit due to the parallel architecture with eight processing units. The SPIRAL Viterbi decoder code [38] is applied to evaluate the overall decoding complexity on an ARM processor (ARM Cortex A7) and an Intel processor (Intel Core i7-2600). The code is compiled with GNU GCC at the '-O3' optimization level; for the Core i7 implementation, only a single core is utilized. The evaluation results are shown in Table 14, wherein the peak MIPS is derived from [39]. Because 16-way SSE (Streaming SIMD Extensions) vector instructions can perform up to 16 calculations per instruction, the number of instructions is lower than the proposed guideline. This reveals that VLIW DSPs are efficient for Viterbi processing. For high throughput decoding on Intel processors, a much higher decoding throughput is achieved by enabling SSE acceleration [9]. The complexity of the reference implementations is approximately 2–3 times higher than the proposed complexity guideline.
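The 'Equivalent complexity' column of Table 14 can be reproduced under the stated assumption that each benchmark saturates one core's peak instruction rate; the helper functions below and their names are our own sketch, not code from the referenced benchmarks.

```python
def inst_per_bit(peak_mips_per_core, throughput_mbps):
    # instructions spent per decoded bit, assuming the core runs at peak MIPS
    return peak_mips_per_core / throughput_mbps

def equivalent_op_per_bit(inst, lanes):
    # upper bound: every VLIW/SIMD instruction keeps all `lanes` units busy
    return inst * lanes

i7_sse = round(inst_per_bit(16037, 26.4))       # 607, the Core i7 SSE row
arm = round(inst_per_bit(2850, 0.786))          # 3626, close to the A7 row
dsp = round(equivalent_op_per_bit(38.3, 8))     # 306 OP/bit, 8-unit C62x DSP
```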
Figure 6

Computational complexity of SD LDPC implementations. The reference implementations are G. Falcao [4], GTX 470 G. Wang [32] [33], K.K.Abburi [2], S. Kang [34], G. Falcao (K=4000) [35], GTXTITAN G. Wang [3], X. Wen [31], and TMS320C64x [5].

Table 14

Viterbi computational complexity benchmarks.

| Code type | Platform | Peak performance | Benchmark | Equivalent complexity |
|---|---|---|---|---|
| CC(2, 1, 5), N=189 for GSM | TMS320C62x DSP [36] | 1600 MIPS, 8 PEs | 38.3 Inst/bit | 306 OP/bit (8-parallel VLIW) |
| CC(2, 1, 5), N=189 for GSM | Proposed guideline |  |  | 87 OP/bit |
| CC(2, 1, 7) | Cell Processor [37] | 102.4 GOPS | 30 Mbit/s | 3413 Inst/bit |
| CC(2, 1, 7), N=2048, SPIRAL | Core i7 2600, without SSE | 128,300 MIPS (16,037/core) | 4.1 Mbit/s | 3887 Inst/bit |
| CC(2, 1, 7), N=2048, SPIRAL | Core i7 2600, with SSE | 128,300 MIPS (16,037/core) | 26.4 Mbit/s | 607 Inst/bit (16-parallel SSE) |
| CC(2, 1, 7), N=2048, SPIRAL | ARM Cortex A7 | 2850 MIPS/core | 786 Kbit/s | 3627 Inst/bit |
| CC(2, 1, 7) | Proposed guideline |  |  | 1252 OP/bit |

7 Conclusion

In this work, complexity evaluations for LDPC layered decoding, Turbo BCJR decoding and CC Viterbi decoding are provided. Closed-form expressions for these coding types are offered with a variety of configurable parameters. The complexity of these algorithms is compared under the configurations of wireless standards. The reference implementations are compared with the proposed results, which shows that current SDR platforms still have room to achieve higher decoding efficiency. The proposed pseudo-codes, parallel schemes and operation counts may assist the architecture selection and software design of future software defined FEC platforms.

References

  1. Van Berkel, C.H. (2009). Multi-core for mobile phones. In Design, Automation & Test in Europe Conference & Exhibition (DATE '09) (pp. 1260–1265).
  2. Abburi, K.K. (2011). A scalable LDPC decoder on GPU. In 24th International Conference on VLSI Design (pp. 183–188).
  3. Wang, G., Wu, M., Yin, B., & Cavallaro, J.R. (2013). High throughput low latency LDPC decoding on GPU for SDR systems. In Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP).
  4. Falcao, G., Sousa, L., & Silva, V. (2011). Massively LDPC decoding on multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 22(2), 309–322.
  5. Lechner, G., Sayir, J., & Rupp, M. (2004). Efficient DSP implementation of an LDPC decoder. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04) (Vol. 4, pp. iv-665–iv-668).
  6. Jiao, X., Chen, C., Jaaskelainen, P., Guzma, V., & Berg, H. (2013). A 122 Mb/s turbo decoder using a mid-range GPU. In 9th International Wireless Communications and Mobile Computing Conference (IWCMC) (pp. 1090–1094).
  7. Wu, M., Sun, Y., Wang, G., & Cavallaro, J.R. (2011). Implementation of a high throughput 3GPP turbo decoder on GPU. Journal of Signal Processing Systems, 65(2), 171–183.
  8. Song, Y., Liu, G., et al. (2005). The implementation of turbo decoder on DSP in W-CDMA system.
  9. Tan, K., He, L., Zhang, J., Zhang, Y., Ji, F., & Voelker, G.M. (2011). Sora: high-performance software radio using general-purpose multi-core processors. Communications of the ACM, 54(1), 99–107.
  10. Lin, Y., Lee, H., Woh, M., Harel, Y., Mahlke, S., Mudge, T., Chakrabarti, C., & Flautner, K. (2006). SODA: a low-power architecture for software radio. In 33rd International Symposium on Computer Architecture (ISCA '06) (pp. 89–101).
  11. Pan, C., Bagherzadeh, N., Kamalizad, A.H., & Koohi, A. (2003). Design and analysis of a programmable single-chip architecture for DVB-T base-band receiver. In Design, Automation and Test in Europe Conference and Exhibition (pp. 468–473).
  12. Ohkubo, N., Miki, N., Kishiyama, Y., Higuchi, K., & Sawahashi, M. (2006). Performance comparison between turbo code and rate-compatible LDPC code for evolved UTRA downlink OFDM radio access. In IEEE Military Communications Conference (MILCOM 2006) (pp. 1–7).
  13. Kienle, F., Wehn, N., & Meyr, H. (2011). On complexity, energy- and implementation-efficiency of channel decoders. IEEE Transactions on Communications, 59(12), 3301–3310.
  14. Dielissen, J., Engin, N., Sawitzki, S., & van Berkel, K. (2008). Multistandard FEC decoders for wireless devices. IEEE Transactions on Circuits and Systems, 284–288.
  15. Mansour, M.M. (2006). A turbo-decoding message-passing algorithm for sparse parity-check matrix codes. IEEE Transactions on Signal Processing, 54(11), 4376–4392.
  16. Hu, X.-Y., Eleftheriou, E., Arnold, D.-M., & Dholakia, A. (2001). Efficient implementations of the sum-product algorithm for decoding LDPC codes. In IEEE Global Telecommunications Conference (GLOBECOM '01) (Vol. 2, pp. 1036–1036E).
  17. Wang, Z., & Cui, Z. (2007). Low-complexity high-speed decoder design for quasi-cyclic LDPC codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(1), 104–114.
  18. Takeshita, O.Y. (2006). On maximum contention-free interleavers and permutation polynomials over integer rings. IEEE Transactions on Information Theory, 52(3), 1249–1253.
  19. Sun, Y., & Cavallaro, J.R. (2011). Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder. Integration, the VLSI Journal, 44(4), 305–315.
  20. Sani, A.H., Coussy, P., & Chavet, C. (2013). A first step toward on-chip memory mapping for parallel turbo and LDPC decoders: a polynomial time mapping algorithm. IEEE Transactions on Signal Processing, 61(16), 4127–4140.
  21. Tarable, A., Benedetto, S., & Montorsi, G. (2004). Mapping interleaving laws to parallel turbo and LDPC decoder architectures. IEEE Transactions on Information Theory, 50(9), 2002–2009.
  22. Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M.P.C., & Hu, X.-Y. (2005). Reduced-complexity decoding of LDPC codes. IEEE Transactions on Communications, 53(8), 1288–1299.
  23. Lin, S., & Costello, D.J. (2004). Error Control Coding (Vol. 123). Englewood Cliffs: Prentice-Hall.
  24. Sun, Y., & Cavallaro, J.R. (2008). A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards. In IEEE International SOC Conference (pp. 367–370).
  25. Cavallaro, J.R., & Vaya, M. (2003). Viturbo: a reconfigurable architecture for Viterbi and turbo decoding.
  26. Gentile, G., Rovini, M., & Fanucci, L. (2010). A multi-standard flexible turbo/LDPC decoder via ASIC design. In 6th International Symposium on Turbo Codes and Iterative Information Processing (ISTC) (pp. 294–298).
  27. Wu, Z., & Liu, D. (2014). Flexible multistandard FEC processor design with ASIP methodology. In IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (pp. 210–218).
  28. Lee, D., Wolf, M., & Kim, H. (2010). Design space exploration of the turbo decoding algorithm on GPUs. In Proceedings of the 2010 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES '10) (pp. 217–226). New York: ACM.
  29. Reddy, D., Yoge, N., & Chandrachoodan, N. (2012). GPU implementation of a programmable turbo decoder for software defined radio applications. In VLSI Design '12 (pp. 149–154).
  30. Wu, M., Sun, Y., & Cavallaro, J.R. (2010). Implementation of a 3GPP LTE turbo decoder accelerator on GPU (pp. 192–197).
  31. Wen, X., Jiao, X., Jaaskelainen, P., Kultala, H., Chen, C., Berg, H., & Zhisong, B. (2014). A high throughput LDPC decoder using a mid-range GPU. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7515–7519).
  32. Wang, G., Wu, M., Sun, Y., & Cavallaro, J.R. (2011). A massively parallel implementation of QC-LDPC decoder on GPU. In SASP '11 (pp. 82–85).
  33. Wang, G., Wu, M., Sun, Y., & Cavallaro, J.R. (2011). GPU accelerated scalable parallel decoding of LDPC codes. In Conference Record of the 45th Asilomar Conference on Signals, Systems and Computers (ASILOMAR) (pp. 2053–2057).
  34. Kang, S., & Moon, J. (2012). Parallel LDPC decoder implementation on GPU based on unbalanced memory coalescing. In IEEE International Conference on Communications (ICC) (pp. 3692–3697).
  35. Falcao, G., Silva, V., Sousa, L., & Andrade, J. (2012). Portable LDPC decoding on multicores using OpenCL [applications corner]. IEEE Signal Processing Magazine, 29(4), 81–109.
  36. Texas Instruments. C6000 benchmarks. http://www.ti.com/sc/docs/products/dsp/c6000/62bench.htm. Accessed 2016-06-10.
  37. Lai, J., & Chen, J. (2008). High performance Viterbi decoder on Cell BE. In Proceedings of the 1st International Workshop on Software Radio Technology (SRT2008).
  38. SPIRAL. Viterbi decoder software generator. http://www.spiral.net/software/viterbi.html. Accessed 2016-06-10.
  39. Wikipedia. Instructions per second. http://en.wikipedia.org/wiki/Instructions_per_second. Accessed 2016-06-10.

Copyright information

© The Author(s) 2016

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. School of Information and Electronics, Beijing Institute of Technology, Beijing, China
  2. Institute of Electrical Engineering, Linköping University, Linköping, Sweden
