1 Introduction

High throughput Software Defined Radio (SDR) platforms encounter extremely high computational complexity in FEC decoding. C.H. Kees [1] presented a rough estimate of the FEC workload for current standards, showing graphically that approximately 100-3000 Operations per Bit (OP/bit) are needed, occupying approximately 40 % of the baseband workload. The attainable throughputs for LDPC have been reported on GPU [2, 3], the IBM Cell processor [4] and DSP [5]. Turbo decoding throughputs are provided for GPU platforms [6, 7] and a DSP platform [8]. The cycle counts for Viterbi decoding are provided for general purpose processors [9, 10], the SORA platform [9], the SODA platform [10] and a DVB-T receiver [11]. The arithmetic operations for Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) and Turbo codes are provided and compared in [12], which shows that QC-LDPC codes have significantly lower complexity than Turbo codes. However, these results are provided without sufficient proof, especially for arbitrary configurations. Apart from SDR evaluations, hardware-based FEC complexity is given in [13] and [14], whereas individual hardware results may be inaccurate for SDR platforms.

Among these proposals, little theoretical analysis of complexity is provided, so the relationship between the complexity and the algorithm remains unclear. Moreover, because most cases are implementations with specific configurations, a general complexity expression for arbitrary configurations (such as code length and parallelism) has not been clearly presented, which hinders the design and optimization of SDR platforms. For algorithm researchers, such complexity results are also necessary for designing highly efficient, low-complexity codes.

Based on these requirements, this work aims to obtain a likely lower bound in Giga Operations per Second (GOPS) or OP/bit using analytical methods. The 'likely lower bound' refers to an approximate lower bound that includes the most probable memory access and parallel execution procedures, which is more valuable for platform designers. The work consists of the following aspects.

Firstly, most papers addressing FEC decoding concern only the algorithm-level description, which lacks the accuracy required for complexity evaluation. Besides the arithmetic computation, the addressing, memory accesses and shuffling network all need to be considered. In addition, the parallel execution procedure for high throughput decoding needs to be taken into account as well. For these reasons, hardware-friendly, highly accurate pseudo-codes are required and provided in this work.

Secondly, there are many alternative decoding algorithms, and it is hard to cover all of them in an evaluation. In this paper, the mainstream kernel functions and alternative decoding algorithms are listed and compared, and the ones with minimum operations are selected. Unlike other papers that provide experimental results based on hardware architecture details, this work tries to minimize hardware specifics and provides a platform independent result. The evaluation is performed on the basic operation level (the basic operations are given in Table 3) instead of the instruction level or hardware circuit level.

Thirdly, after the complexity is derived for each algorithm, the different coding schemes need to be compared using parameters from wireless standards, which offers a general overview of the decoding efficiency. In addition, the operations for each decoding step are broken down as well.

The paper is organized as follows. Section 2 lists all the pseudo-codes and the decoding procedure descriptions. Section 3 provides the operation analysis for each algorithm. Section 4 provides the comparison among the algorithms and the percentage operation cost of each decoding step. Section 5 illustrates reliable prototypes and assumptions for attaining the proposed results. Section 6 compares the proposed results with state-of-the-art SDR platforms.

2 Pseudo-Codes Analysis

To calculate the operations for each decoding step, hardware-friendly pseudo-codes are provided first. The code sections marked with 'for x do in parallel' can be executed by independent Processing Elements (PEs) in parallel. Variables stored in memory are denoted with a 'bar' over the variable name; they require load and store procedures to access. Variables without a 'bar' are register variables, which are only valid in the current parallel block and are usually small. For a multi-PE platform, except for the shared memory variable \(\overline {L}(i, n)\), which is accessed by all PEs, the other memory variables can be stored locally in each PE.

Let \(\mathbf{x}=[x_{0},x_{1},\cdots,x_{N-1}]\) be the transmitted information codeword and \(\mathbf{y}=[y_{0},y_{1},\cdots,y_{N-1}]\) be the received codeword. Let \(LLR(n)=\log(P(x_{n}=1\mid y_{n})/P(x_{n}=0\mid y_{n}))\) be the log-likelihood ratio (LLR) of each received message, which is the input of the algorithm. The decoded bits can be viewed as an estimate of x, which is the output of the algorithm, denoted as \(\hat {\mathbf {x}}=[\hat {x}_{0},\hat {x}_{1},\cdots ,\hat x_{N-1}]\).

2.1 LDPC Decoding

Many high-throughput wireless standards such as IEEE 802.16e and IEEE 802.11n apply Quasi-Cyclic LDPC (QC-LDPC) because of its intrinsic suitability for highly parallel execution. Hence we confine our research mainly to QC-LDPC. The efficient algorithms for QC-LDPC decoding are Belief-Propagation (also called Message Passing) and layered decoding [15]. Layered decoding updates the variable nodes as soon as the check node updates for the current row are calculated, instead of waiting for all the equations, and thereby saves about half of the iterations. Hence layered decoding is adopted for the following analysis, shown in Algorithm 1.

[Algorithm 1: QC-LDPC layered decoding pseudo-code]

In Algorithm 1, a QC-LDPC code \(QC(J,N)\) is processed, which has N decoded bits, J variable nodes (channel messages) and M check nodes (\(M=J-N\)). It can be described by a parity-check matrix \(\mathbf{H}_{M\times J}\). The matrix H can be divided into \(L\times C\) sub-matrices of degree Z, in which \(L=M/Z\) and \(C=J/Z\). \(\mathbf{H}^{B}_{L\times C}\) represents the base matrix, whose elements are either zero or equal to the cyclic-shift value of the identity matrix. Let \(\mathbf{S}_{r,k}\) represent the kth non-zero element (NZE) in row r of \(\mathbf{H}^{B}\); the shift value can be obtained as \(\mathbf{H}^{B}_{r,\mathbf{S}_{r,k}}\), simplified as \(\mathbf{R}_{r,k}\). \(S_{r}\) denotes the number of NZEs in row r. A p-layer consists of Z layers which can be processed in parallel.

The decoding is performed in an iterative manner, and the maximum number of iterations is \(it_{max}\). In each iteration, L p-layers are processed one by one, where r denotes the index of the current p-layer being processed. In each p-layer, the NZEs are processed sequentially, as shown in Fig. 1.

Figure 1: QC-LDPC parity check matrix and layered decoding routine.

The check-node update can be realized by Forward-Backward Recursion (FBR), and the principle can be briefly summarized as follows. The LLR of the modulo-2 addition of two LLR values is given by \(LLR(x\oplus y)=f(x,y)=\log(1+e^{x+y})-\log(e^{x}+e^{y})\) [16]. The LLR for multiple input elements can be calculated with the 2-input f(x,y) function as \(f(x_{1},x_{2},\cdots,x_{n})=LLR(x_{1}\oplus x_{2}\oplus\cdots\oplus x_{n})=LLR(((x_{1}\oplus x_{2})\oplus x_{3})\oplus\cdots\oplus x_{n})=f(f(f(x_{1},x_{2}),x_{3}),\cdots,x_{n})\). The check-node update can be represented by \({\Lambda }_{r,k}^{z} = LLR({\gamma _{1}^{z}} \oplus {\gamma _{2}^{z}} \oplus \cdots \oplus \gamma _{k-1}^{z} \oplus \gamma _{k+1}^{z} \oplus {\cdots } \oplus \gamma _{S_{r}-1}^{z}) = f({\gamma _{1}^{z}}, {\gamma _{2}^{z}}, \cdots , \gamma _{k-1}^{z}, \gamma _{k+1}^{z}, \cdots , \gamma _{S_{r}-1}^{z})\). The LLR of the first k messages is calculated by \({\alpha _{k}^{z}}=f(f(f({\gamma _{1}^{z}}, {\gamma _{2}^{z}}), {\gamma _{3}^{z}}),\cdots, {\gamma _{k}^{z}})\), which can be computed by Forward Recursion (FR) using \({\alpha _{k}^{z}}= f(\alpha _{k-1}^{z}, {\gamma _{k}^{z}})\). The LLR of the last messages, originally \({\beta _{k}^{z}}=f(f(f(\gamma _{S_{r}-1}^{z}, \gamma _{S_{r}-2}^{z}), \gamma _{S_{r}-3}^{z}),\cdots, {\gamma _{k}^{z}})\), can be obtained by Backward Recursion (BR) following \({\beta _{k}^{z}}= f(\beta _{k+1}^{z}, {\gamma _{k}^{z}})\), and finally the check-node update can be performed as \({\Lambda }_{r,k}^{z} = f({\alpha _{k}^{z}}, \beta _{k+1}^{z})\). Further information related to FBR can be found in [17] and [16].
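To make the FBR concrete, the following is a minimal Python sketch of one check-node update, assuming the min-sum base kernel \(f_{b}\) from Section 2.4; the function and variable names are illustrative and not part of the paper's pseudo-code.

```python
def f_min_sum(x, y):
    # min-sum approximation of f(x, y) = log(1 + e^(x+y)) - log(e^x + e^y)
    sx = 1.0 if x >= 0 else -1.0
    sy = 1.0 if y >= 0 else -1.0
    return sx * sy * min(abs(x), abs(y))

def check_node_update_fbr(gamma):
    """Forward-backward recursion over the S_r messages 'gamma' of one row;
    returns the extrinsic messages Lambda_k, each excluding gamma_k."""
    S = len(gamma)
    alpha = [0.0] * S                      # alpha[k] = f(gamma_0, ..., gamma_k)
    alpha[0] = gamma[0]
    for k in range(1, S):                  # forward recursion
        alpha[k] = f_min_sum(alpha[k - 1], gamma[k])
    beta = [0.0] * S                       # beta[k] = f(gamma_k, ..., gamma_{S-1})
    beta[S - 1] = gamma[S - 1]
    for k in range(S - 2, -1, -1):         # backward recursion
        beta[k] = f_min_sum(beta[k + 1], gamma[k])
    lam = [0.0] * S                        # combine prefix and suffix around k
    lam[0] = beta[1]
    lam[S - 1] = alpha[S - 2]
    for k in range(1, S - 1):
        lam[k] = f_min_sum(alpha[k - 1], beta[k + 1])
    return lam
```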

In the FR part, Lines 7 and 8 can instead be executed by a block data read operation and a block data permutation, where Z data are fetched together in normal sequence. The data are then shifted by a permutation network (cyclic shifters), which is given by \(I_{org}^{z} \leftarrow \overline {L}(z, \mathbf {S}_{r,k}), z\in [0, Z-1]\), and \(I^{z} \leftarrow {\Phi }(I_{org}^{z}, \mathbf {R}_{r, k})\). Line 24 (along with Line 17) can alternatively be executed by a block transmission as well, given by \(O_{org}^{z} \leftarrow {\Phi }^{-1}(O^{z}, \mathbf {R}_{r,k})\), and \(\overline {L}(z,\mathbf {S}_{r,k}) \leftarrow O_{org}^{z}\).
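As an illustration of the block fetch and the cyclic-shift permutation Φ, a short sketch follows; np.roll stands in for the hardware cyclic shifter, and the shift direction and memory layout (Z rows by C columns) are assumptions for illustration.

```python
import numpy as np

def fetch_and_permute(L_mem, col, shift):
    """Block read of Z messages in normal order, then the cyclic shift Phi."""
    I_org = L_mem[:, col].copy()       # I_org^z <- L(z, S_rk), z in [0, Z-1]
    return np.roll(I_org, -shift)      # I^z <- Phi(I_org^z, R_rk)

def inverse_permute_and_store(L_mem, col, shift, O):
    """Inverse shift Phi^{-1} followed by the block write-back."""
    L_mem[:, col] = np.roll(O, shift)  # L(z, S_rk) <- Phi^{-1}(O^z, R_rk)
```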

In Algorithm 1 and the following algorithms in this article, note that:

  (1).

    The notation \({\leftarrow }\) represents 'assigned to', which means the calculation result on the right side updates the variable on the left side.

  (2).

    The register update procedure and memory access procedure are represented separately. For example, the forward path update \(\overline {\alpha }_{k}^{z}\leftarrow f(\overline {\alpha }_{k}^{z},\gamma ^{z})\) is split into \(\overline {\alpha }_{k}^{z} \leftarrow \alpha ^{z}\) and \(\alpha ^{z} \leftarrow f(\alpha ^{z},\gamma ^{z})\), where \(\overline {\alpha }_{k}^{z}\) is a memory variable with k as the address. If there is no need to store to memory (such as for β), then no memory access procedure is needed.

  (3).

    The subscript k may be omitted for variables within the kth recursion body when there is no confusion.

There is an alternative decoding approach named the improved Min-Sum algorithm, in which the iteration body of Algorithm 1 is substituted by Algorithm 2. In the forward recursion, only five values are recorded: the minimum and second-minimum magnitudes of the a-priori messages, the index of the minimum value, the sign of each message, and the product of all signs. The backward recursion part is redesigned in a non-recursive way: the a-posteriori messages are assigned the minimum or second-minimum magnitude, together with the product of the signs of all other messages. It is a reduced-complexity method and is beneficial for hardware design. However, for a programmable platform, the branch and sign operations may consume more resources, so Algorithm 2 introduces more workload than Algorithm 1. Hence Algorithm 1 is applied for the complexity evaluation.

[Algorithm 2: improved Min-Sum iteration pseudo-code]
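For comparison, a sketch of the Algorithm 2 style non-recursive check-node update under min-sum follows; one forward pass records the five values, and the output assignment replaces the backward recursion. Names are illustrative.

```python
def check_node_update_two_min(gamma):
    min1, min2, idx = float('inf'), float('inf'), -1
    sign_prod, signs = 1.0, []
    for k, g in enumerate(gamma):          # single pass: record the five values
        s = 1.0 if g >= 0 else -1.0
        signs.append(s)
        sign_prod *= s
        mag = abs(g)
        if mag < min1:
            min1, min2, idx = mag, min1, k
        elif mag < min2:
            min2 = mag
    # each output takes min1 (or min2 at the argmin position) together with
    # the product of all signs except its own
    return [(sign_prod * signs[k]) * (min2 if k == idx else min1)
            for k in range(len(gamma))]
```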

2.2 Turbo Decoding

Turbo codes are an efficient coding technique approaching the Shannon limit. Turbo meets the needs of high-throughput wireless applications through its parallel decoding ability, and has been widely adopted in many standards such as 3GPP-LTE(A), HSPA, CDMA2000, and IEEE 802.16e. In this work, only the 8-state Parallel Concatenated Convolutional Code (PCCC) Turbo is considered, because most of the widely adopted standards (listed above) are based on it.

In the following, the BCJR (Bahl-Cocke-Jelinek-Raviv) algorithm with parallel processing is investigated, shown in Algorithm 3. Let N denote the information message length, which is encoded into 3N transmitted bits. Let \(\overline {L}(i, n), i=0,1,2\) denote the 3nth to (3n+2)th entries of the LLR sequence. Because each iteration consists of two constituent maximum a-posteriori (MAP) decodings, we introduce a variable r=0,1 to distinguish the first half iteration (MAP1) from the second (MAP2). The first MAP decoding uses \(\overline {L}(0, k), \overline {L}(1, k)\) and the a-posteriori messages from MAP2, \(\overline {L}(i_{0}, k)\), as input, and \(\overline {L}(i_{1}, k)\) and \(\overline {L}(i_{L}, k)\) as output, where k is the current FBR step. The second MAP decoding uses \(\overline {L}(0, k_{inv}), \overline {L}(2, k)\) and \(\overline {L}(i_{1}, k_{inv})\) as input, and \(\overline {L}(i_{0}, k)\) and \(\overline {L}(i_{L}, k)\) as output. \(i_{0}\) and \(i_{1}\) are arbitrary constants used as bank labels, such as \(i_{0}=3\), \(i_{1}=4\). \(\overline {L}(i_{L}, k)\) is only required in the last MAP2 procedure, in which case Line 37 is applied instead of Line 36 in Algorithm 3; otherwise Line 37 is skipped.

[Algorithm 3: parallel Turbo BCJR decoding pseudo-code]

In parallel decoding, the received codewords are split into P slices, each of length L. P a-posteriori messages are read from a group of memory banks with both sequential and interleaved addresses. The data are arranged in a sequential pattern, meaning that \(\overline {L}(i, n)\) is stored in the ⌊n/L⌋th bank at address mod(n,L). If accessed with an interleaved address, the output has to be reordered by a permuter. An interleaved address may also incur a memory access conflict when more than one of the data being fetched is located in the same memory bank. A contention-free interleaver, such as the Quadratic Permutation Polynomial (QPP) interleaver applied in 3GPP-LTE(A), can avoid such situations when N is divisible by the parallelism P [18, 19]. In this case, the interleaved address for each memory datum is \({\prod }_{r}(n)=mod(Q_{r}(n), L)\), and the permutation route for each datum is \({\coprod }_{r}(n)= Q_{r}(n)/L\), where \(Q_{1}(n)=n\) and \(Q_{2}(n)\) is the QPP interleaving function. For other interleavers, the data can be rearranged to achieve conflict-freedom using a graph coloring algorithm [20] or an annealing procedure [21], from which the addresses \({\prod }_{r}(n)\) and permutation routes \({\coprod }_{r}(n)\) are obtained as well. The permutation routes are needed for both MAP1 and MAP2, and the memory access and permutation procedures remain the same as in the conflict-free case.
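A sketch of the contention-free address generation follows, assuming the 3GPP QPP polynomial \(Q_{2}(n)=(f_{1}n+f_{2}n^{2})\bmod N\) with the standard coefficients for N=40 (\(f_{1}=3\), \(f_{2}=10\)); the final loop checks that each step accesses distinct banks.

```python
def qpp(n, N=40, f1=3, f2=10):
    # 3GPP-LTE QPP interleaver Q2(n); f1, f2 are the standard values for N = 40
    return (f1 * n + f2 * n * n) % N

def banks_and_addresses(k, N, P):
    """For FBR step k, the (bank, address) pair accessed by each of the P PEs."""
    L = N // P                                  # window length per PE
    targets = [qpp(p * L + k, N) for p in range(P)]
    return [(q // L, q % L) for q in targets]   # route (coprod) and address (prod)

# contention-free check: the P accesses of every step hit distinct banks
N, P = 40, 4
for k in range(N // P):
    banks = [b for b, _ in banks_and_addresses(k, N, P)]
    assert len(set(banks)) == P
```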

The BCJR algorithm is applied for each MAP half iteration. It consists of three main steps: FR, BR and the extrinsic a-posteriori calculation. In the forward recursion, the branch metrics are calculated and the previous α metrics are fetched. The α for each trellis step k is then calculated using the kernel function and stored. In addition, the input data for all k are stored locally in order to be supplied to the BR. The BR begins at the end of the FR, and β is calculated. The a-posteriori and extrinsic messages for the next half iteration are then obtained. In each step, S represents the current trellis state, and \(S^{\prime }_{i},i=0,1\) represent the previous states connected to the current state. The function \(Trellis(S^{\prime }_{i}, S)\) outputs the information bit u, the first output branch \(v_{0}\) (\(v_{0}\) equals u when the component code is a Systematic Convolutional Code) and the second output branch \(v_{1}\). \((S,S^{\prime })\in {\sum }^{+}\) denotes all the branches which output bit 1, and \((S,S^{\prime })\in {\sum }^{-}\) denotes all the branches which output bit 0.
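A minimal sketch of one max-log-MAP forward recursion step for an 8-state trellis is shown below; the 'trellis' table (two incoming branches per state with their labels) and the branch metric function 'bm' are assumed inputs, not the paper's exact pseudo-code.

```python
def alpha_step(alpha_prev, k, trellis, bm, Ts=8):
    """One FR step: alpha_k(S) is the maximum over the incoming branches of
    alpha_{k-1}(S') + branch metric (max-log-MAP: max replaces max*)."""
    alpha = [0.0] * Ts
    for S in range(Ts):
        cands = [alpha_prev[S_prev] + bm(k, u, v0, v1)
                 for (S_prev, u, v0, v1) in trellis[S]]  # two incoming branches
        alpha[S] = max(cands)
    a0 = alpha[0]        # normalization keeps metrics bounded in fixed point
    return [a - a0 for a in alpha]
```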

For parallel decoding, α and β are discontinuous when the codeword is separated into SISOs and decoded in parallel. Therefore the Next Iteration Initialization (NII) [7] method, also called State Metric Propagation (SMP) [19], is applied. The FR initial value \(\overline {\alpha }_{0}^{p}\) is the final FR value \(\overline {\alpha }_{L-1}^{p-1}\) of the previous iteration, and the BR initial value \(\overline {\beta }_{L-1}^{p}\) is the final BR value \(\overline {\beta }_{0}^{p+1}\) of the previous iteration, as depicted in Fig. 2. If there is no previous iteration, the initial values are set to zero. The messages may come from neighbouring PEs, which leads to inter-PE data transfers or global memory accesses.

Figure 2: Turbo decoding NII message passing among processing elements.

2.3 Convolutional Code Viterbi Decoding

Let \(CC(m,1,M)\) denote a CC with code rate 1/m, N information bits (decoded bits) and constraint length M (\(T_{s}=2^{M-1}\) states). The algorithm description is shown in Algorithm 4. The channel messages are split into several blocks, overlapped by the traceback (TB) length, and decoded by separate PEs. The overlapped area serves the traceback procedure of each PE.

[Algorithm 4: parallel Viterbi decoding pseudo-code]

Branch Metric Update (BMU) and Add-Compare-Select (ACS) are the two main kernels in the FR. In each FR step k, the BMU is performed to obtain the Euclidean distance between the received messages and the local trellis outputs. The distances (branch metrics) are added to the previous path metrics. The two branches \(S_{0}^{\prime }\) and \(S_{1}^{\prime }\) connecting to the current state S are compared, and the one with the smaller metric is selected (the transition bit is recorded as sel). After k reaches \(L+L_{TB}\), the traceback procedure starts. The survival state calculated in the previous step is used to address the survival path memory and obtain the transition bit β. The survival state is then updated using the previous state and the current β. This procedure is performed recursively. Finally the Least Significant Bit (LSB) of \(S_{p}\) is stored back as the decoded bit.
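The sketch below illustrates one ACS step and the traceback loop; the predecessor table 'prev' and the branch metric table 'bm' are assumed inputs with illustrative layouts.

```python
def acs_step(pm_prev, prev, bm):
    """prev[S] = (S0', S1'); bm[(S', S)] = branch metric; returns the new
    path metrics and the selection (transition) bits."""
    Ts = len(pm_prev)
    pm, sel = [0.0] * Ts, [0] * Ts
    for S in range(Ts):
        m0 = pm_prev[prev[S][0]] + bm[(prev[S][0], S)]
        m1 = pm_prev[prev[S][1]] + bm[(prev[S][1], S)]
        pm[S], sel[S] = (m0, 0) if m0 <= m1 else (m1, 1)  # keep smaller metric
    return pm, sel

def traceback(sel_history, prev, S_start, depth):
    """Follow the survivor path backwards; the LSB of each state is the bit."""
    S, bits = S_start, []
    for sel in reversed(sel_history[-depth:]):
        bits.append(S & 1)            # decoded bit = LSB of the survivor state
        S = prev[S][sel[S]]           # step back along the recorded transition
    return list(reversed(bits))
```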

2.4 Low Complexity Kernel Functions

The original kernel function f(x,y) for LDPC and Turbo is hard to implement in a fixed data format and is sensitive to quantization error. Therefore several frequently used low complexity approximations are discussed here in detail. For LDPC, f(x,y) can be written as the sum of a linear part and a correction part, given by \(f(x,y)=f_{b}(x,y)+f_{c}(x,y)\). Following the Jacobian approach, \(f_{b}(x,y)=sign(x)\cdot sign(y)\cdot min(|x|,|y|)=max(x+y,0)-max(x,y)\), and \(f_{c}(x,y)=\log(1+e^{-|x+y|})-\log(1+e^{-|x-y|})\). \(f_{c}(x,y)\) can be implemented by a Look-Up Table (LUT) or a polynomial approximation, shown as \(f_{c}^{L1}\) and \(f_{c}^{L2}\) in Table 2. The Turbo kernel function is part of the LDPC kernel, and the low complexity methods applied for LDPC are also suitable for Turbo.

If only the linear part is used with no correction part, the method is called the min-sum algorithm. A refined version of min-sum is the offset-min-sum algorithm, which is defined as

$${\Lambda}_{k}=\prod\limits_{i \neq k, i\in 0:S_{r}-1}sign(\gamma_{i})\cdot \max\left\{\min\limits_{i\neq k,i\in 0:S_{r}-1}|\gamma_{i}|-\mu,0\right\}$$

where μ is a small positive constant. Similarly, the scaled-min-sum algorithm is an alternative approach, given by

$${\Lambda}_{k}=\mu\cdot\prod\limits_{i \neq k, i\in 0:S_{r}-1}sign(\gamma_{i})\cdot \min\limits_{i\neq k,i\in 0:S_{r}-1}|\gamma_{i}|$$

where μ is a scaling factor (less than 1). With these methods, the extrinsic messages Λ are updated with the offset-/scaled-min-sum function, whereas the FR messages α and BR messages β are updated with the original min-sum function.
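A numerical sketch of the kernel decomposition \(f=f_{b}+f_{c}\) follows; the exact logarithmic correction is evaluated directly here, standing in for the LUT versions \(f_{c}^{L1}\)/\(f_{c}^{L2}\).

```python
import math

def f_exact(x, y):   # original kernel: log(1+e^(x+y)) - log(e^x + e^y)
    return math.log(1 + math.exp(x + y)) - math.log(math.exp(x) + math.exp(y))

def f_base(x, y):    # linear part: max(x+y, 0) - max(x, y) = sign*sign*min
    return max(x + y, 0.0) - max(x, y)

def f_corr(x, y):    # correction part; an implementation replaces this with LUTs
    return math.log(1 + math.exp(-abs(x + y))) - math.log(1 + math.exp(-abs(x - y)))

# the decomposition is exact up to floating point rounding
x, y = 1.0, 2.0
assert abs(f_exact(x, y) - (f_base(x, y) + f_corr(x, y))) < 1e-12
```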

In summary, the base part \(f_{b}(x,y)\) for the three algorithms can be chosen from Table 1, and the correction part \(f_{c}(x,y)\) can be chosen from Table 2.

Table 1 Base part alternatives in kernel function.
Table 2 Correction part alternatives in kernel function.

3 Computational Analysis

3.1 Proposed Evaluation Method and Platform Independent Assumptions

The computational complexity is evaluated in Giga Operations per Second (GOPS) or Operations per decoded bit (OP/bit), where an 'Operation' refers to one of the basic operations listed in Table 3. These are the hardware-architecture-independent basic computation units from which the pseudo-codes are constructed, and each of them costs one operation. Therefore the number of operations can be derived from the pseudo-codes based on these unit operations. One operation may require several instructions, depending on the instruction-set specification. The 'Complex arithmetic' category in the table is a special class of computations which requires much more computational resources and floating point support; it is used only for comparing kernel function alternatives.

Table 3 List of unit operations for the computational complexity evaluation.

There are several assumptions needed to achieve a platform independent evaluation.

  (1).

    To unify the evaluation of branch cost, it is assumed that all the loops are unrolled and the loop branch overhead is zero;

  (2).

    The memory capacity is assumed to be sufficient;

  (3).

    A permutation can be realized by a load-like instruction with the permutation route as the offset address, which costs one operation; the permutation of a set of data may also be executed by dedicated hardware such as a crossbar network;

  (4).

    The data are assumed to be in floating point format, or in fixed point format with sufficient data width, hence no overflow protection is included.

In the following analysis, the number of operations of the kernel function f(x,y) alternatives is analysed first, then the number of operations of a single FR/BR step is evaluated. With the number of FR/BR iterations, the total decoding complexity is finally derived.

3.2 LDPC Decoding Computational Complexity Analysis

  (1).

    Kernel operation

    The operations for the possible kernel functions are listed in Table 4. It shows that the minimum number of operations is 4, whereas all the kernels can be calculated within 9 operations. The original function needs exponential and logarithm calculations, hence approximation kernels with simple operations and LUTs are recommended. For calculating \(f_{c}^{L1}\) and \(f_{c}^{L2}\), LUTs for the functions \(max(\frac {5}{8}-\frac {|x+y|}{4}, 0)\), \(max(\frac {5}{8}-\frac {|x-y|}{4}, 0)\), \(\log(1+e^{-|x+y|})\) and \(\log(1+e^{-|x-y|})\) are assumed to be provided. The addition operation connecting \(f_{b}\) (linear part) and \(f_{c}\) (correction part) is counted in. In the following, the 'min-sum' solution with f=4 is chosen for the overall decoding procedure complexity estimation.

  (2).

    Recursion kernel

    We divide the algorithm into an FR part and a BR part, and then estimate them separately. In Algorithm 1, Lines 7 to 15 form the FR part. In Table 5, the operations for the FR part are summarized, where block transmission is applied. The total number of operations in the FR part is \(FR=2+(6+f)\cdot Z\).

    Lines 16 to 26 of Algorithm 1 form the BR part. Table 6 shows the summary of operations, and the number of operations in the BR part is \(BR=(6+2f)\cdot Z\).

  (3).

    Loop structure and total operations

    There are up to \(it_{max}\) iterations. In each iteration, L p-layers (rows of \(\mathbf{H}^{B}\)) are processed sequentially. In each row there are \(S_{r}\) NZEs in total. For an irregular code, \(S_{r}\) is aligned to the maximum number of NZEs among all p-layers. This approximation is beneficial for parallel FR/BR alignment, and is also justified by the fact that the \(S_{r}\) difference among p-layers is at most one for the wireless standards IEEE 802.11n and IEEE 802.16e. Therefore the total number of decoding operations is

    $$\begin{array}{@{}rcl@{}} OP_{LDPC}&=&it_{max}\cdot L \cdot S_{r} \cdot (FR+BR) \\ &=&it_{max}\cdot L \cdot S_{r} \cdot (2+(12+3f)\cdot Z) \end{array} $$
    (1)
Table 4 Operations for calculating LDPC f(x,y) kernels.
Table 5 Operations for LDPC forward recursion kernel.
Table 6 Operations for LDPC backward recursion kernel.

Note that the statistics include the decoding kernel only. Early termination and the correctness check are not counted. The data input and output procedures are assumed to be handled by DMA instead of the processors, therefore the corresponding load/store operation cost is not included. The derivation assumes that Z PEs are available. However, as long as the number of PEs P is less than or equal to Z, the total number of operations is the same, whereas the Z layers may be updated in a partially parallel manner if PEs are not abundant.

For the maximum-throughput configuration of the IEEE 802.11n standard, the parameters are L=4, \(S_{r}=20\), and Z=81. Along with the selected parameters f=4 and \(it_{max}=6\), there are 934,080 operations for the 1620 decoded bits (1944 channel bits). Hence the computational complexity is 577 OP/bit. For a throughput of 450 Mb/s, 259 GOPS would be consumed.
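The short sketch below reproduces these figures from Eq. 1; parameter names mirror the equation.

```python
def op_ldpc(it_max, L, S_r, Z, f):
    return it_max * L * S_r * (2 + (12 + 3 * f) * Z)   # Eq. 1

ops = op_ldpc(it_max=6, L=4, S_r=20, Z=81, f=4)
print(ops)                                 # 934080 operations per codeword
print(round(ops / 1620))                   # 577 OP/bit for 1620 decoded bits
print(round(ops / 1620 * 450e6 / 1e9))     # 259 GOPS at 450 Mb/s
```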

In a similar way, the computational complexity of Algorithm 2 can be calculated. It turns out that Algorithm 2 consumes 6Z more operations than Algorithm 1 in the Check Node Update kernel (FBR kernel). Therefore we recommend Algorithm 1 for software defined (SD) decoding.

3.3 Turbo Decoding Computational Complexity Analysis

Table 7 shows the alternatives for the Turbo kernel function f(x,y). In this table, we assume that LUTs are available for \(max(\frac {5}{8}-\frac {|x+y|}{4}, 0)\) and \(\log(1+e^{-|x-y|})\). In most Turbo implementations the max-log-MAP algorithm is applied, and it is therefore selected for the following decoding complexity evaluation.

Table 7 Operations for calculating turbo f(x,y) kernels.

In Algorithm 3, the forward recursion spans Lines 9 to 22. It contains the memory accesses with permutation, the branch metric update and the forward recursions. Table 8 summarizes the operations in detail.

Table 8 Operations for turbo forward recursion kernel.

When the constituent code is systematic, as in the 3GPP standard where \(u=v_{0}\), a simplified solution for the BM calculation is shown below.

Step 1:

Calculate the trellis constraints, following \(t_{0}[0]=S[0]\oplus S[2]\), \(t_{0}[1]=S[0]\oplus S[1]\), \(t_{1}[0]=S[0]\oplus S[2]\oplus 1\), and \(t_{1}[1]=S[0]\oplus S[1]\oplus 1\), where ⊕ represents XOR logic.

Step 2:

Obtain \(g_{0}\) and \(g_{1}\) by \(p=L_{a}+r_{0}\), \(g_{0}=(p+r_{1})/2\), and \(g_{1}=(-p+r_{1})/2\).

Step 3:

Output the branch messages following Table 10.

Although this method has a similar number of operations to the original method, for all 16 branches (\(T_{s}=8\)) there are only four independent branch metric values associated with all possible \(t_{i}\). In addition, two of them are the negatives of the other two. Therefore the BM calculation in FR/BR consumes 8 operations per step in each PE.
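A sketch of Steps 1-3 producing the four independent branch metric values is given below; the mapping of the (u, v1) bit pairs to the signed metrics follows the ±1 convention and is an assumption for illustration.

```python
def branch_metrics(La, r0, r1):
    """Simplified BM calculation: r0, r1 are the systematic and parity LLRs,
    La is the a-priori LLR of the current trellis step."""
    p = La + r0
    g0 = (p + r1) / 2.0        # branches with (u=1, v1=1)
    g1 = (-p + r1) / 2.0       # branches with (u=0, v1=1)
    # only four independent values; two are negatives of the other two
    return {(1, 1): g0, (0, 1): g1, (0, 0): -g0, (1, 0): -g1}
```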

It is derived from Tables 8 and 9 that the total number of FR operations is \((19+(3+f)\cdot T_{s})\cdot P\), and the number of BR operations is \((17+(7+3f)\cdot T_{s})\cdot P\). With the number of half iterations (\(2it_{max}\)), the recursion window length L, and the parallelism P (where \(L\cdot P=N\)), the total number of operations is calculated by

$$\begin{array}{@{}rcl@{}} OP_{Turbo}&=&2it_{max}\cdot L \cdot (FR+BR) \\ &=&2it_{max}\cdot N \cdot (36+(10+4f)\cdot T_{s}) \end{array} $$
(2)
Table 9 Operations for turbo backward recursion kernel.

Take the 3GPP-LTE(A) Turbo code (\(T_{s}=8\)) with N=6144 decoded bits as an example. When the max-log-MAP kernel function (f=1) and \(it_{max}=6\) are selected, the total number of operations for the decoding is 10,911,744. Therefore the computational complexity is 1776 OP/bit (Table 10). For a decoding throughput of 150 Mbit/s, the computation would be 266 GOPS. Note that in this evaluation, no pipeline stall caused by data dependency or memory access conflict is counted. The inter-PE message passing only happens at the border of the recursion, shown in Fig. 2; its complexity contribution is negligible and therefore not included.

Table 10 Complexity reduced turbo branch metric calculation.
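As a cross-check, the sketch below reproduces the example from Eq. 2.

```python
def op_turbo(it_max, N, Ts, f):
    return 2 * it_max * N * (36 + (10 + 4 * f) * Ts)   # Eq. 2

ops = op_turbo(it_max=6, N=6144, Ts=8, f=1)            # max-log-MAP: f = 1
print(ops)                           # 10911744 operations per codeword
print(ops // 6144)                   # 1776 OP/bit
print(round(1776 * 150e6 / 1e9))     # 266 GOPS at 150 Mbit/s
```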

If the parallelism P increases, the decoding latency is reduced. In that case the increase in complexity is negligible, whereas the bit error rate performance may degrade due to the discontinuity of the recursion messages. More iterations can be applied to reduce the degradation, which results in a linear increase of the total operations.

3.4 CC Decoding Computational Complexity Analysis

The distance function dist(r,v) is originally implemented by the Euclidean distance \(dist(r,v) = {\sum }_{i} (r_{i}-v_{i})^{2}, i=0,\cdots ,m-1\). Because \({\sum }_{i} {r_{i}^{2}}\) and \({\sum }_{i} {v_{i}^{2}}\) are the same for all transition branches, these terms can be dropped; the remaining part is \(-2{\sum }_{i} r_{i}\cdot v_{i}\), and the constant −2 can be omitted without changing the relative values. In addition, \(v_{i}\) consists of the m local trellis transition bits with at most \(2^{m}\) combinations, and half of them can be obtained from the other half by negation (consuming \(2^{m-1}\) operations in total). Therefore the \(2^{m-1}\) possible metrics are calculated in advance. Each metric calculation requires m−1 additions/subtractions. Hence the dist(r,v) calculation for all branches in a trellis step consumes \((m-1)\cdot 2^{m-1}+2^{m-1}=m\cdot 2^{m-1}\) operations in total. If the pre-calculation is not applied, m multiplications (or selections) and m−1 additions are needed for each branch metric calculation.
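A sketch of this pre-calculation is shown below; the metric is the simplified inner product \({\sum }_{i} r_{i}v_{i}\) with \(v_{i}\in \{+1,-1\}\), which preserves the relative branch ordering.

```python
from itertools import product

def precompute_branch_metrics(r):
    """r: the m received values of one trellis step. Computes the 2^(m-1)
    inner products with v_0 fixed to +1 (m-1 additions each), then obtains
    the other half by negation."""
    m = len(r)
    half = {}
    for v_rest in product((1, -1), repeat=m - 1):
        v = (1,) + v_rest
        half[v] = sum(ri * vi for ri, vi in zip(r, v))
    full = dict(half)
    for v, metric in half.items():
        full[tuple(-vi for vi in v)] = -metric       # negated half
    return full
```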

From Tables 11 and 12, we can conclude that for each step there are \(P\cdot (1+m+m\cdot 2^{m-1}+(3+f)\cdot T_{s})\) operations in the ACS and 6.5P operations in the TB. For a codeword of length \(N=L\cdot P\), the decoding procedure contains \(L+L_{TB}\) ACS and TB steps, therefore the overall number of operations is

$$\begin{array}{@{}rcl@{}} OP_{CC}&=&(7.5+m+m\cdot 2^{m-1}+(3+f)\cdot T_{s})\cdot P \cdot (L+L_{TB})\\ &=&(7.5+m+m\cdot 2^{m-1}+4 T_{s}) \cdot P \cdot (L+L_{TB}) \end{array} $$
(3)
Table 11 Operations for CC forward path.
Table 12 Operations for CC backward path.

For a CC(2, 1, 7) code (m=2, \(T_{s}=64\)) with N=2048 decoded bits, traceback length \(L_{TB}=35\) and parallelism P=8, the total number of operations is 627,396, equivalent to 306 OP/bit. For a code with a larger constraint length such as CC(3, 1, 9) (m=3, \(T_{s}=256\)), with N=2048, \(L_{TB}=45\) and P=8, the total number of operations is 2,519,972, equivalent to 1230 OP/bit.
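The sketch below reproduces both numbers from Eq. 3 with f=1.

```python
def op_cc(m, Ts, N, L_tb, P, f=1):
    L = N // P                      # per-PE block length
    return round((7.5 + m + m * 2 ** (m - 1) + (3 + f) * Ts) * P * (L + L_tb))  # Eq. 3

print(op_cc(m=2, Ts=64, N=2048, L_tb=35, P=8))    # 627396 -> ~306 OP/bit
print(op_cc(m=3, Ts=256, N=2048, L_tb=45, P=8))   # 2519972 -> ~1230 OP/bit
```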

Because each PE processes an overlapped area for the traceback computation, which costs redundant computations, the parallelism P has an obvious impact on the computational complexity. For the previous CC(3,1,9) example, the parallel decoding needs 15 % more operations than the single PE (P=1) configuration.

4 Complexity Comparison

To reveal the relationship in computational complexity among these algorithms, the number of operations is shown in Fig. 3, wherein all the LDPC configurations in the IEEE 802.11n and IEEE 802.16e standards (R=1/2 to R=5/6), the 8-state Turbo code (3GPP-LTE Turbo, R=1/3) and CC are compared. Six iterations are applied for LDPC and Turbo decoding. Several conclusions can be drawn from the comparison: (1). For all modes, the decoding complexity is approximately proportional to the number of decoded bits; (2). For the same operation cost, LDPC may offer as much as 2-3 times the throughput of Turbo; (3). The complexity of CC(3, 1, 9) is slightly higher than that of LDPC with 6 iterations; (4). IEEE 802.11n and IEEE 802.16e LDPC have similar complexity; (5). CC with a small constraint length has the minimum complexity, whereas Turbo decoding consumes many more operations than the other coding types. In Fig. 4, the relationship between throughput and computing cost in GOPS is provided for several typical codes. A 1 Gbps Turbo decoder consumes approximately 2000 GOPS; with the other auxiliary workload it would be 2-3 times more, which is currently difficult to realize in a single chip. However, 1 Gbps LDPC requires less than 600 GOPS, which is easier to realize on SDR platforms. Fewer iterations linearly reduce the complexity, which can be achieved with early termination.

Figure 3: Computational complexity comparison among FEC algorithms with different decoded codeword lengths.

Figure 4: Evaluated total computing cost with respect to throughput.

The computation is composed of several kernel tasks, and the operations for each step are summarized in Table 13. CC(2, 1, 7), 3GPP-LTE Turbo and Z=96 QC-LDPC codes are shown as examples. It can be concluded that for Turbo and CC, arithmetic computation occupies approximately 75 % of the workload, whereas memory access consumes approximately 25 %. For LDPC, 42 % of the workload belongs to memory access, and only 58 % to arithmetic computation. One reason is that LDPC outputs Z a-posteriori messages per recursion step, whereas Viterbi or (binary) Turbo outputs only one bit. Considering that LDPC layered decoding consumes approximately 1/3 of the total computation of Turbo BCJR, layered decoding requires only about 25 % of the message update computations per bit compared with BCJR (with the same number of iterations). The Viterbi algorithm consumes most of its computation on the ACS because \(T_{s}\) states need to be processed one by one. For an even larger \(T_{s}\), such as in CC(2, 1, 9), the percentage of computation for the ACS is even larger.

Table 13 Percentage of component computations.

Table 13 also reveals that several kernel functions consume a large portion of the computing resources, such as the f(x,y) function in LDPC layered decoding, the LLR calculation in Turbo decoding, and the ACS kernel in the Viterbi algorithm. The number of operations would be reduced dramatically if these kernels were accelerated by hardware circuits and operation-fusion instructions.

5 Specific Platform Design with the Proposed Evaluations

For highly parallel platforms, the platform related overhead needs to be taken into account, which includes the inter-core communication, the core workload balancing, and the synchronization between cores. The decoding algorithms also require conflict-free or conflict-minimal memory access. It is advisable to construct a many-core platform based on ASIC FEC implementations. In ASIC (Application Specific Integrated Circuit) or ASIP (Application Specific Instruction-set Processor) implementations of the proposed pseudo-codes [19, 24-26], the inter-core data passing is operation-free, the core tasks are well balanced and synchronization between cores is not necessary. The memory banks are small on-chip scratch-pad modules, therefore no cache is needed. All memory access conflicts can be avoided and are therefore not considered. These optimizations for ASIC decoders can be applied to the design of an SD FEC platform.

We also constructed a tri-mode unified ASIP decoder [27] following these pseudo-codes (additional sliding windows are added to reduce the buffer size). Its interconnection network and memory subsystem can be borrowed to construct a fully programmable platform, the difference being that processor cores substitute the arithmetic circuits of the ASIP prototype. A more flexible inter-core network with more redundancy can be introduced without performance degradation if the network and memory structure of the ASIP prototype are retained.

6 Comparison of the Evaluated Results with General Software Defined Decoding Platforms

Apart from the theoretical complexity results and the supporting hardware platforms which can reach these lower bounds, FEC benchmarks are provided to reveal the attainable complexity on feasible processors. Currently, high throughput decoding mainly relies on General Purpose GPU (GPGPU) platforms because of their highly parallel architecture. For such platforms, the peak floating point performance (GFLOPS) is available from the manufacturers. An alternative choice is general DSP platforms with a Very Long Instruction Word (VLIW) architecture, where several instructions can be processed simultaneously, such as the Texas Instruments (TI) TMS320Cx series DSPs. The peak MIPS can be derived from the device datasheets. Apart from the peak performance, the decoding throughput, iteration numbers and code length are available in the reference papers, therefore the operations per bit per iteration can be obtained as \(OP/N/it_{max}\). Meanwhile, the evaluated results are shown with OP given by Eqs. 1, 2 and 3.

Figure 5 shows software defined (SD) Turbo decoding on general purpose platforms. The proposed guideline is approximately 312 OP/bit per iteration for all sizes of decoded bits. Most of the reference approaches run on GPU platforms. Wolf et al. [28] proposed a Design Space Exploration method for SD Turbo, and four platforms are tested. With a codeword size of 5000, the complexity is around 10 KOP/bit to 100 KOP/bit. The other proposals target the N=6144 3GPP-LTE Turbo code and reveal a complexity of 7 KOP/bit to 45 KOP/bit. Proposal [6] reported that the efficiency can be further improved with multi-codeword parallel decoding, which can fully utilize the GPU resources; the efficiency improves from 9 KOP/bit to 1.8 KOP/bit. It also introduces a high degree of parallelism to make full use of the core resources, and finally a throughput of 122 Mb/s is achieved. Apart from that, a TMS320C6201 DSP approach [8] reveals that approximately 6400 OP/bit is required for the HSPA N=5114 code, which is similar to the GPU approaches. For this TMS320C6201 DSP platform and the following TMS320C64x platform, the peak operations are evaluated as eight times the peak Million Instructions Per Second (MIPS) because the processor has eight processing units. The actual workload derived from these implementations is higher than the proposed results for several reasons. (1). An operation defined in Table 3 may be mapped to several processor operations (instructions). (2). In our evaluation, extra tasks such as data management, memory conflict management, control overhead, and thread synchronization are not taken into account. (3). The GPU device peak GFLOPS is assumed to be fully available for the decoding procedure; however, making full use of all the computation resources on a chip is unrealistic. Nevertheless, the reference designs reveal the complexity attainable today. It is also hoped that the decoding efficiency can be improved further with alternative hardware architectures.

Figure 5: Computational complexity of SD Turbo implementations. The solution references are M. Wolf [28], D. Yoge [29], M. Wu [30], Y. Song [8], X. Jiao [6], and G. Wang [7].

Depending on the code rate and the sparsity of the base matrix, the LDPC decoding complexity varies among configurations. However, most configurations fall within a 'band', shown in Fig. 6. The complexity ranges from 88 OP/bit to 162 OP/bit for all the configurations in IEEE 802.11n and IEEE 802.16e. Most of the referenced solutions achieve a complexity of 500 OP/bit to 2300 OP/bit. Among them, Wen et al. [31] proposed a min-sum layered decoder reaching up to 507 Mbit/s (2 iterations) with early termination, the highest throughput among all SD LDPC decoders; it has a complexity of 1062 OP/bit. G. Wang et al. proposed a 304 Mbps solution (10 iterations, 50 codewords on a GTX TITAN GPU) which reaches the lowest computational complexity (493 OP/bit). K.K. Abburi et al. [2] proposed another highly efficient solution with a complexity of 881 OP/bit. The TI TMS320C64x DSP solution [5] shows a complexity similar to the GPU solutions.

For Viterbi decoding, benchmarks for DSP, ARM and Intel processors are summarized. For the TMS320C62x series DSP, the number of instructions for GSM CC(2, 1, 5) decoding is given by \((38\cdot N+12+N/4)/N\) [36]. For large N, approximately 38 instructions per bit are required, which is equivalent to 306 operations/bit due to the parallel architecture with eight processing units. The SPIRAL Viterbi decoding code [38] is applied to evaluate the overall decoding complexity on an ARM processor (ARM Cortex A7) and an Intel processor (Intel Core i7-2600). The code is compiled with GNU GCC at the '-O3' optimization level. For the Core i7 implementation, only a single core is utilized. The evaluation results are shown in Table 14, wherein the peak MIPS is derived from [39]. Because 16-way SSE (Streaming SIMD Extensions) vector instructions can perform up to 16 calculations per instruction, the number of instructions is lower than the proposed guideline. This reveals that VLIW DSPs are efficient for Viterbi processing. For high throughput decoding on Intel processors, a much higher decoding throughput is achieved by enabling SSE acceleration [9]. The complexity of the reference implementations is approximately 2-3 times higher than the proposed complexity guideline.

Figure 6: Computational complexity of SD LDPC implementations. The reference implementations are G. Falcao [4], G. Wang (GTX 470) [32, 33], K.K. Abburi [2], S. Kang [34], G. Falcao (K=4000) [35], G. Wang (GTX TITAN) [3], X. Wen [31], and TMS320C64x [5].

Table 14 Viterbi computational complexity benchmarks.

7 Conclusion

In this work, complexity evaluations for LDPC layered decoding, Turbo BCJR decoding and CC Viterbi decoding are provided. Closed-form expressions for these coding types are offered with a variety of configurable parameters. The complexity of these algorithms is compared using the configurations of wireless standards. Reference implementations are compared with the proposed results, which shows that current SDR platforms still have room for achieving higher decoding efficiency. The proposed pseudo-codes, parallel schemes and operation results may support the architecture selection and software design of future software defined FEC platforms.