Memory Access Optimized Implementation of Cyclic and Quasi-Cyclic LDPC Codes on a GPGPU

Ji, Hyunwoo; Cho, Junho; Sung, Wonyong

doi:10.1007/s11265-010-0547-9

Published: 03 November 2010

Journal of Signal Processing Systems, volume 64, pages 149–159 (2011)

Hyunwoo Ji, Junho Cho & Wonyong Sung

Abstract

Software-based decoding of low-density parity-check (LDPC) codes often takes a very long time, so general-purpose graphics processing units (GPGPUs), which support massively parallel processing, can be very useful for speeding up simulation. In LDPC decoding, the parity-check matrix H must be accessed at every node update, and the matrix is often larger than the GPU's on-chip memory, especially when the code length is long or the weight is high. In this work, the parity-check matrix of cyclic or quasi-cyclic (QC) LDPC codes is greatly compressed by exploiting the periodic structure of the matrix. In addition, vacant elements are eliminated from the sparse message arrays so that accesses to global memory can be coalesced, as supported by GPGPUs. Regular projective geometry (PG) codes and irregular QC LDPC codes are decoded with the sum-product algorithm on an NVIDIA GTX-285 graphics processing unit (GPU), and considerable speed-ups are obtained.
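
The compression idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their index calculation is described in the full text, which is not reproduced here). It assumes the common QC-LDPC convention in which H is built from Z × Z circulant blocks, each described by a single cyclic shift, with -1 marking an all-zero block. The base matrix `shifts`, the block size `Z`, and the helper names `expand_qc` and `check_neighbors` are hypothetical and chosen only for illustration.

```python
import numpy as np

def expand_qc(shifts, Z):
    """Expand a base matrix of circulant shifts into the full binary H.

    shifts[i, j] == -1 denotes an all-zero Z x Z block; any other value s
    denotes the Z x Z identity matrix cyclically shifted right by s columns.
    """
    mb, nb = shifts.shape
    H = np.zeros((mb * Z, nb * Z), dtype=np.uint8)
    rows = np.arange(Z)
    for i in range(mb):
        for j in range(nb):
            s = int(shifts[i, j])
            if s >= 0:
                H[i * Z + rows, j * Z + (rows + s) % Z] = 1
    return H

def check_neighbors(shifts, Z, row):
    """Column indices of the ones in one row of H, from the shifts alone."""
    i, r = divmod(row, Z)  # block row and offset inside the block
    return [j * Z + (r + int(s)) % Z
            for j, s in enumerate(shifts[i]) if s >= 0]

# Hypothetical toy base matrix (not taken from any standard), block size 4.
shifts = np.array([[0, 1, -1, 2],
                   [3, -1, 0, 1]])
Z = 4
H = expand_qc(shifts, Z)

# The compressed form stores mb * nb shifts instead of (mb * Z) * (nb * Z) bits.
print(shifts.size, "shift values describe", H.size, "matrix entries")
print(check_neighbors(shifts, Z, 5))
```

In this sketch a decoder thread would call something like `check_neighbors` to locate its messages on the fly, so only the small shift table has to reside in fast memory. Because each row of a block row has the same number of ones, the messages can also be stored in a dense array with one slot per edge rather than one slot per matrix entry, which is the kind of layout that makes coalesced access possible.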

Notes

Segment size is 32, 64, and 128 bytes for 8-bit, 16-bit, and 32-, 64- and 128-bit data, respectively.
The compute capability of a device is defined by a major and a minor revision number. Devices with the same major revision number share the same core architecture. The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features. The compute capability of the GTX-200 series is 1.3.
Block dimension is the number of threads that constitute one thread block.
The maximum number of threads per thread block is 512.
The index calculation is described in Section 3.3 in detail.

References

Gallager, R. G. (1963). Low density parity check codes. Cambridge: MIT.
The Digital Video Broadcasting Standard [Online]. Available: www.dvb.org
The IEEE 802.16 Working Group [Online]. Available: http://www.ieee802.org/16/
The IEEE 802.11n Working Group [Online]. Available: http://www.ieee802.org/11/
Falcão, G., Silva, V., & Sousa, L. (2009). How GPUs can outperform ASICs for fast LDPC decoding. In Proc. of the 23rd International Conference on Supercomputing, New York, USA, pp. 390–399.
Falcão, G., Yamagiwa, S., Silva, V., & Sousa, L. (2009). Parallel LDPC decoding on GPUs using a stream-based computing approach. Journal of Computer Science and Technology, 24, 913–924.
Tanner, R. M. (1981). A recursive approach to low complexity codes. IEEE Transactions on Information Theory, IT-27, 533–547.
Kou, Y., Lin, S., & Fossorier, M. (2001). Low density parity check codes based on finite geometries: a rediscovery and more. IEEE Transactions on Information Theory, 47, 2711–2736.
MacKay, D. J. C. (1999). Good error-correcting codes based on very sparse matrices. IEEE Transactions on Information Theory, 45, 399–431.
Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M., & Hu, X. Y. (2002). Near optimal reduced-complexity decoding algorithms for LDPC codes. In Proc. IEEE Int. Symp. Information Theory, Lausanne, Switzerland, p. 455.
The CUDA Programming Guide [Online]. Available: http://developer.NVIDIA.com/object/cuda.html
Bell, N., & Garland, M. (2008). Efficient Sparse Matrix-Vector Multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation.
Im, E. (2000). Optimizing the performance of sparse matrix-vector multiplication. Technical Report, UMI Order Number: CSD-00-1104, University of California at Berkeley.

Acknowledgements

This work was supported in part by the National Research Foundation (NRF) grant funded by the Korea government (MEST) (No. 20090075770 and No. 20090084804) and in part by the MEST under the Brain Korea 21 Project.

Author information

Authors and Affiliations

School of Electrical Engineering, Seoul National University, Seoul, South Korea
Hyunwoo Ji, Junho Cho & Wonyong Sung

Corresponding author

Correspondence to Hyunwoo Ji.

Additional information

This work is an improved version of the paper “Massively parallel implementation of cyclic LDPC codes on a general purpose graphic processing unit,” which was presented at the IEEE Workshop on Signal Processing Systems (SiPS) held in Tampere, Finland, in 2009. Implementation results for standardized irregular QC LDPC codes for Wi-Fi and WiMAX have been added, and a two-dimensional message array compression technique is included.

About this article

Cite this article

Ji, H., Cho, J. & Sung, W. Memory Access Optimized Implementation of Cyclic and Quasi-Cyclic LDPC Codes on a GPGPU. J Sign Process Syst 64, 149–159 (2011). https://doi.org/10.1007/s11265-010-0547-9

Received: 17 January 2010
Revised: 20 September 2010
Accepted: 21 September 2010
Published: 03 November 2010
Issue Date: July 2011
DOI: https://doi.org/10.1007/s11265-010-0547-9
