Exploring the parallel capabilities of GPU: Berlekamp-Massey algorithm case study

  • Hanan Ali
  • Ghada M. Fathy
  • Zeinab Fayez
  • Walaa Sheta


Graphics processing unit (GPU) architectures are becoming increasingly programmable, offering the potential for dramatic speedups over contemporary general-purpose processors (CPUs) for a variety of general-purpose applications. Realizing that potential, however, depends on applying several optimization techniques that maximize the use of GPU resources. This research applies optimization techniques for CUDA-enabled GPU architectures to achieve the best possible performance for the Berlekamp-Massey algorithm (BMA) as a case study. BMA is one of the best solutions for finding the shortest linear feedback shift register that generates a given sequence, a problem that is important in applications such as digital processing and cryptography. The experimental results show that the optimized BMA implementation is almost 160× faster than the non-bit serial CPU implementation, 7× faster than the bit serial implementation, and 4× faster than an initial parallel bit implementation.
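To make the problem concrete, the following is a minimal serial sketch of the textbook Berlekamp-Massey algorithm over GF(2), written in Python. It is not the optimized CUDA implementation evaluated in this work. The function name `berlekamp_massey_gf2` and the list-of-bits representation are assumptions chosen for illustration only.

```python
def berlekamp_massey_gf2(bits):
    """Return (L, c) for a binary sequence.

    L is the length of the shortest LFSR that generates `bits`, and
    c = [1, c1, ..., cL] is its connection polynomial, so that
    bits[i] = c1*bits[i-1] ^ ... ^ cL*bits[i-L] for every i >= L.
    Illustrative textbook version; the name and interface are hypothetical.
    """
    n = len(bits)
    c = [0] * (n + 1)   # current connection polynomial
    b = [0] * (n + 1)   # polynomial saved at the last length change
    c[0] = b[0] = 1
    L = 0               # current LFSR length (linear complexity so far)
    m = -1              # index at which the last length change happened

    for i in range(n):
        # Discrepancy: does the current LFSR predict bits[i] correctly?
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]
        if d == 0:
            continue    # prediction was right, nothing to update

        t = c[:]        # keep a copy in case the length changes
        shift = i - m
        # c(x) <- c(x) + x^shift * b(x), arithmetic over GF(2)
        for j in range(n + 1 - shift):
            c[j + shift] ^= b[j]
        if 2 * L <= i:
            L = i + 1 - L
            m = i
            b = t

    return L, c[:L + 1]


# Sequence produced by the recurrence s[i] = s[i-2] ^ s[i-3], seed 1, 0, 0.
seq = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
print(berlekamp_massey_gf2(seq))  # → (3, [1, 0, 1, 1])
```

The inner discrepancy loop and the polynomial update are the parts that operate on whole vectors of bits. In this sketch each bit occupies one list element, which is only the simplest way to show the recurrence.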


Linear complexity, Berlekamp-Massey algorithm, Parallel computing, GPU, CUDA optimization techniques




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Informatic Research Institute, City for Scientific Research, Alexandria, Egypt
