
cuGimli: optimized implementation of the Gimli authenticated encryption and hash function on GPU for IoT applications


The U.S. National Institute of Standards and Technology (NIST) recently initiated a global competition to standardize lightweight authenticated encryption with associated data (AEAD) and hash functions. Gimli is one of the Round 2 candidates, designed to be implemented efficiently across various platforms, including hardware (VLSI and FPGA), microprocessors, and microcontrollers. However, the performance of Gimli on massively parallel architectures such as graphics processing units (GPUs) is still unknown. A high-performance Gimli implementation on GPUs can be especially useful for Internet of Things (IoT) applications, in which gateway devices and cloud servers need to handle a massive number of communications protected by AEAD. In this paper, we show that, with careful optimization, Gimli can be implemented efficiently on desktop and embedded GPUs to achieve extremely high throughput. Our experiments show that the proposed Gimli implementation can achieve 661.44 KB/s (encryption), 892.24 KB/s (decryption), and 4344.46 KB/s (hashing) on state-of-the-art GPUs.



Data availability

This paper uses the code from NIST Lightweight Cryptography Standardization as a starting point to develop the optimized implementation on GPU. The code for Gimli hash and authenticated encryption can be found here:




This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2019H1D3A1A01102607, 2020R1A2B5B01002145, 2021R1A6A3A13038773).



Author information




All authors contributed equally to the final dissemination of the research investigation as a full article. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Seong Oun Hwang.

Ethics declarations

Conflicts of interest

There is no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Han, K., Lee, WK. & Hwang, S.O. cuGimli: optimized implementation of the Gimli authenticated encryption and hash function on GPU for IoT applications. Cluster Comput (2021).



  • CUDA
  • GPU
  • Gimli
  • Authenticated encryption