Parallel bitsliced AES through PHAST: a single-source high-performance library for multi-cores and GPUs


PHAST is a high-level, heterogeneous, STL-like C++ library that can target multi-core processors and Nvidia GPUs. It lets programmers exploit the performance of modern parallel architectures without the complexity of parallel programming. The library manages the programming and the critical fine-tuning of parallel execution on both platforms without interfering with the application code structure, while preserving the possibility of using architecture-specific features and instructions. In cryptography, the performance and architectural efficiency of software implementations are crucial, as witnessed by the extensive research into highly optimized and specialized versions of many protocols. In this paper, we assess the performance overhead and the productivity improvement achievable through the PHAST library. We implement a pseudorandom number generator (PRNG) based on cache-timing-attack-resistant AES and compare it with the fastest implementations in both the CPU and Nvidia GPU domains. The results show that the PHAST code is shorter and simpler than the state-of-the-art implementations: its source length is 59.59% of the reference CUDA C implementation and 88.18% of the sequential C++ version for CPUs, even though the same PHAST source serves both targets. It is also far less complex in terms of McCabe’s and Halstead’s metrics. These productivity improvements come at a limited performance overhead from the library layer: less than 5% on single-thread execution for CPUs and around 10% on Nvidia GPUs. Furthermore, the performance of the PHAST PRNG scales automatically and nearly linearly with the number of available cores, allowing programmers to fully exploit multi-core resources with the same source code.
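
To make the "bitsliced" idea in the title concrete, the following is a minimal C++ sketch of the general technique; it is not the paper's code and does not use the PHAST API. It assumes 64 independent bytes packed into eight 64-bit words, where word `i` holds bit `i` of every byte. All names (`Slices`, `pack`, `unpack`, `xtime_bitsliced`) are hypothetical. The operation shown is the AES field doubling ("xtime", multiplication by 2 in GF(2^8) with reduction constant 0x1B), which in bitsliced form needs only word-wide XORs and no data-dependent table lookups or branches, the property that makes bitsliced AES resistant to cache-timing attacks.

```cpp
#include <array>
#include <cstdint>

// Eight 64-bit words: word i holds bit i of each of 64 independent bytes.
using Slices = std::array<std::uint64_t, 8>;

// Scalar reference: multiply one byte by 2 in the AES field GF(2^8).
inline std::uint8_t xtime_scalar(std::uint8_t a) {
    return static_cast<std::uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

// Transpose 64 bytes into bitsliced form (lane j of every word = byte j).
inline Slices pack(const std::array<std::uint8_t, 64>& bytes) {
    Slices s{};
    for (int lane = 0; lane < 64; ++lane)
        for (int bit = 0; bit < 8; ++bit)
            s[bit] |= static_cast<std::uint64_t>((bytes[lane] >> bit) & 1u) << lane;
    return s;
}

// Inverse transposition: bitsliced form back to 64 ordinary bytes.
inline std::array<std::uint8_t, 64> unpack(const Slices& s) {
    std::array<std::uint8_t, 64> bytes{};
    for (int lane = 0; lane < 64; ++lane)
        for (int bit = 0; bit < 8; ++bit)
            bytes[lane] |= static_cast<std::uint8_t>(((s[bit] >> lane) & 1u) << bit);
    return bytes;
}

// Bitsliced xtime: shift every byte left by one bit position and fold the
// old top bit (b[7]) into positions 0, 1, 3 and 4 (the set bits of 0x1B).
// All 64 bytes are processed at once with three word-wide XORs.
inline Slices xtime_bitsliced(const Slices& b) {
    Slices r{};
    r[0] = b[7];
    r[1] = b[0] ^ b[7];
    r[2] = b[1];
    r[3] = b[2] ^ b[7];
    r[4] = b[3] ^ b[7];
    r[5] = b[4];
    r[6] = b[5];
    r[7] = b[6];
    return r;
}
```

Calling `unpack(xtime_bitsliced(pack(in)))` on 64 input bytes should give the same result as applying `xtime_scalar` to each byte individually. A full bitsliced AES applies the same principle to the S-box and the other round steps, replacing lookups with Boolean circuits.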


  1. The exact number of these components depends on the generation and model of the graphics card.

  2. Within the global memory of the video card.



We would like to thank Rone Kwei Lim for sharing with us the source code of his CUDA AES-based PRNG, which constituted a valuable reference for the experimental work described in this paper.

Author information



Corresponding author

Correspondence to Biagio Peccerillo.

Cite this article

Peccerillo, B., Bartolini, S. & Koç, Ç.K. Parallel bitsliced AES through PHAST: a single-source high-performance library for multi-cores and GPUs. J Cryptogr Eng 9, 159–171 (2019).


  • Heterogeneous programming
  • Multi-cores
  • GPUs