
Evaluating the performance of FFT library implementations on modern hybrid computing systems


The fast Fourier transform (FFT) is widely used to solve scientific and engineering problems. In particular, it underlies software for speech and image recognition, signal analysis, and the modeling of properties of new materials and substances. Newly emerging high-performance hybrid computing systems, as well as systems with alternative architectures, call for research into how efficiently the discrete Fourier transform can be computed on these platforms. The results of such research make it possible to assess the feasibility of particular solutions for building modern computing and data processing centers. This paper presents the results of such research for modern hybrid computing systems based on IBM POWER and Intel Xeon processors and on NVIDIA Tesla co-processors. Their performance when executing fast Fourier transforms is analyzed, and conclusions are presented. The impact of architectural features of the hardware (CPU simultaneous multithreading mode, GPU data transfer bus, etc.) on transform performance is assessed. The results are used to provide recommendations on the optimal operation modes and settings of the considered mathematical libraries.
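As a rough illustration of how FFT performance of the kind discussed above is commonly quantified, the sketch below times a plain radix-2 Cooley-Tukey transform [2] and converts the elapsed time into a nominal rate using the conventional 5 N log2 N operation count for a complex transform of length N. It is a minimal sketch only: the function names (`fft_radix2`, `nominal_gflops`, `time_fft`), the choice of metric, and the pure-Python implementation are assumptions made for illustration, not the benchmark procedure or the libraries evaluated in the paper.

```python
import cmath
import math
import time


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Twiddle factor exp(-2*pi*i*k/n) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def nominal_gflops(n, seconds):
    """Nominal rate from the conventional 5*N*log2(N) operation count."""
    return 5.0 * n * math.log2(n) / seconds / 1e9


def time_fft(n, repeats=3):
    """Best-of-`repeats` wall-clock time (seconds) for one length-n FFT."""
    signal = [complex(math.sin(0.1 * i), 0.0) for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fft_radix2(signal)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        elapsed = time_fft(n)
        print(f"N={n:5d}  time={elapsed:.6f} s  "
              f"nominal rate={nominal_gflops(n, elapsed):.6f} GFLOP/s")
```

The nominal rate is a normalization that makes timings for different transform lengths comparable; it does not count the operations a particular library actually executes. An optimized library would replace `fft_radix2` in such a harness, and the absolute numbers from this pure-Python version say nothing about the platforms studied.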




References

  1. Brodtkorb AR, Dyken C, Hagen TR, Hjelmervik JM, Storaasli OO (2010) State-of-the-art in heterogeneous computing. Sci Progr 18(1):1–33

  2. Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier series. Math Comput 19(90):297–301

  3. Stanković D, Jovanović P, Jović A, Slavnić V, Vudragović D, Balaž A (2014) Implementation and Benchmarking of New FFT Libraries in Quantum ESPRESSO. In: Dulea M, Karaivanova A, Oulas A, Liabotis I, Stojiljkovic D, Prnjat O (eds) High-Performance Computing Infrastructure for South East Europe’s Research Communities, Modeling and Optimization in Science and Technologies, vol 2. Springer, Cham

  4. Wende F, Marsman M, Steinke T (2016) On Enhancing 3D-FFT Performance in VASP. In: CUG proceedings

  5. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Dagum L, Fatoohi RA, Frederickson PO, Lasinski TA, Schreiber RS, Simon HD, Venkatakrishnan V, Weeratunga SK (1991) The NAS parallel benchmarks. Int J Supercomput Appl 5(3):63–73

  6. Luszczek P, Dongarra J, Koester D, Rabenseifner R, Lucas B, Kepner J, McCalpin J, Bailey D, Takahashi D (2005) Introduction to the HPC Challenge Benchmark Suite. Lawrence Berkeley National Laboratory. Paper LBNL-57493, 12p

  7. Park Y-S, Park K-R, Kim J-M, Jeong H-Y (2017) Fast Fourier transform benchmark on X86 Xeon system for multimedia data processing. Multimed Tools Appl 76(4):6015–6030

  8. Jodra JL, Gurrutxaga I, Muguerza J (2015) A study of memory consumption and execution performance of the cuFFT library. In: 2015 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC). IEEE, pp 323–327

  9. Střelák D, Filipovič J (2018) Performance analysis and autotuning setup of the cuFFT library. In: Proceedings of the 2nd Workshop on Autotuning and Adaptivity Approaches for Energy Efficient HPC Systems—ANDARE ’18. ACM Press, New York, pp 1–6

  10. Govindaraju NK, Lloyd B, Dotsenko Y, Smith B, Manferdelli J (2008) High performance discrete Fourier transforms on graphics processors. In: 2008 SC—International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, pp 1–12

  11. Smagin SI, Sorokin AA, Malkovsky SI, Korolev SP, Lukyanova OA, Nikitin OY, Kondrashev VA, Chernykh VY (2019) The organization of effective multi-user operation of hybrid computing systems. Comput Technol 5(24):49–60

  12. Mal’kovskii SI, Sorokin AA, Korolev SP, Zatsarinnyi AA, Tsoi GI (2019) Performance evaluation of a hybrid computer cluster built on IBM POWER8 microprocessors. Progr Comput Softw 45:324–332

  13. Sorokin A, Malkovsky S, Tsoy G, Zatsarinnyy A, Volovich K (2020) Comparative performance evaluation of modern heterogeneous high-performance computing systems CPUs. Electronics 9(6):1035

  14. ESSL Guide and Reference, IBM (2019). Accessed 17 Aug 2020

  15. Frigo M, Johnson SG (2005) The design and implementation of FFTW3. Proc IEEE 93(2):216–231

  16. Sinharoy B, Van Norstrand JA, Eickemeyer RJ, Le HQ, Leenstra J, Nguyen DQ, Konigsburg B, Ward K, Brown MD, Moreira JE, Levitan D, Tung S, Hrusecky D, Bishop JW, Gschwind M, Boersma M, Kroener M, Kaltenbach M, Karkhanis T, Fernsler KM (2015) IBM POWER8 processor core microarchitecture. IBM J Res Dev 59(1):2:1–2:21

  17. Sadasivam SK, Thompto BW, Kalla R, Starke WJ (2017) IBM Power9 processor architecture. IEEE Micro 37:40–51

  18. NVIDIA: CUDA Toolkit documentation: cuFFT (2019). Accessed 01 Aug 2020

  19. Foley D, Danskin J (2017) Ultra-Performance Pascal GPU and NVLink Interconnect. IEEE Micro 37(2):7–17

  20. Choquette J, Giroux O, Foley D (2018) Volta: performance and programmability. IEEE Micro 38(2):42–52

  21. Mulnix D (2017) Intel Xeon processor scalable family technical overview. Accessed 01 Aug 2020

  22. Wang E et al (2014) Intel math kernel library. In: High-performance computing on the Intel® Xeon Phi™. Springer, Cham, pp 167–188

  23. Eggers SJ, Emer JS, Levy HM, Lo JL, Stamm RL, Tullsen DM (1997) Simultaneous multithreading: a platform for next-generation processors. IEEE Micro 17(5):12–19

  24. Starke WJ, Stuecheli J, Daly DM, Dodson JS, Auernhammer F, Sagmeister PM, Guthrie GL, Marino CF, Siegel M, Blaner B (2015) The cache and memory subsystems of the IBM POWER8 processor. IBM J Res Dev 59(1):3:1–3:13

  25. Starke WJ, Dodson JS, Stuecheli J, Retter E, Michael BW, Powell SJ, Marcella JA (2018) IBM POWER9 memory architectures for optimized systems. IBM J Res Dev 62(4/5):3:1–3:13

  26. Steinbach P, Werner M (2017) Gearshifft—the FFT benchmark suite for heterogeneous platforms. In: Kunkel J, Yokota R, Balaji P, Keyes D (eds) High performance computing. ISC 2017. Lecture notes in computer science, vol 10266. Springer, Cham, pp 199–216

  27. Sorokin AA, Makogonov SV, Korolev SP (2017) The information infrastructure for collective scientific work in the far east of Russia. Sci Tech Inf Proc 44:302–304

  28. Informatics Core Facility Statute. Accessed 22 Jan 2020



This study used the computing resources and systems of the Shared Services Center “Data Center of FEB RAS” (Khabarovsk) [27] and the Informatics Center of the Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences (Moscow) [28].

Author information



Corresponding author

Correspondence to Sergey P. Korolev.

Additional information


This research was partly funded by the Russian Foundation for Basic Research (RFBR), project number 18-29-03196.


About this article


Cite this article

Malkovsky, S.I., Sorokin, A.A., Tsoy, G.I. et al. Evaluating the performance of FFT library implementations on modern hybrid computing systems. J Supercomput 77, 8326–8354 (2021).



  • Hybrid computing systems
  • Intel Xeon
  • NVIDIA Tesla
  • FFT
  • FFTW
  • cuFFT
  • cuFFTW
  • Intel MKL