The Journal of Supercomputing, Volume 72, Issue 2, pp 391–416

Hybrid and 4-D FFT implementations of an open-source parallel FFT package OpenFFT

  • Truong Vinh Truong Duy
  • Taisuke Ozaki


The fast Fourier transform (FFT) is a fundamental kernel in a wide variety of science and engineering fields, from electronic structure calculations to medical imaging. OpenFFT is an open-source package for parallel 3-D FFTs, built on a domain decomposition method that minimizes the volume of communication. However, OpenFFT version 1.1 supports neither hybrid calculations, which are needed to fully utilize common multi-core machines, nor parallel 4-D FFTs. In addition, several state-of-the-art open-source packages for parallel FFTs exist, and a performance comparison among them would be of interest and helpful to potential users. In this paper, we (1) extend the functionality of OpenFFT by developing a hybrid MPI/OpenMP version and a parallel 4-D FFT, and (2) compare the performance of the currently available parallel FFT packages. For the former, we first analyze the computational parts of OpenFFT to identify opportunities for fine-grained parallelization with OpenMP. Based on this analysis, we implement the two most promising hybrid options and compare them to choose between them. We then develop the parallel 4-D FFT by extending the hybrid implementation of the parallel 3-D FFT. The decomposition method for 4-D FFTs preserves the features of the original 3-D method to maximize data locality during transposition. For the latter, we evaluate and compare the performance of FFTE, P3DFFT, PFFT, 2DECOMP&FFT, and OpenFFT for both 3-D and 4-D FFTs on several different machines at various computational scales. The evaluation results demonstrate the benefit of the hybrid feature in improving the scalability of OpenFFT, and confirm in practice that its 4-D FFT achieves the minimal volume of communication. Also, although no significant difference in overall performance is observed in general, there are specific cases in which some packages have the edge over the others.
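Extending a 3-D FFT to four dimensions is natural because the multi-dimensional discrete Fourier transform is separable: it can be computed as a sequence of 1-D FFTs, one axis at a time. Transpose-based parallel FFT packages generally exploit this by redistributing data between passes so that each process holds complete lines along the axis being transformed. The minimal serial Python sketch below illustrates only the separability property on a small 4-D grid; it does not model OpenFFT's decomposition, its MPI/OpenMP parallelization, or any communication. The function names `fft1d`, `fft_nd`, and `dft_nd_direct` are hypothetical illustrations, not part of any package's API, and every grid extent is assumed to be a power of two.

```python
import cmath
from itertools import product


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_nd(data, shape):
    """N-D FFT of {index tuple: value} computed by 1-D FFTs along one axis at a time."""
    data = dict(data)
    for axis, n in enumerate(shape):
        # Each 1-D line along `axis` is labelled by the indices of the other axes.
        others = [range(s) for i, s in enumerate(shape) if i != axis]
        for rest in product(*others):
            line_idx = [rest[:axis] + (k,) + rest[axis:] for k in range(n)]
            line = fft1d([data[i] for i in line_idx])
            for i, v in zip(line_idx, line):
                data[i] = v
    return data


def dft_nd_direct(data, shape):
    """Reference O(N^2) evaluation of the N-D DFT, used only for checking."""
    grid = list(product(*(range(s) for s in shape)))
    out = {}
    for k in grid:
        acc = 0j
        for x in grid:
            phase = sum(ki * xi / s for ki, xi, s in zip(k, x, shape))
            acc += data[x] * cmath.exp(-2j * cmath.pi * phase)
        out[k] = acc
    return out


if __name__ == "__main__":
    shape = (2, 4, 2, 4)  # a small 4-D grid; every extent is a power of two
    data = {idx: complex(sum(idx), idx[0] - idx[3])
            for idx in product(*(range(s) for s in shape))}
    fast = fft_nd(data, shape)
    slow = dft_nd_direct(data, shape)
    print(max(abs(fast[k] - slow[k]) for k in fast) < 1e-9)  # True
```

In a distributed implementation, the loop over axes in `fft_nd` is where parallel packages differ: between consecutive passes the data must be transposed across processes, and the decomposition chosen determines how much data each such transpose has to move.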


Keywords: Fast Fourier transform (FFT) · Parallel FFTs · Hybrid MPI/OpenMP · 3-D and 4-D FFTs · Performance comparison



This work was supported by the Strategic Programs for Innovative Research (SPIRE), MEXT, the Computational Materials Science Initiative (CMSI), and Materials Design through Computics: Complex Correlation and Non-Equilibrium Dynamics, a Grant-in-Aid for Scientific Research on Innovative Areas, MEXT, Japan. The benchmark calculations were performed using the K computer at RIKEN, and the Cray XC30 and SGI InfiniBand machines at the Japan Advanced Institute of Science and Technology (JAIST).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Institute for Solid State Physics, The University of Tokyo, Kashiwa, Japan
