Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers

  • Andreas Bonelli
  • Franz Franchetti
  • Juergen Lorenz
  • Markus Püschel
  • Christoph W. Ueberhuber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4330)


This paper introduces a formal framework for automatically generating performance-optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system Spiral. DFT algorithms are represented as mathematical formulas in Spiral’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend Spiral to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. A novel feature of these methods is that the redistribution steps are merged with the communication steps of the algorithm, so they add no communication overhead of their own. Among the possible alternative algorithms, Spiral’s search mechanism determines the fastest for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by Spiral show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with FFTW-MPI 2.1.5.
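To make the idea of "DFT algorithms as mathematical formulas" concrete, the following is a minimal NumPy sketch of the well-known Cooley-Tukey factorization in Kronecker-product form, DFT_mn = (DFT_m ⊗ I_n) · T · (I_m ⊗ DFT_n) · L, where L is a stride permutation and T is a diagonal matrix of twiddle factors. It is not SPL and not code generated by Spiral; the function names are hypothetical and chosen for illustration only. In a distributed-memory setting, the permutation L is the kind of factor that would plausibly correspond to a communication step, while the I ⊗ DFT factor is block-local work.

```python
import numpy as np

def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def stride_permutation(m, n):
    """Permutation matrix that reads a length m*n input at stride m.

    The input element at index k1*m + k2 is moved to index k2*n + k1
    (0 <= k1 < n, 0 <= k2 < m), i.e. an n x m to m x n transposition.
    """
    size = m * n
    perm = np.zeros((size, size))
    for k1 in range(n):
        for k2 in range(m):
            perm[k2 * n + k1, k1 * m + k2] = 1.0
    return perm

def twiddle_matrix(m, n):
    """Diagonal matrix with entry w^(k2*j1) at index k2*n + j1, w = exp(-2*pi*i/(m*n))."""
    k2 = np.arange(m)
    j1 = np.arange(n)
    return np.diag(np.exp(-2j * np.pi * np.outer(k2, j1) / (m * n)).reshape(-1))

def cooley_tukey_matrix(m, n):
    """Assemble DFT_{mn} from the four factors of the Cooley-Tukey formula."""
    left = np.kron(dft_matrix(m), np.eye(n))    # DFT_m (x) I_n
    right = np.kron(np.eye(m), dft_matrix(n))   # I_m (x) DFT_n: n-point DFTs on m blocks
    return left @ twiddle_matrix(m, n) @ right @ stride_permutation(m, n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(32) + 1j * rng.standard_normal(32)
    y = cooley_tukey_matrix(4, 8) @ x
    print(np.allclose(y, np.fft.fft(x)))
```

Building dense matrices costs O(N²) and is only meant to check the algebra; a formula-based generator would instead translate each factor into loops and data movement rather than materialize it.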


Keywords (added by machine, not by the authors): Discrete Fourier Transform; Communication Step; Data Redistribution; Distributed Memory Computer; Discrete Fourier Transform Algorithm





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Andreas Bonelli (1)
  • Franz Franchetti (2)
  • Juergen Lorenz (1)
  • Markus Püschel (2)
  • Christoph W. Ueberhuber (1)

  1. Institute for Analysis and Scientific Computing, Vienna University of Technology, Wien, Austria
  2. Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
