High Performance FFT on SGI Altix 3700

  • Akira Nukada
  • Daisuke Takahashi
  • Reiji Suda
  • Akira Nishida
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4782)


We have developed a high-performance FFT on the SGI Altix 3700, improving the efficiency of the floating-point operations required to compute the FFT by using a kind of loop fusion technique. As a result, we achieved a performance of 4.94 Gflops for a 1-D FFT of length 4096 on a 1.3 GHz Itanium 2 (95% of peak), and a performance of 28 Gflops for a 2-D FFT of size 4096² with 32 processors. Our FFT kernel outperformed the other existing libraries.
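The abstract does not describe the fused kernel itself, so the following is only a generic illustration of what loop fusion means in an FFT setting, not the authors' implementation. It is a minimal Python sketch, assuming a recursive radix-2 decimation-in-time FFT; the function names (`combine_unfused`, `combine_fused`, `fft`, `dft`) are hypothetical and chosen for this example. The unfused variant makes two passes over the data and stores the twiddle products in a temporary array, while the fused variant does the twiddle multiplication and the butterfly in one pass.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used only as a correctness reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def combine_unfused(even, odd):
    """Two separate loops: twiddle multiply into a temporary, then butterfly."""
    half = len(even)
    n = 2 * half
    tmp = [0j] * half
    for k in range(half):  # pass 1: scale the odd half by the twiddle factors
        tmp[k] = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
    out = [0j] * n
    for k in range(half):  # pass 2: butterfly using the stored products
        out[k] = even[k] + tmp[k]
        out[k + half] = even[k] - tmp[k]
    return out


def combine_fused(even, odd):
    """One fused loop: the twiddle product never leaves a local variable."""
    half = len(even)
    n = 2 * half
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def fft(x, combine):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    if len(x) == 1:
        return list(x)
    return combine(fft(x[0::2], combine), fft(x[1::2], combine))


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(p - q) for p, q in zip(a, b))
```

Both variants perform the same arithmetic, so `fft(x, combine_fused)` and `fft(x, combine_unfused)` should agree with each other and with `dft(x)` up to rounding. In a compiled kernel, the benefit of the fused form is that the intermediate product can stay in a register instead of being written to and reloaded from memory.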


Nest Loop, Iteration Count, Cache Memory, Fast Fourier Transform Algorithm, Index Array
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Akira Nukada (1, 2)
  • Daisuke Takahashi (3, 1)
  • Reiji Suda (2, 1)
  • Akira Nishida (2, 1)

  1. Japan Science and Technology Agency, Saitama 332-0012, Japan
  2. The University of Tokyo, Tokyo 113-8685, Japan
  3. University of Tsukuba, Ibaraki 305-8573, Japan
