An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors

  • Daisuke Takahashi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4699)


In the present paper, an implementation of a parallel one-dimensional fast Fourier transform (FFT) using Streaming SIMD Extensions 3 (SSE3) instructions on dual-core processors is proposed. Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance. The performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported. We successfully achieved performance of approximately 2006 MFLOPS on a dual-core Intel Xeon PC (2.8 GHz, two CPUs, four cores) and approximately 3492 MFLOPS on a dual-core Intel Xeon 5150 PC (2.66 GHz, two CPUs, four cores) for a 220-point FFT.


Fast Fourier Transform Cache Size Fast Fourier Transform Algorithm Cache Blocking Twiddle Factor 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301 (1965)zbMATHCrossRefMathSciNetGoogle Scholar
  2. 2.
    Nadehara, K., Miyazaki, T., Kuroda, I.: Radix-4 FFT implementation using SIMD multimedia instructions. In: Proc. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1999), vol. 4, pp. 2131–2134 (1999)Google Scholar
  3. 3.
    Franchetti, F., Karner, H., Kral, S., Ueberhuber, C.W.: Architecture independent short vector FFTs. In: Proc. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), vol. 2, pp. 1109–1112 (2001)Google Scholar
  4. 4.
    Rodriguez, V.P.: A radix-2 FFT algorithm for modern single instruction multiple data (SIMD) architectures. In: Proc. 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 3, pp. 3220–3223 (2002)Google Scholar
  5. 5.
    Kral, S., Franchetti, F., Lorenz, J., Ueberhuber, C.W.: SIMD vectorization of straight line FFT code. In: Kosch, H., Böszörményi, L., Hellwagner, H. (eds.) Euro-Par 2003. LNCS, vol. 2790, pp. 251–260. Springer, Heidelberg (2003)Google Scholar
  6. 6.
    Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93, 216–231 (2005)CrossRefGoogle Scholar
  7. 7.
    Franchetti, F., Kral, S., Lorenz, J., Ueberhuber, C.W.: Efficient utilization of SIMD extensions. Proc. IEEE 93, 409–425 (2005)CrossRefGoogle Scholar
  8. 8.
    Intel Corporation: IA-32 Intel Architecture Software Developer’s Manual. Basic Architecture, vol. 1 (2006)Google Scholar
  9. 9.
    Intel Corporation: Intel C++ Compiler Documentation (2006)Google Scholar
  10. 10.
    Bailey, D.H.: FFTs in external or hierarchical memory. The Journal of Supercomputing 4, 23–35 (1990)CrossRefGoogle Scholar
  11. 11.
    Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia (1992)zbMATHGoogle Scholar
  12. 12.
    Takahashi, D.: A blocking algorithm for FFT on cache-based processors. In: Hertzberger, B., Hoekstra, A.G., Williams, R. (eds.) High-Performance Computing and Networking. LNCS, vol. 2110, pp. 551–554. Springer, Heidelberg (2001)Google Scholar
  13. 13.
    Swarztrauber, P.N.: FFT algorithms for vector computers. Parallel Computing 1, 45–63 (1984)zbMATHCrossRefGoogle Scholar
  14. 14.
    Frigo, M., Johnson, S.G.: FFTW,
  15. 15.
    Intel Corporation: Intel Math Kernel Library Reference Manual (2005)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Daisuke Takahashi
    • 1
  1. 1.Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573Japan

Personalised recommendations