Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW Position Paper

  • Richard Vuduc
  • James W. Demmel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1924)


Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse matrix-vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed that typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW’s inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe a number of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speed-ups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely used kernels.
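The two-step scheme named in the abstract (generate several implementations, then select one empirically) can be illustrated with a minimal sketch. This is not FFTW's code generator or planner: the two hand-written candidates and the names `dft_naive`, `fft_radix2`, `select_fastest`, and `CANDIDATES` are hypothetical, chosen only to show how timing measurements rather than a performance model pick the winner.

```python
import cmath
import time

def dft_naive(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def select_fastest(candidates, x, repeats=3):
    """Time each candidate on x; return (name, function) of the fastest one."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(x)
            # Keep the minimum over repeats to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, candidates[best_name]

# Hypothetical candidate set; a real tuning system would generate many variants.
CANDIDATES = {"naive": dft_naive, "radix2": fft_radix2}
signal = [complex(i % 7, 0) for i in range(64)]
chosen_name, chosen_fn = select_fastest(CANDIDATES, signal)
```

Which candidate wins depends on the machine and the input size, which is the point of selecting empirically; the sketch only guarantees that both candidates compute the same transform.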


Discrete Cosine Transform, Linear Network, Kernel Generator, Automatic Tune, Twiddle Factor
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Richard Vuduc (1)
  • James W. Demmel (2)
  1. Computer Science Division, University of California at Berkeley, Berkeley, CA, USA
  2. Computer Science Division and Dept. of Mathematics, University of California at Berkeley, Berkeley, CA, USA
