Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW (Position Paper)
Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse matrix-vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed, which typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW’s inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe a number of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speed-ups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely used kernels.
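The two-step pattern described above (generate several implementations, then select one by measurement) can be illustrated with a minimal sketch. This is not FFTW's code generator or planner; the function names (`dft_naive`, `dft_radix2`, `select_fastest`) and the two hand-written candidates are assumptions made purely for illustration, standing in for variants that a real tuning system would generate automatically.

```python
import cmath
import time


def dft_naive(x):
    """Candidate 1: direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_radix2(x):
    """Candidate 2: recursive radix-2 Cooley-Tukey FFT (power-of-two lengths only)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = dft_radix2(x[0::2])
    odd = dft_radix2(x[1::2])
    half = n // 2
    # Twiddle factors applied to the odd-indexed sub-transform.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    return [even[k] + tw[k] for k in range(half)] + [
        even[k] - tw[k] for k in range(half)
    ]


def time_once(func, x, repeats=3):
    """Return the best wall-clock time over a few runs of func(x)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(x)
        best = min(best, time.perf_counter() - start)
    return best


def select_fastest(candidates, x):
    """Step (2): time every candidate on this machine and keep the fastest."""
    timings = {name: time_once(f, x) for name, f in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings


# Step (1) is hard-coded here; a real system would generate these variants.
CANDIDATES = {"naive": dft_naive, "radix2": dft_radix2}

if __name__ == "__main__":
    signal = [complex(i % 7, 0) for i in range(256)]
    best, timings = select_fastest(CANDIDATES, signal)
    print("selected:", best)
    for name, seconds in timings.items():
        print(f"  {name}: {seconds:.6f} s")
```

The choice is made by timing on the target machine rather than by an analytical cost model, which is the property that makes such systems portable across hardware and compilers. Which candidate wins, and by how much, depends on the platform and the input size.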