Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW Position Paper
 Richard Vuduc,
 James W. Demmel
Abstract
Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse matrix-vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed which typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW’s inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe a number of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speedups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely used kernels.
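The two-step scheme named in the abstract (generate several implementations, then select one empirically) can be illustrated with a minimal Python sketch. This is not FFTW's code generator or planner: the dot-product kernel, the unrolling factors, and the helper names `generate_variant`, `compile_variant`, and `select_fastest` are all hypothetical choices made only for illustration.

```python
import timeit


def generate_variant(unroll):
    """Step 1: emit source text for a dot-product kernel unrolled `unroll` times."""
    body = "\n".join(
        f"        acc += x[i + {k}] * y[i + {k}]" for k in range(unroll)
    )
    return (
        f"def dot_unroll{unroll}(x, y):\n"
        f"    n = len(x)\n"
        f"    acc = 0.0\n"
        f"    limit = n - n % {unroll}\n"
        f"    for i in range(0, limit, {unroll}):\n"
        f"{body}\n"
        f"    for i in range(limit, n):\n"  # remainder loop for leftover elements
        f"        acc += x[i] * y[i]\n"
        f"    return acc\n"
    )


def compile_variant(unroll):
    """Turn the generated source into a callable function."""
    namespace = {}
    exec(generate_variant(unroll), namespace)
    return namespace[f"dot_unroll{unroll}"]


def select_fastest(candidates, x, y, repeats=5):
    """Step 2: time every candidate on this machine and keep the fastest one."""
    timings = {}
    for name, fn in candidates.items():
        timings[name] = min(
            timeit.repeat(lambda: fn(x, y), number=10, repeat=repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings


# Hypothetical search space: four unrolling factors for one problem size.
x = [float(i) for i in range(1000)]
y = [1.0] * 1000
candidates = {u: compile_variant(u) for u in (1, 2, 4, 8)}
best, timings = select_fastest(candidates, x, y)
print("selected unroll factor:", best)
```

Which variant wins depends on the machine and interpreter, which is the point of selecting empirically rather than by a fixed rule.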
 Title
 Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW Position Paper
 Book Title
 Semantics, Applications, and Implementation of Program Generation
 Book Subtitle
 International Workshop, SAIG 2000, Montreal, Canada, September 20, 2000, Proceedings
 Pages
 pp 190–211
 Copyright
 2000
 DOI
 10.1007/3-540-45350-4_14
 Print ISBN
 9783540410546
 Online ISBN
 9783540453505
 Series Title
 Lecture Notes in Computer Science
 Series Volume
 1924
 Series ISSN
 0302-9743
 Publisher
 Springer Berlin Heidelberg
 Copyright Holder
 Springer-Verlag Berlin Heidelberg
 Editors

 Walid Taha ^{(4)}
 Editor Affiliations

 4. Department of Computer Science Eklandagatan 86, Chalmers University of Technology
 Authors

 Richard Vuduc ^{(5)}
 James W. Demmel ^{(6)}
 Author Affiliations

 5. Computer Science Division, University of California at Berkeley, Berkeley, CA, 94720, USA
 6. Computer Science Division and Dept. of Mathematics, University of California at Berkeley, Berkeley, CA, 94720, USA