FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.
KeywordsFast fourier transforms Algorithm design and analysis Data layout DRAM Bandwidth Design space
The work was sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement no. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred.
- 1.CACTI 6.5. HP labs. http://www.hpl.hp.com/research/cacti/.
- 2.DDR3-1600 dram datasheet. MT41J256M4, Micron. http://www.micron.com/parts/dram/ddr3-sdram.
- 3.DDR3 sdram system-power calculator. Micron. http://www.micron.com/products/support/power-calc.
- 4.DesignWare library. Synopsys. http://www.synopsys.com/dw.
- 5.McPAT 1.0. HP labs. http://www.hpl.hp.com/research/mcpat/.
- 6.CUDA toolkit 5.0 performance report (2013). Nvidia. https://developer.nvidia.com/cuda-math-library.
- 7.Akin, B., Franchetti, F., & Hoe, J.C. (2014). FFTs with near-optimal memory access through block data layouts. In Proceedings of IEEE international conference on acoustics speech and signal processing (ICASSP).Google Scholar
- 8.Akin, B., Franchetti, F., & Hoe, J.C. (2014). Understanding the design space of dram-optimized hardware FFT accelerators. In IEEE 25th international conference on application-specific systems, architectures and processors, ASAP 2014, June 18-20, (pp. 248–255). Zurich.Google Scholar
- 9.Akin, B., Franchetti, F., & Hoe, J.C. (2015). Data reorganization in memory using 3d-stacked dram. In Proceedings of the 42nd international symposium on computer architecture (ISCA).Google Scholar
- 10.Akin, B., Hoe, J.C., & Franchetti, F. (2014). Hamlet: hardware accelerated memory layout transform within 3d-stacked DRAM. In IEEE high performance extreme computing conference, HPEC 2014, September 9-11 (pp. 1–6). Waltham.Google Scholar
- 11.Akin, B., Milder, P.A., Franchetti, F., & Hoe, J.C. (2012). Memory bandwidth efficient two-dimensional fast Fourier transform algorithm and implementation for large problem sizes. In Proceedings of the IEEE symposium on FCCM (pp. 188–191).Google Scholar
- 12.Chen, K., Li, S., Muralimanohar, N., Ahn, J.H., Brockman, J., & Jouppi, N. (2012). CACTI-3DD: architecture-level modeling for 3D die-stacked DRAM main memory. In Design, automation test in Europe (DATE) (pp. 33–38).Google Scholar
- 13.Chung, E.S., Milder, P.A., Hoe, J.C., & Mai, K. (2010). Single-chip heterogeneous computing: does the future include custom logic, FPGAs, and GPGPUs?. In Proceedings of the 43th IEEE/ACM international symposium on microarchitecture (MICRO).Google Scholar
- 16.Franchetti, F., & Püschel, M. (2011). Encyclopedia of Parallel Computing, chap, Fast fourier transform: Springer.Google Scholar
- 17.Frigo, M., & Johnson, S.G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, Special issue on Program Generation Optimization, and Platform Adaptation, 93(2), 216–231.Google Scholar
- 18.Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., & Manferdelli, J. (2008). High performance discrete Fourier transforms on graphics processors. In Proceedings of the ACM/IEEE conference on supercomputing (SC) (pp. 2:1–2:12).Google Scholar
- 19.Loh, G.H. (2008). 3D-stacked memory architectures for multi-core processors. In Proceedings of the 35th annual international symposium on computer architecture, (ISCA) (pp. 453–464).Google Scholar
- 20.Milder, P.A., Franchetti, F., Hoe, J.C., & Püschel, M. (2012). Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2).Google Scholar
- 21.Milder, P.A., Hoe, J.C., & Püschel, M. (2009). Automatic generation of streaming datapaths for arbitrary fixed permutations. In Design, automation and test in Europe (DATE) (pp. 1118–1123).Google Scholar
- 22.Pawlowski, J.T. (2011). Hybrid memory cube (HMC). In Hotchips.Google Scholar
- 23.Pedram, A., McCalpin, J., & Gerstlauer, A. (2013). Transforming a linear algebra core to an FFT accelerator. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 175–184).Google Scholar
- 24.Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., & Rizzolo, N. (2005). SPIRAL: code generation for DSP transforms. Proceedings of IEEE, Special Issue on Program Generation Optimization, and Adaptation, 93(2), 232–275.Google Scholar
- 26.Loan Van, C. (1992). Computational frameworks for the fast Fourier transform: SIAM.Google Scholar
- 29.Zhu, Q., Akin, B., Sumbul, H., Sadi, F., Hoe, J., Pileggi, L., & Franchetti, F. (2013). A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In 2013 IEEE international 3D systems integration conference (3DIC) (pp. 1–7).Google Scholar
- 30.Zhu, Q., Vaidyanathan, K., Shacham, O., Horowitz, M., Pileggi, L., & Franchetti, F. (2012). Design automation framework for application-specific logic-in-memory blocks. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 125–132).Google Scholar