FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

Akin, Berkin; Franchetti, Franz; Hoe, James C.

doi:10.1007/s11265-015-1018-0

FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

Published: 23 June 2015

Volume 85, pages 67–82, (2016)
Cite this article

Journal of Signal Processing Systems Aims and scope Submit manuscript

Berkin Akin¹,
Franz Franchetti¹ &
James C. Hoe¹

594 Accesses
8 Citations
Explore all metrics

Abstract

Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Survey on Pipelined FFT Hardware Architectures

Article Open access 06 July 2021

Recent progress in InGaZnO FETs for high-density 2T0C DRAM applications

Article 21 September 2023

On the impact of quantum computing technology on future developments in high-performance scientific computing

Article Open access 31 August 2017

Notes

Calculated as 5n log2(n)/t for DFT_n where t is the total runtime.

References

CACTI 6.5. HP labs. http://www.hpl.hp.com/research/cacti/.
DDR3-1600 dram datasheet. MT41J256M4, Micron. http://www.micron.com/parts/dram/ddr3-sdram.
DDR3 sdram system-power calculator. Micron. http://www.micron.com/products/support/power-calc.
DesignWare library. Synopsys. http://www.synopsys.com/dw.
McPAT 1.0. HP labs. http://www.hpl.hp.com/research/mcpat/.
CUDA toolkit 5.0 performance report (2013). Nvidia. https://developer.nvidia.com/cuda-math-library.
Akin, B., Franchetti, F., & Hoe, J.C. (2014). FFTs with near-optimal memory access through block data layouts. In Proceedings of IEEE international conference on acoustics speech and signal processing (ICASSP).
Akin, B., Franchetti, F., & Hoe, J.C. (2014). Understanding the design space of dram-optimized hardware FFT accelerators. In IEEE 25th international conference on application-specific systems, architectures and processors, ASAP 2014, June 18-20, (pp. 248–255). Zurich.
Akin, B., Franchetti, F., & Hoe, J.C. (2015). Data reorganization in memory using 3d-stacked dram. In Proceedings of the 42nd international symposium on computer architecture (ISCA).
Akin, B., Hoe, J.C., & Franchetti, F. (2014). Hamlet: hardware accelerated memory layout transform within 3d-stacked DRAM. In IEEE high performance extreme computing conference, HPEC 2014, September 9-11 (pp. 1–6). Waltham.
Akin, B., Milder, P.A., Franchetti, F., & Hoe, J.C. (2012). Memory bandwidth efficient two-dimensional fast Fourier transform algorithm and implementation for large problem sizes. In Proceedings of the IEEE symposium on FCCM (pp. 188–191).
Chen, K., Li, S., Muralimanohar, N., Ahn, J.H., Brockman, J., & Jouppi, N. (2012). CACTI-3DD: architecture-level modeling for 3D die-stacked DRAM main memory. In Design, automation test in Europe (DATE) (pp. 33–38).
Chung, E.S., Milder, P.A., Hoe, J.C., & Mai, K. (2010). Single-chip heterogeneous computing: does the future include custom logic, FPGAs, and GPGPUs?. In Proceedings of the 43th IEEE/ACM international symposium on microarchitecture (MICRO).
Cooley, J.W., & Tukey, J.W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90), 297–301.
Article MathSciNet MATH Google Scholar
Eleftheriou, M., Fitch, B., Rayshubskiy, A., Ward, T.J.C., & Germain, R. (2005). Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements. IBM Journal of Research and Development, 49(2.3), 457–464.
Article Google Scholar
Franchetti, F., & Püschel, M. (2011). Encyclopedia of Parallel Computing, chap, Fast fourier transform: Springer.
Frigo, M., & Johnson, S.G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, Special issue on Program Generation Optimization, and Platform Adaptation, 93(2), 216–231.
Google Scholar
Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., & Manferdelli, J. (2008). High performance discrete Fourier transforms on graphics processors. In Proceedings of the ACM/IEEE conference on supercomputing (SC) (pp. 2:1–2:12).
Loh, G.H. (2008). 3D-stacked memory architectures for multi-core processors. In Proceedings of the 35th annual international symposium on computer architecture, (ISCA) (pp. 453–464).
Milder, P.A., Franchetti, F., Hoe, J.C., & Püschel, M. (2012). Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2).
Milder, P.A., Hoe, J.C., & Püschel, M. (2009). Automatic generation of streaming datapaths for arbitrary fixed permutations. In Design, automation and test in Europe (DATE) (pp. 1118–1123).
Pawlowski, J.T. (2011). Hybrid memory cube (HMC). In Hotchips.
Pedram, A., McCalpin, J., & Gerstlauer, A. (2013). Transforming a linear algebra core to an FFT accelerator. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 175–184).
Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., & Rizzolo, N. (2005). SPIRAL: code generation for DSP transforms. Proceedings of IEEE, Special Issue on Program Generation Optimization, and Adaptation, 93(2), 232–275.
Google Scholar
Rosenfeld, P., Cooper-Balis, E., & Jacob, B. (2011). Dramsim2: a cycle accurate memory system simulator. IEEE Computer Architecture Letters, 10(1), 16–19.
Article Google Scholar
Loan Van, C. (1992). Computational frameworks for the fast Fourier transform: SIAM.
Weis, C., Loi, I., Benini, L., & Wehn, N. (2013). Exploration and optimization of 3-D integrated dram subsystems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(4), 597–610.
Article Google Scholar
Yu, C.L., Irick, K., Chakrabarti, C., & Narayanan, V. (2010). Multidimensional DFT IP generator for FPGA platforms. IEEE Transactions on Circuits and Systems, 58(4), 755–764.
Article MathSciNet Google Scholar
Zhu, Q., Akin, B., Sumbul, H., Sadi, F., Hoe, J., Pileggi, L., & Franchetti, F. (2013). A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In 2013 IEEE international 3D systems integration conference (3DIC) (pp. 1–7).
Zhu, Q., Vaidyanathan, K., Shacham, O., Horowitz, M., Pileggi, L., & Franchetti, F. (2012). Design automation framework for application-specific logic-in-memory blocks. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 125–132).

Download references

Acknowledgments

The work was sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement no. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
Berkin Akin, Franz Franchetti & James C. Hoe

Authors

Berkin Akin
View author publications
You can also search for this author in PubMed Google Scholar
Franz Franchetti
View author publications
You can also search for this author in PubMed Google Scholar
James C. Hoe
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Berkin Akin.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Akin, B., Franchetti, F. & Hoe, J.C. FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation. J Sign Process Syst 85, 67–82 (2016). https://doi.org/10.1007/s11265-015-1018-0

Download citation

Received: 10 October 2014
Revised: 20 March 2015
Accepted: 13 May 2015
Published: 23 June 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s11265-015-1018-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

Abstract

Access this article

Similar content being viewed by others

A Survey on Pipelined FFT Hardware Architectures

Recent progress in InGaZnO FETs for high-density 2T0C DRAM applications

On the impact of quantum computing technology on future developments in high-performance scientific computing

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

Abstract

Access this article

Similar content being viewed by others

A Survey on Pipelined FFT Hardware Architectures

Recent progress in InGaZnO FETs for high-density 2T0C DRAM applications

On the impact of quantum computing technology on future developments in high-performance scientific computing

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation