Advertisement

Journal of Signal Processing Systems

, Volume 85, Issue 1, pp 67–82 | Cite as

FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

  • Berkin Akin
  • Franz Franchetti
  • James C. Hoe
Article

Abstract

Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.

Keywords

Fast fourier transforms Algorithm design and analysis Data layout DRAM Bandwidth Design space 

Notes

Acknowledgments

The work was sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement no. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred.

References

  1. 1.
  2. 2.
    DDR3-1600 dram datasheet. MT41J256M4, Micron. http://www.micron.com/parts/dram/ddr3-sdram.
  3. 3.
    DDR3 sdram system-power calculator. Micron. http://www.micron.com/products/support/power-calc.
  4. 4.
    DesignWare library. Synopsys. http://www.synopsys.com/dw.
  5. 5.
  6. 6.
    CUDA toolkit 5.0 performance report (2013). Nvidia. https://developer.nvidia.com/cuda-math-library.
  7. 7.
    Akin, B., Franchetti, F., & Hoe, J.C. (2014). FFTs with near-optimal memory access through block data layouts. In Proceedings of IEEE international conference on acoustics speech and signal processing (ICASSP).Google Scholar
  8. 8.
    Akin, B., Franchetti, F., & Hoe, J.C. (2014). Understanding the design space of dram-optimized hardware FFT accelerators. In IEEE 25th international conference on application-specific systems, architectures and processors, ASAP 2014, June 18-20, (pp. 248–255). Zurich.Google Scholar
  9. 9.
    Akin, B., Franchetti, F., & Hoe, J.C. (2015). Data reorganization in memory using 3d-stacked dram. In Proceedings of the 42nd international symposium on computer architecture (ISCA).Google Scholar
  10. 10.
    Akin, B., Hoe, J.C., & Franchetti, F. (2014). Hamlet: hardware accelerated memory layout transform within 3d-stacked DRAM. In IEEE high performance extreme computing conference, HPEC 2014, September 9-11 (pp. 1–6). Waltham.Google Scholar
  11. 11.
    Akin, B., Milder, P.A., Franchetti, F., & Hoe, J.C. (2012). Memory bandwidth efficient two-dimensional fast Fourier transform algorithm and implementation for large problem sizes. In Proceedings of the IEEE symposium on FCCM (pp. 188–191).Google Scholar
  12. 12.
    Chen, K., Li, S., Muralimanohar, N., Ahn, J.H., Brockman, J., & Jouppi, N. (2012). CACTI-3DD: architecture-level modeling for 3D die-stacked DRAM main memory. In Design, automation test in Europe (DATE) (pp. 33–38).Google Scholar
  13. 13.
    Chung, E.S., Milder, P.A., Hoe, J.C., & Mai, K. (2010). Single-chip heterogeneous computing: does the future include custom logic, FPGAs, and GPGPUs?. In Proceedings of the 43th IEEE/ACM international symposium on microarchitecture (MICRO).Google Scholar
  14. 14.
    Cooley, J.W., & Tukey, J.W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90), 297–301.MathSciNetCrossRefzbMATHGoogle Scholar
  15. 15.
    Eleftheriou, M., Fitch, B., Rayshubskiy, A., Ward, T.J.C., & Germain, R. (2005). Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements. IBM Journal of Research and Development, 49(2.3), 457–464.CrossRefGoogle Scholar
  16. 16.
    Franchetti, F., & Püschel, M. (2011). Encyclopedia of Parallel Computing, chap, Fast fourier transform: Springer.Google Scholar
  17. 17.
    Frigo, M., & Johnson, S.G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, Special issue on Program Generation Optimization, and Platform Adaptation, 93(2), 216–231.Google Scholar
  18. 18.
    Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., & Manferdelli, J. (2008). High performance discrete Fourier transforms on graphics processors. In Proceedings of the ACM/IEEE conference on supercomputing (SC) (pp. 2:1–2:12).Google Scholar
  19. 19.
    Loh, G.H. (2008). 3D-stacked memory architectures for multi-core processors. In Proceedings of the 35th annual international symposium on computer architecture, (ISCA) (pp. 453–464).Google Scholar
  20. 20.
    Milder, P.A., Franchetti, F., Hoe, J.C., & Püschel, M. (2012). Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2).Google Scholar
  21. 21.
    Milder, P.A., Hoe, J.C., & Püschel, M. (2009). Automatic generation of streaming datapaths for arbitrary fixed permutations. In Design, automation and test in Europe (DATE) (pp. 1118–1123).Google Scholar
  22. 22.
    Pawlowski, J.T. (2011). Hybrid memory cube (HMC). In Hotchips.Google Scholar
  23. 23.
    Pedram, A., McCalpin, J., & Gerstlauer, A. (2013). Transforming a linear algebra core to an FFT accelerator. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 175–184).Google Scholar
  24. 24.
    Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., & Rizzolo, N. (2005). SPIRAL: code generation for DSP transforms. Proceedings of IEEE, Special Issue on Program Generation Optimization, and Adaptation, 93(2), 232–275.Google Scholar
  25. 25.
    Rosenfeld, P., Cooper-Balis, E., & Jacob, B. (2011). Dramsim2: a cycle accurate memory system simulator. IEEE Computer Architecture Letters, 10(1), 16–19.CrossRefGoogle Scholar
  26. 26.
    Loan Van, C. (1992). Computational frameworks for the fast Fourier transform: SIAM.Google Scholar
  27. 27.
    Weis, C., Loi, I., Benini, L., & Wehn, N. (2013). Exploration and optimization of 3-D integrated dram subsystems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(4), 597–610.CrossRefGoogle Scholar
  28. 28.
    Yu, C.L., Irick, K., Chakrabarti, C., & Narayanan, V. (2010). Multidimensional DFT IP generator for FPGA platforms. IEEE Transactions on Circuits and Systems, 58(4), 755–764.MathSciNetCrossRefGoogle Scholar
  29. 29.
    Zhu, Q., Akin, B., Sumbul, H., Sadi, F., Hoe, J., Pileggi, L., & Franchetti, F. (2013). A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In 2013 IEEE international 3D systems integration conference (3DIC) (pp. 1–7).Google Scholar
  30. 30.
    Zhu, Q., Vaidyanathan, K., Shacham, O., Horowitz, M., Pileggi, L., & Franchetti, F. (2012). Design automation framework for application-specific logic-in-memory blocks. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 125–132).Google Scholar

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. 1.Department of Electrical and Computer EngineeringCarnegie Mellon UniversityPittsburghUSA

Personalised recommendations