Experiences with Mapping Non-linear Memory Access Patterns into GPUs

  • Eladio Gutierrez
  • Sergio Romero
  • Maria A. Trenas
  • Oscar Plata
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5544)


Modern Graphics Processing Units (GPUs) are very powerful computational systems on a chip. For this reason, there is growing interest in using these units as general-purpose hardware accelerators (GPGPU). To facilitate the programming of general-purpose applications, NVIDIA introduced the CUDA programming environment. CUDA provides a simplified abstraction of the underlying complex GPU architecture, so a number of critical optimizations must be applied to the code to achieve good performance. In this paper we discuss our experience in porting an application kernel to the GPU, and the classes of design decisions we adopted in order to obtain maximum performance.
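The kind of optimization the abstract refers to can be illustrated with the canonical shared-memory matrix transpose (a sketch of our own, not code from the paper): a non-linear access pattern (the transposed read/write) is staged through on-chip shared memory so that all global-memory accesses remain coalesced.

```cuda
#define TILE 32  // tile width; assumes the matrix dimension n is a multiple of TILE

// Sketch: transpose an n x n matrix of floats. Each thread block moves one
// TILE x TILE tile. Global reads and writes are both coalesced; the
// "non-linear" (transposed) indexing happens only inside fast shared memory.
__global__ void transpose(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global read

    __syncthreads();                                  // tile fully loaded

    // Swap the block coordinates so the write target is the transposed tile.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced global write
}
```

A naive one-line transpose (`out[x * n + y] = in[y * n + x]`) issues strided global writes that the hardware cannot coalesce; routing the permutation through shared memory is the standard remedy for such non-linear patterns.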


Keywords: Fast Fourier Transform · Graphic Processing Unit · Shared Memory · Global Memory · Memory Access Pattern


References

  1. Fialka, O., Cadik, M.: FFT and Convolution Performance in Image Filtering on GPU. In: 10th Int'l. Conf. on Information Visualization (2006)
  2. General-Purpose Computation Using Graphics Hardware,
  3. Govindaraju, N.K., Larsen, S., Gray, J., Manocha, D.: A Memory Model for Scientific Algorithms on Graphics Processors. In: ACM Int'l. Conf. on Supercomputing (2006)
  4. Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High Performance Discrete Fourier Transforms on Graphics Processors. In: Int'l. Conf. for High Performance Computing, Networking, Storage and Analysis (SC 2008) (2008)
  5. Garland, M., Le Grand, S., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Zhang, Y., Volkov, V.: Parallel Computing Experiences with CUDA. IEEE Micro 28(4), 13–27 (2008)
  6. Jansen, T., von Rymon-Lipinski, B., Hanssen, N., Keeve, E.: Fourier Volume Rendering on the GPU Using a Split-Stream FFT. In: Vision, Modeling, and Visualization Workshop (2004)
  7. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28(2), 39–55 (2008)
  8. Manikandan, M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., Sadayappan, P.: A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. In: ACM Int'l. Conf. on Supercomputing (2008)
  9. Moreland, K., Angel, E.: The FFT on a GPU. In: ACM Conf. on Graphics Hardware (2003)
  10. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable Parallel Programming with CUDA. ACM Queue 6(2), 40–53 (2008)
  11.
  12. The OpenCL Specification, Ver. 1.0.29, Khronos OpenCL Working Group,
  13. Petit, E., Matz, S., Bodin, F.: Data Transfer Optimization in Scientific Applications for GPU based Acceleration. In: Workshop on Compilers for Parallel Computers (2007)
  14. Sumanaweera, T., Liu, D.: Medical Image Reconstruction with the FFT. GPU Gems 2, 765–784 (2005)
  15. Volkov, V., Kazian, B.: Fitting FFT onto the G80 Architecture (2008),

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Eladio Gutierrez (1)
  • Sergio Romero (1)
  • Maria A. Trenas (1)
  • Oscar Plata (1)

  1. Department of Computer Architecture, University of Malaga, Spain
