Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline

One of the major problems with GPU on-chip shared memory is bank conflicts. Our analysis shows that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These varied latencies create conflicts at the writeback stage of the in-order pipeline and cause pipeline stalls, degrading system throughput. Based on this observation, we investigate and propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our Elastic Pipeline, together with the co-designed bank-conflict aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.
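To illustrate the bank-conflict behavior the abstract refers to, the following is a minimal, hypothetical Python model (not from the paper) of one warp's shared-memory access. It assumes a typical configuration of 32 banks with 4-byte words, and that a conflicting access is serialized into as many passes as the most heavily loaded bank requires; it is these access-dependent, varied latencies that the Elastic Pipeline decouples from pipeline stalls.

```python
# Minimal model of shared memory bank conflicts for one warp's access.
# Assumptions (illustrative, not from the paper): 32 banks, 4-byte words,
# and a conflicting access is serialized into `degree` passes, where
# `degree` is the largest number of distinct words one bank must serve.

from collections import defaultdict

NUM_BANKS = 32
WORD_SIZE = 4  # bytes

def conflict_degree(byte_addresses):
    """Return how many serialized passes a warp's access needs."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_SIZE
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

# Unit-stride access: 32 threads hit 32 different banks -> 1 pass.
unit = [4 * tid for tid in range(32)]
# Stride-2 access: threads pair up on 16 banks -> 2 serialized passes.
stride2 = [8 * tid for tid in range(32)]

print(conflict_degree(unit))     # 1: conflict-free
print(conflict_degree(stride2))  # 2: two-way bank conflict
```

Because the conflict degree varies per access, shared-memory load latency varies per warp, which in a rigid in-order pipeline surfaces as writeback conflicts and stalls.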


Keywords: GPU · On-chip shared memory · Bank conflicts · Elastic pipeline



This work was supported by the European Commission in the context of the SARC project (FP6 FET Contract #27648) and continued under the ENCORE project (FP7 ICT4 Contract #249059). We would also like to thank the reviewers for their insightful comments.

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.



Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Computer Engineering Laboratory, Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft, Delft, The Netherlands
