Abstract
One of the major performance problems with GPU on-chip shared memory is bank conflicts. We show that the throughput of a GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These varied latencies create conflicts at the writeback stage of the in-order pipeline and cause pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed Elastic Pipeline, together with the co-designed bank-conflict aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.
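To make the bank-conflict mechanism concrete, the following is a minimal illustrative sketch (not taken from the paper), assuming the common CUDA shared memory organization of 32 banks of 4-byte words: each thread in a warp issues one word address, addresses in distinct words of the same bank serialize, and accesses to the same word are broadcast. The "conflict degree" computed here is the serialization factor that turns a fixed-latency access into a variable-latency one.

```python
# Illustrative model of shared memory bank conflicts (assumption: 32 banks,
# one 4-byte word per bank per cycle, same-word accesses broadcast for free).
from collections import Counter

NUM_BANKS = 32

def conflict_degree(word_addresses):
    """Maximum number of *distinct* word addresses mapping to one bank.

    Degree 1 means the warp's access is conflict-free and completes in a
    single shared-memory cycle; degree k serializes into k cycles, which
    is the variable latency that the Elastic Pipeline decouples from
    pipeline stalls.
    """
    per_bank = Counter(addr % NUM_BANKS for addr in set(word_addresses))
    return max(per_bank.values())

def strided_warp(stride):
    # Word addresses issued by the 32 threads of one warp.
    return [tid * stride for tid in range(32)]

print(conflict_degree(strided_warp(1)))   # stride 1: conflict-free -> 1
print(conflict_degree(strided_warp(2)))   # stride 2: 2-way conflict -> 2
print(conflict_degree(strided_warp(32)))  # stride 32: all hit bank 0 -> 32
```

Power-of-two strides are the classic worst case: a stride of 32 words lands every thread in bank 0, serializing the warp's single access into 32 shared-memory cycles.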
References
Gou, C., Gaydadjiev, G.N.: Elastic pipeline: addressing GPU on-chip shared memory bank conflicts. In: Proceedings of the 8th ACM International Conference on Computing Frontiers (CF’11) (May 2011), pp. 1–11 (2011)
GPGPU home page. http://gpgpu.org
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33, 103–111 (1990)
Thistle, M.R., Smith, B.J.: A processor architecture for Horizon. In: Proceedings of the 1988 ACM/IEEE Conference on Supercomputing (Los Alamitos, CA, USA, 1988), Supercomputing ’88. IEEE Computer Society Press, pp. 35–41 (1988)
Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. Queue 6, 40–53 (2008)
OpenCL home page. http://www.khronos.org/opencl/
Hou, Q., Zhou, K., Guo, B.: BSGP: bulk-synchronous GPU programming. In: ACM SIGGRAPH 2008 papers, SIGGRAPH ’08. ACM, pp. 19:1–19:12 (2008)
AMD fusion. http://sites.amd.com/us/fusion/apu/pages/fusion.aspx
Glaskowsky, P.N.: NVIDIA’s Fermi: the first complete GPU computing architecture. White paper (2009)
Little, J.D.C.: A proof for the queuing formula: L = λW. Oper. Res. 9(3), 383–387 (1961)
NVIDIA. CUDA best practice guide, edition 3.0
Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T.: Analyzing CUDA workloads using a detailed GPU simulator. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), pp. 163–174 (2009)
Wong, H., Papadopoulou, M.-M., Sadooghi-Alvandi, M., Moshovos, A.: Demystifying GPU microarchitecture through microbenchmarking. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2010), pp. 235–246 (2010)
Stark, J., Brown, M.D., Patt, Y.N.: On pipelining dynamic instruction scheduling logic. In: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 33. ACM, pp. 57–66 (2000)
Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient GPU control flow. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40. IEEE Computer Society, pp. 407–420 (2007)
Volkov, V.: Better performance at lower occupancy. In: GPU Technology Conference 2010 GTC ’10 (2010)
Bhatnagar, H.: Advanced ASIC Chip Synthesis Using Synopsys Design Compiler, Physical Compiler and PrimeTime. Kluwer, Dordrecht (2001)
Rixner, S., Dally, W., Kapasi, U., Mattson, P., Owens, J.: Memory access scheduling. In: Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000) (2000)
NVIDIA. The CUDA compiler driver NVCC, edition 2.2
NVIDIA. CUDA SDK version 2.2
Manavski, S.: CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In: IEEE International Conference on Signal Processing and Communications (ICSPC 2007), pp. 65–68 (2007)
AMD Graphic Core Next Architecture. http://developer.amd.com/afds/assets/presentations/2620_final.pdf
Kirk, D.B., Hwu, W.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, Los Altos (2010)
Harper, D.T. III: Block, multistride vector and FFT accesses in parallel memory systems. IEEE Trans. Parallel Distrib. Syst. 2(1), 43–51 (1991)
Harper, D.T. III: Increased memory performance during vector accesses through the use of linear address transformations. IEEE Trans. Comput. 41(2), 227–230 (1992)
Harper, D.T. III, Linebarger, D.A.: Conflict-free vector access using a dynamic storage scheme. IEEE Trans. Comput. 40(3), 276–283 (1991)
Gou, C., Kuzmanov, G., Gaydadjiev, G.N.: SAMS multi-layout memory: providing multiple views of data to boost SIMD performance. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10. ACM, pp. 179–188 (2010)
Gou, C., Kuzmanov, G.K., Gaydadjiev, G.N.: SAMS: single-affiliation multiple-stride parallel memory scheme. In: Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem? MAW ’08. ACM, pp. 350–368 (2008)
Valero, M., Lang, T., Peiron, M., Ayguade, E.: Conflict-free access for streams in multimodule memories. IEEE Trans. Comput. 44, 634–646 (1995)
Diamos, G.F., Kerr, A.R., Yalamanchili, S., Clark, N.: Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10. ACM, pp. 353–364 (2010)
Yang, Y., Xiang, P., Kong, J., Zhou, H.: A GPGPU compiler for memory optimization and parallelism management. In: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10. ACM, pp. 86–97 (2010)
Gelado, I., Stone, J.E., Cabezas, J., Patel, S., Navarro, N., Hwu, W.-M.W.: An asymmetric distributed shared memory model for heterogeneous parallel systems. In: Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’10. ACM, pp. 347–358 (2010)
Intel Sandy Bridge, Intel processor roadmap (2010)
Fung, W.W.L., Aamodt, T.M.: CPU-assisted GPGPU on fused CPU-GPU architectures. In: Proceedings of the 2012 IEEE 18th International Symposium on High Performance Computer Architecture HPCA ’12 (2012)
Gou, C., Gaydadjiev, G.: Exploiting SPMD horizontal locality. IEEE Comput. Archit. Lett. 10(1), 20–23 (2011)
Lee, J., Lakshminarayana, N.B., Kim, H., Vuduc, R.: Many-thread aware prefetching mechanisms for GPGPU applications. In: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43. IEEE Computer Society, pp. 213–224 (2010)
Narasiman, V., Lee, C.J., Shebanow, M., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of the 44th Annual ACM/IEEE International Symposium on Microarchitecture MICRO 44 (2011)
Yuan, G., Bakhoda, A., Aamodt, T.: Complexity effective memory access scheduling for many-core accelerator architectures. In: 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pp. 34–44 (2009)
Acknowledgments
This work was supported by the European Commission in the context of the SARC project (FP6 FET Contract #27648) and was continued by the ENCORE project (FP7 ICT4 Contract #249059). We would also like to thank the reviewers for their insightful comments.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Additional information
This is an extended version of the paper originally presented in Computing Frontiers 2011 [1].
Cite this article
Gou, C., Gaydadjiev, G.N. Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline. Int J Parallel Prog 41, 400–429 (2013). https://doi.org/10.1007/s10766-012-0201-1