Abstract
One of the major performance problems with GPU on-chip shared memory is bank conflicts. We show that the throughput of a GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These varied latencies create conflicts at the writeback stage of the in-order pipeline and cause pipeline stalls, thus degrading system throughput. Based on this observation, we investigate and propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed Elastic Pipeline, together with the co-designed bank-conflict aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.
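To make the bank-conflict mechanism concrete, the following is a minimal illustrative sketch (not taken from the paper), assuming the common CUDA shared memory organization of 32 banks of 4-byte words: each thread in a warp issues one word address, addresses in distinct words of the same bank serialize, and accesses to the same word are broadcast. The "conflict degree" computed here is the serialization factor that turns a fixed-latency access into a variable-latency one.

```python
# Illustrative model of shared memory bank conflicts (assumption: 32 banks,
# one 4-byte word per bank per cycle, same-word accesses broadcast for free).
from collections import Counter

NUM_BANKS = 32

def conflict_degree(word_addresses):
    """Maximum number of *distinct* word addresses mapping to one bank.

    Degree 1 means the warp's access is conflict-free and completes in a
    single shared-memory cycle; degree k serializes into k cycles, which
    is the variable latency that the Elastic Pipeline decouples from
    pipeline stalls.
    """
    per_bank = Counter(addr % NUM_BANKS for addr in set(word_addresses))
    return max(per_bank.values())

def strided_warp(stride):
    # Word addresses issued by the 32 threads of one warp.
    return [tid * stride for tid in range(32)]

print(conflict_degree(strided_warp(1)))   # stride 1: conflict-free -> 1
print(conflict_degree(strided_warp(2)))   # stride 2: 2-way conflict -> 2
print(conflict_degree(strided_warp(32)))  # stride 32: all hit bank 0 -> 32
```

Power-of-two strides are the classic worst case: a stride of 32 words lands every thread in bank 0, serializing the warp's single access into 32 shared-memory cycles.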
References
Gou, C., Gaydadjiev, G.N.: Elastic pipeline: addressing GPU on-chip shared memory bank conflicts. In: Proceedings of the 8th ACM International Conference on Computing Frontiers (CF’11) (May 2011), pp. 1–11 (2011)
GPGPU home page. http://gpgpu.org
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33, 103–111 (1990)
Thistle, M.R., Smith, B.J.: A processor architecture for Horizon. In: Proceedings of the 1988 ACM/IEEE Conference on Supercomputing (Los Alamitos, CA, USA, 1988), Supercomputing ’88. IEEE Computer Society Press, pp. 35–41 (1988)
Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. Queue 6, 40–53 (2008)
OpenCL home page. http://www.khronos.org/opencl/
Hou, Q., Zhou, K., Guo, B.: BSGP: bulk-synchronous GPU programming. In: ACM SIGGRAPH 2008 papers, SIGGRAPH ’08. ACM, pp. 19:1–19:12 (2008)
AMD fusion. http://sites.amd.com/us/fusion/apu/pages/fusion.aspx
Glaskowsky, P.N.: NVIDIA’s Fermi: the first complete GPU computing architecture. White paper (2009)
Little, J.D.C.: A proof for the queuing formula: L = λW. Oper. Res. 9(3), 383–387 (1961)
NVIDIA. CUDA best practice guide, edition 3.0
Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T.: Analyzing CUDA workloads using a detailed GPU simulator. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), pp. 163–174 (2009)
Wong, H., Papadopoulou, M.-M., Sadooghi-Alvandi, M., Moshovos, A.: Demystifying GPU microarchitecture through microbenchmarking. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2010), pp. 235–246 (2010)
Stark, J., Brown, M.D., Patt, Y.N.: On pipelining dynamic instruction scheduling logic. In: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 33. ACM, pp. 57–66 (2000)
Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient GPU control flow. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40. IEEE Computer Society, pp. 407–420 (2007)
Volkov, V.: Better performance at lower occupancy. In: GPU Technology Conference 2010 GTC ’10 (2010)
Bhatnagar, H.: Advanced ASIC Chip Synthesis Using Synopsys Design Compiler, Physical Compiler and PrimeTime. Kluwer, Dordrecht (2001)
Rixner, S., Dally, W., Kapasi, U., Mattson, P., Owens, J.: Memory access scheduling. In: Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000) (2000)
NVIDIA. The CUDA compiler driver NVCC, edition 2.2
NVIDIA. CUDA SDK version 2.2
Manavski, S.: CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In: IEEE International Conference on Signal Processing and Communications (ICSPC 2007), pp. 65–68 (2007)
AMD Graphic Core Next Architecture. http://developer.amd.com/afds/assets/presentations/2620_final.pdf
Kirk, D.B., Hwu, W.W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, Los Altos (2010)
Harper, D.T. III: Block, multistride vector and FFT accesses in parallel memory systems. IEEE Trans. Parallel Distrib. Syst. 2(1), 43–51 (1991)
Harper, D.T. III: Increased memory performance during vector accesses through the use of linear address transformations. IEEE Trans. Comput. 41(2), 227–230 (1992)
Harper, D.T. III, Linebarger, D.A.: Conflict-free vector access using a dynamic storage scheme. IEEE Trans. Comput. 40(3), 276–283 (1991)
Gou, C., Kuzmanov, G., Gaydadjiev, G.N.: SAMS multi-layout memory: providing multiple views of data to boost SIMD performance. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10. ACM, pp. 179–188 (2010)
Gou, C., Kuzmanov, G.K., Gaydadjiev, G.N.: SAMS: single-affiliation multiple-stride parallel memory scheme. In: Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem? MAW ’08. ACM, pp. 350–368 (2008)
Valero, M., Lang, T., Peiron, M., Ayguade, E.: Conflict-free access for streams in multimodule memories. IEEE Trans. Comput. 44, 634–646 (1995)
Diamos, G.F., Kerr, A.R., Yalamanchili, S., Clark, N.: Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10. ACM, pp. 353–364 (2010)
Yang, Y., Xiang, P., Kong, J., Zhou, H.: A GPGPU compiler for memory optimization and parallelism management. In: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10. ACM, pp. 86–97 (2010)
Gelado, I., Stone, J.E., Cabezas, J., Patel, S., Navarro, N., Hwu, W.-M.W.: An asymmetric distributed shared memory model for heterogeneous parallel systems. In: Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’10. ACM, pp. 347–358 (2010)
Intel Sandy Bridge, Intel processor roadmap (2010)
Fung, W.W.L., Aamodt, T.M.: CPU-assisted GPGPU on fused CPU-GPU architectures. In: Proceedings of the 2012 IEEE 18th International Symposium on High Performance Computer Architecture HPCA ’12 (2012)
Gou, C., Gaydadjiev, G.: Exploiting SPMD horizontal locality. IEEE Comput. Archit. Lett. 10(1), 20–23 (2011)
Lee, J., Lakshminarayana, N.B., Kim, H., Vuduc, R.: Many-thread aware prefetching mechanisms for GPGPU applications. In: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43. IEEE Computer Society, pp. 213–224 (2010)
Narasiman, V., Lee, C.J., Shebanow, M., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of the 44th Annual ACM/IEEE International Symposium on Microarchitecture MICRO 44 (2011)
Yuan, G., Bakhoda, A., Aamodt, T.: Complexity effective memory access scheduling for many-core accelerator architectures. In: 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pp. 34–44 (2009)
Acknowledgments
This work was supported by the European Commission in the context of the SARC project (FP6 FET Contract #27648) and was continued by the ENCORE project (FP7 ICT4 Contract #249059). We would also like to thank the reviewers for their insightful comments.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Additional information
This is an extended version of the paper originally presented in Computing Frontiers 2011 [1].
Cite this article
Gou, C., Gaydadjiev, G.N. Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline. Int J Parallel Prog 41, 400–429 (2013). https://doi.org/10.1007/s10766-012-0201-1