
ComScribe: Identifying Intra-node GPU Communication

  • Conference paper
  • In: Benchmarking, Measuring, and Optimizing (Bench 2020)

Abstract

GPU communication plays a critical role in the performance and scalability of multi-GPU accelerated applications. With the ever-increasing methods and types of communication, it is often hard for the programmer to know the exact amount and type of communication taking place in an application. Though there are prior works that detect communication in distributed systems for MPI and in multi-threaded applications on shared-memory systems, to our knowledge none of these works identify intra-node GPU communication. We propose a tool, ComScribe, that identifies and categorizes types of communication among all GPU-GPU and CPU-GPU pairs in a node. Built on top of NVIDIA's profiler nvprof, ComScribe visualizes data movement as a communication matrix or bar chart for explicit communication primitives, Unified Memory operations, and Zero-copy Memory transfers. To validate our tool on 16 GPUs, we present communication patterns of 8 micro- and 3 macro-benchmarks from the NVIDIA, Comm|Scope, and MGBench benchmark suites. To demonstrate the tool's capabilities in real-life applications, we also present insightful communication matrices of two deep neural network models. All in all, ComScribe can guide the programmer in identifying which groups of GPUs communicate, in what volume, and using which primitives. This offers avenues to detect performance bottlenecks and, more importantly, communication bugs in an application.
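The abstract's central idea, aggregating profiler-reported transfers into a per-device-pair communication matrix, can be sketched as post-processing of an nvprof GPU trace. The sketch below is illustrative only: the CSV column names and sample rows are simplifying assumptions, not nvprof's exact `--print-gpu-trace --csv` output format, and the aggregation shown is a minimal stand-in for what a tool like ComScribe performs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified GPU-trace CSV. A real nvprof trace carries many
# more fields (timestamps, durations, memory types, streams, ...).
TRACE = """\
Name,SrcDevice,DstDevice,Bytes
[CUDA memcpy HtoD],Host,GPU0,1048576
[CUDA memcpy PtoP],GPU0,GPU1,4194304
[CUDA memcpy PtoP],GPU0,GPU1,4194304
[CUDA memcpy DtoH],GPU1,Host,524288
"""

def comm_matrix(trace_csv):
    """Accumulate transferred bytes per (source, destination) device pair."""
    matrix = defaultdict(int)
    for row in csv.DictReader(io.StringIO(trace_csv)):
        matrix[(row["SrcDevice"], row["DstDevice"])] += int(row["Bytes"])
    return dict(matrix)

# Two 4 MiB peer-to-peer copies accumulate into a single GPU0 -> GPU1 entry.
print(comm_matrix(TRACE))
```

Rendering such a matrix as a heatmap or bar chart per primitive type (explicit copies, Unified Memory migrations, Zero-copy accesses) then gives the communication views the paper describes.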


Notes

  1. https://github.com/c3sr/comm_scope/blob/master/src/cudaMemcpyAsync-duplex/gpu_gpu_peer.cpp.


Acknowledgement

Some of the authors from Koç University are supported by the Turkish Science and Technology Research Centre Grant No: 118E801. The research presented in this paper has benefited from the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.

Author information


Corresponding author

Correspondence to Palwisha Akhtar.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Akhtar, P., Tezcan, E., Qararyah, F.M., Unat, D. (2021). ComScribe: Identifying Intra-node GPU Communication. In: Wolf, F., Gao, W. (eds.) Benchmarking, Measuring, and Optimizing. Bench 2020. Lecture Notes in Computer Science, vol. 12614. Springer, Cham. https://doi.org/10.1007/978-3-030-71058-3_10


  • DOI: https://doi.org/10.1007/978-3-030-71058-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71057-6

  • Online ISBN: 978-3-030-71058-3

  • eBook Packages: Computer Science (R0)
