
ComScribe: Identifying Intra-node GPU Communication

  • Conference paper
  • In: Benchmarking, Measuring, and Optimizing (Bench 2020)

Abstract

GPU communication plays a critical role in the performance and scalability of multi-GPU accelerated applications. With the ever-increasing methods and types of communication, it is often hard for the programmer to know the exact amount and type of communication taking place in an application. Though there are prior works that detect communication in distributed systems for MPI and in multi-threaded applications on shared-memory systems, to our knowledge none of these works identify intra-node GPU communication. We propose a tool, ComScribe, that identifies and categorizes types of communication among all GPU-GPU and CPU-GPU pairs in a node. Built on top of NVIDIA's profiler nvprof, ComScribe visualizes data movement as a communication matrix or bar chart for explicit communication primitives, Unified Memory operations, and Zero-copy Memory transfers. To validate our tool on 16 GPUs, we present communication patterns of 8 micro- and 3 macro-benchmarks from the NVIDIA, Comm|Scope, and MGBench benchmark suites. To demonstrate the tool's capabilities in real-life applications, we also present insightful communication matrices of two deep neural network models. All in all, ComScribe can guide the programmer in identifying which groups of GPUs communicate, in what volume, and using which primitives. This offers avenues to detect performance bottlenecks and, more importantly, communication bugs in an application.
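The abstract's central idea, aggregating profiler-reported transfers into a per-device-pair communication matrix, can be sketched as post-processing of an nvprof GPU trace. The sketch below is illustrative only: the CSV column names and sample rows are simplifying assumptions, not nvprof's exact `--print-gpu-trace --csv` output format, and the aggregation shown is a minimal stand-in for what a tool like ComScribe performs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified GPU-trace CSV. A real nvprof trace carries many
# more fields (timestamps, durations, memory types, streams, ...).
TRACE = """\
Name,SrcDevice,DstDevice,Bytes
[CUDA memcpy HtoD],Host,GPU0,1048576
[CUDA memcpy PtoP],GPU0,GPU1,4194304
[CUDA memcpy PtoP],GPU0,GPU1,4194304
[CUDA memcpy DtoH],GPU1,Host,524288
"""

def comm_matrix(trace_csv):
    """Accumulate transferred bytes per (source, destination) device pair."""
    matrix = defaultdict(int)
    for row in csv.DictReader(io.StringIO(trace_csv)):
        matrix[(row["SrcDevice"], row["DstDevice"])] += int(row["Bytes"])
    return dict(matrix)

# Two 4 MiB peer-to-peer copies accumulate into a single GPU0 -> GPU1 entry.
print(comm_matrix(TRACE))
```

Rendering such a matrix as a heatmap or bar chart per primitive type (explicit copies, Unified Memory migrations, Zero-copy accesses) then gives the communication views the paper describes.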


Notes

  1. https://github.com/c3sr/comm_scope/blob/master/src/cudaMemcpyAsync-duplex/gpu_gpu_peer.cpp.


Acknowledgement

Some of the authors from Koç University are supported by the Turkish Science and Technology Research Centre Grant No: 118E801. The research presented in this paper has benefited from the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.

Author information


Corresponding author

Correspondence to Palwisha Akhtar.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Akhtar, P., Tezcan, E., Qararyah, F.M., Unat, D. (2021). ComScribe: Identifying Intra-node GPU Communication. In: Wolf, F., Gao, W. (eds.) Benchmarking, Measuring, and Optimizing. Bench 2020. Lecture Notes in Computer Science, vol. 12614. Springer, Cham. https://doi.org/10.1007/978-3-030-71058-3_10


  • DOI: https://doi.org/10.1007/978-3-030-71058-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71057-6

  • Online ISBN: 978-3-030-71058-3

  • eBook Packages: Computer Science (R0)
