Tracing and Profiling Machine Learning Dataflow Applications on GPU

Zins, Pierre; Dagenais, Michel

doi:10.1007/s10766-019-00630-5

Published: 11 February 2019

Volume 47, pages 973–1013, (2019)

International Journal of Parallel Programming

Abstract

In this paper, we propose a profiling and tracing method for dataflow applications with GPU acceleration. Dataflow models can be represented as graphs and are widely used in domains such as signal processing and machine learning. Within the graph, data flows along the edges, and the nodes correspond to the computing units that process the data. To accelerate execution, co-processing units such as GPUs are often used for compute-intensive nodes. This work aims to provide useful information about the execution of the dataflow graph on the available hardware, in order to understand and possibly improve performance. The collected traces include low-level information about the CPU from the Linux kernel (system calls), mid-level information about intermediate libraries such as CUDA, HIP, or HSA, and high-level information about the dataflow model. Post-mortem analysis and visualization steps then enhance the trace and present useful information to the user. To demonstrate its effectiveness, we evaluated the method on TensorFlow, a well-known machine learning library that uses a dataflow computational graph to represent algorithms. We present a few examples of machine learning applications that can be optimized with the help of the information provided by our method. For example, we reduce the execution time of a face recognition application by a factor of 5. We also suggest a better placement of the computation nodes on the available hardware components for a distributed application. Finally, we improve the memory management of an application to speed up its execution.
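
The dataflow model described in the abstract can be illustrated with a minimal sketch. This is not the method of the paper: it is a toy, CPU-only executor, and the names `Node`, `run_graph`, and the trace field names are hypothetical, chosen only for this example. It shows nodes consuming data along edges and one timestamped trace event being recorded per node execution, which is the kind of high-level information a dataflow-level trace might contain.

```python
import time
from collections import deque


class Node:
    """A computing unit in the graph; `inputs` names its incoming edges."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = list(inputs)


def run_graph(nodes, feeds):
    """Run nodes in topological order and record one trace event per node.

    `feeds` maps external input names to values. Inputs that are not
    produced by a node in `nodes` are assumed to be present in `feeds`.
    """
    by_name = {n.name: n for n in nodes}
    # Number of inputs each node is still waiting for from other nodes.
    pending = {n.name: sum(1 for i in n.inputs if i in by_name) for n in nodes}
    # Reverse edges: which nodes consume the output of a given node.
    consumers = {n.name: [] for n in nodes}
    for n in nodes:
        for i in n.inputs:
            if i in by_name:
                consumers[i].append(n.name)

    ready = deque(name for name, count in pending.items() if count == 0)
    values = dict(feeds)
    trace = []
    while ready:
        name = ready.popleft()
        node = by_name[name]
        args = [values[i] for i in node.inputs]
        start = time.perf_counter_ns()
        values[name] = node.fn(*args)
        end = time.perf_counter_ns()
        trace.append({"node": name, "start_ns": start, "end_ns": end})
        # Data flows along outgoing edges; a consumer becomes ready
        # once all of its inputs have been produced.
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return values, trace


# Hypothetical three-node graph: two independent nodes feeding a reduction.
graph = [
    Node("scale", lambda x: [2 * v for v in x], inputs=["x"]),
    Node("shift", lambda x: [v + 1 for v in x], inputs=["x"]),
    Node("sum", lambda a, b: sum(a) + sum(b), inputs=["scale", "shift"]),
]
values, trace = run_graph(graph, {"x": [1, 2, 3]})
for event in trace:
    print(event["node"], event["end_ns"] - event["start_ns"], "ns")
```

In a real setting the per-node events would have to be correlated with kernel-level and GPU-runtime events collected by dedicated tracers; the sketch only covers the graph-level layer.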

Acknowledgements

The financial support of Ericsson, Ciena, Google, EfficiOS, Prompt and the Natural Sciences and Engineering Research Council of Canada (NSERC) is gratefully acknowledged. We are also grateful to Advanced Micro Devices (AMD) for providing the hardware and software that made this research possible.

Author information

Authors and Affiliations

Ecole Polytechnique Montreal, Montreal, QC, H3T 1J4, Canada
Pierre Zins & Michel Dagenais

Corresponding author

Correspondence to Pierre Zins.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zins, P., Dagenais, M. Tracing and Profiling Machine Learning Dataflow Applications on GPU. Int J Parallel Prog 47, 973–1013 (2019). https://doi.org/10.1007/s10766-019-00630-5

Received: 25 July 2018
Accepted: 04 February 2019
Published: 11 February 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10766-019-00630-5
