A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications

  • Research
  • Journal of Real-Time Image Processing

Abstract

This work proposes a GPU optimization methodology for real-time execution of ultra-high frame rate applications with small frame sizes. While the use of GPUs for offline processing is well established, real-time execution remains challenging because of the lack of real-time execution guarantees, especially on embedded GPUs. Our methodology introduces guidelines and a workflow that focus on: (a) controlling latency by minimizing CPU-GPU interactions; (b) computation pruning; and (c) inter- and intra-kernel optimizations. Furthermore, when the application permits the trade-off, our approach uses multi-frame processing to attain significantly higher throughput at the cost of increased latency. To evaluate the methodology, we applied it to the monitoring and control of laser powder bed fusion machines, laser powder bed fusion being a widely used metal additive manufacturing technique. Results show that, for the considered application, the required performance could be obtained on a Jetson Xavier AGX platform and that, by sacrificing latency, significantly higher throughput was achieved.
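
The throughput-versus-latency trade-off of multi-frame processing mentioned above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the authors' measurement method: the function `estimate_batch`, the linear cost model, and all numeric values are hypothetical. It assumes a fixed CPU-GPU interaction cost paid once per batch, a constant compute time per frame, and frames arriving at a constant interval. The throughput it reports is peak processing throughput; sustained throughput cannot exceed the camera's frame rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchEstimate:
    throughput_fps: float     # frames processed per second of GPU time
    worst_latency_us: float   # delay seen by the first frame of a batch


def estimate_batch(batch_size: int,
                   launch_overhead_us: float,
                   compute_per_frame_us: float,
                   frame_interval_us: float) -> BatchEstimate:
    """Toy model (assumption): one launch processes `batch_size` frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The fixed CPU-GPU interaction cost is paid once per batch.
    batch_time_us = launch_overhead_us + batch_size * compute_per_frame_us
    throughput_fps = batch_size / batch_time_us * 1e6
    # The first frame waits for the rest of the batch to arrive.
    wait_us = (batch_size - 1) * frame_interval_us
    return BatchEstimate(throughput_fps, wait_us + batch_time_us)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to show the trend.
    for b in (1, 4, 16):
        est = estimate_batch(b, launch_overhead_us=20.0,
                             compute_per_frame_us=5.0,
                             frame_interval_us=50.0)
        print(b, round(est.throughput_fps), round(est.worst_latency_us, 1))
```

In this model, larger batches amortize the per-launch overhead over more frames, so throughput rises, while the first frame of each batch waits longer before processing starts, so worst-case latency rises as well.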

Data availability

The data used in this study is not available for public sharing as it was obtained under license.

Notes

  1. In our work, a frame is considered "small" if it fits within the shared memory of a streaming multiprocessor (SM) and the size of the work (e.g., the number of pixels or elements to be processed) falls within the range of the thread-block size.

  2. A warp is a group of threads (typically 32) that execute the same instruction simultaneously on a single SM.
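
The two notes above can be turned into a quick sanity check. The following is a minimal sketch of one possible reading of Note 1, not the authors' definition in code: the limits of 48 KiB of shared memory and 1024 threads per block are assumed illustrative values (query the actual device for real ones), and the `items_per_thread` parameter is a hypothetical knob for letting each thread handle more than one element.

```python
# Assumed limits for illustration only; real values depend on the device.
ASSUMED_SHARED_MEM_BYTES = 48 * 1024    # hypothetical shared-memory budget
ASSUMED_MAX_THREADS_PER_BLOCK = 1024    # hypothetical thread-block size limit
ASSUMED_WARP_SIZE = 32                  # "typically 32" per Note 2


def is_small_frame(width: int, height: int, bytes_per_pixel: int,
                   items_per_thread: int = 1,
                   shared_mem_bytes: int = ASSUMED_SHARED_MEM_BYTES,
                   max_threads: int = ASSUMED_MAX_THREADS_PER_BLOCK) -> bool:
    """One reading of Note 1: frame fits in shared memory and in one block."""
    frame_bytes = width * height * bytes_per_pixel
    work_items = width * height
    fits_shared_memory = frame_bytes <= shared_mem_bytes
    fits_thread_block = work_items <= max_threads * items_per_thread
    return fits_shared_memory and fits_thread_block


def warps_for(threads: int, warp_size: int = ASSUMED_WARP_SIZE) -> int:
    """Number of warps needed to cover `threads` threads (rounded up)."""
    return (threads + warp_size - 1) // warp_size


if __name__ == "__main__":
    print(is_small_frame(32, 32, 1))                      # one pixel per thread
    print(is_small_frame(64, 64, 1))                      # too many work items
    print(is_small_frame(64, 64, 1, items_per_thread=4))  # four pixels per thread
    print(warps_for(1024))
```

Under these assumed limits, a 32×32 single-byte frame maps one pixel to each thread of a single block, whereas a 64×64 frame only qualifies if each thread processes several pixels.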

Acknowledgements

This work is financially supported by the VLAIO ICON project 'Vision in the Loop' (HBC.2019.2808), a collaboration between imec, Flanders Make, Materialise, Dekimo, ESMA and AdditiveLab. The RTX A6000 GPU used for this research was donated by the NVIDIA Corporation.

Author information

Contributions

MN and BG designed and developed the optimization methodology. MN carried out the experiments. MN wrote the manuscript with support from BG and BGB. BG and BGB supervised the project. All authors reviewed the manuscript.

Corresponding author

Correspondence to Mohsen Nourazar.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nourazar, M., Booth, B.G. & Goossens, B. A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications. J Real-Time Image Proc 21, 5 (2024). https://doi.org/10.1007/s11554-023-01384-7

  • DOI: https://doi.org/10.1007/s11554-023-01384-7
