The Good, the Bad and the Ugly: Practices and Perspectives on Hardware Acceleration for Embedded Image Processing

Journal of Signal Processing Systems

Abstract

Modern embedded image processing deployment systems are heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the landscape of High-Level Synthesis technologies that are amenable to the domain, as well as new-generation Hardware Description Languages, and present our ongoing work on IMP-lang, a language for early-stage design of heterogeneous image processing systems. We show that hardware acceleration is not just a process of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed throughout the system, and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration: depending on the underlying language paradigm, different tools produce widely different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages for early design estimation, allowing designers to partition their algorithms more efficiently and to iterate towards a converged design that can then be implemented across heterogeneous elements accordingly.
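
The abstract's point about streaming can be illustrated with a small software model. The sketch below is not taken from the paper or from IMP-lang: the 3x3 box filter, the function names `naive_box3x3` and `streaming_box3x3`, and the read counters are all assumptions chosen for illustration. It contrasts a random-access formulation, which re-reads nine pixels from the frame for every output, with a streaming formulation that consumes each pixel once in raster order and keeps only two line buffers and a 3x3 window as state, which is one common way such filters are organized in hardware.

```python
def naive_box3x3(frame):
    """Random-access version: every output re-reads its 9 inputs from the frame."""
    h, w = len(frame), len(frame[0])
    out, reads = [], 0
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += frame[y + dy][x + dx]
                    reads += 1  # one frame-memory access per tap
            row.append(acc)
        out.append(row)
    return out, reads


def streaming_box3x3(pixels, width):
    """Streaming version: pixels arrive exactly once, in raster order.

    State is two line buffers (the previous two rows) plus a 3x3 window.
    """
    line1 = [0] * width  # pixel directly above the incoming one (row y-1)
    line2 = [0] * width  # pixel two rows above (row y-2)
    window = [[0] * 3 for _ in range(3)]
    out, out_row, reads = [], [], 0
    for i, p in enumerate(pixels):
        reads += 1  # the only access to the input stream
        y, x = divmod(i, width)
        # Shift the window one column left and load the new right-hand column.
        for r in range(3):
            window[r][0], window[r][1] = window[r][1], window[r][2]
        window[0][2], window[1][2], window[2][2] = line2[x], line1[x], p
        # Age the line buffers for this column.
        line2[x], line1[x] = line1[x], p
        # The window is valid once it holds three full rows and columns.
        if y >= 2 and x >= 2:
            out_row.append(sum(sum(r) for r in window))
            if x == width - 1:
                out.append(out_row)
                out_row = []
    return out, reads
```

Both functions produce the same interior result, but for an image of height H and width W the naive version performs 9(H-2)(W-2) frame reads while the streaming version performs HW reads of the input stream. In this simplified model, that difference is what separates a direct translation of the computation from a design organized around how the data flows.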

Availability of Data and Material

Not applicable.

Code Availability

Code publicly available under a Creative Commons License.

Notes

  1. https://github.com/paulofrgarcia-cmkl/IMP

Author information

Contributions

J. Fryer was responsible for software development (IMP-lang) and its description. P. Garcia was responsible for conceptualization and article writing.

Corresponding author

Correspondence to Paulo Garcia.

Ethics declarations

Ethics Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Conflict of Interest

No conflict of interest to report.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fryer, J., Garcia, P. The Good, the Bad and the Ugly: Practices and Perspectives on Hardware Acceleration for Embedded Image Processing. J Sign Process Syst 95, 1181–1201 (2023). https://doi.org/10.1007/s11265-023-01885-5

  • DOI: https://doi.org/10.1007/s11265-023-01885-5
