Abstract
Modern embedded image processing systems are deployed on heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators, and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the High Level Synthesis technologies that suit the domain, as well as new-generation Hardware Description Languages, and present our ongoing work on IMP-lang, a language for early-stage design of heterogeneous image processing systems. We show that hardware acceleration is not simply a matter of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed through the system and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration: depending on the underlying language paradigm, different tools produce wildly different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages for early design estimation: designers can partition their algorithms more efficiently and iterate towards a convergent design that can then be implemented across heterogeneous elements.
Availability of Data and Material
Not applicable.
Code Availability
The code is publicly available under a Creative Commons license.
Author information
Contributions
J. Fryer was responsible for software development (IMP-Lang) and its description. P. Garcia was responsible for conceptualization and article writing.
Ethics declarations
Ethics Approval
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Conflict of Interest
No conflict of interest to report.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fryer, J., Garcia, P. The Good, the Bad and the Ugly: Practices and Perspectives on Hardware Acceleration for Embedded Image Processing. J Sign Process Syst 95, 1181–1201 (2023). https://doi.org/10.1007/s11265-023-01885-5