Journal of Signal Processing Systems, Volume 90, Issue 1, pp 3–27

Loop Parallelization Techniques for FPGA Accelerator Synthesis

  • Oliver Reiche
  • M. Akif Özkan
  • Frank Hannig
  • Jürgen Teich
  • Moritz Schmid


Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, their support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL, and compare them to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
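The difference between the two loop transformations can be illustrated with a small software model. The following is only a sketch in plain C++, not the code generated by HIPAcc or the authors' implementation: the 3-tap mean filter `kernel_op`, the function names, and the 1D image layout are all assumptions chosen for brevity, and no vendor-specific pragmas or attributes are used. Tiling is modeled as an outer loop over image regions (each iteration standing in for one replicated accelerator), while coarsening is modeled as an inner loop of fixed width `V` that an HLS tool would be expected to unroll fully, so that only the kernel operator is replicated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel operator: 3-tap mean with clamped borders.
// It stands in for an arbitrary local image filter.
std::uint8_t kernel_op(const std::vector<std::uint8_t>& in, std::size_t i) {
    const std::size_t last = in.size() - 1;
    const int left = in[i == 0 ? 0 : i - 1];
    const int mid = in[i];
    const int right = in[i == last ? last : i + 1];
    return static_cast<std::uint8_t>((left + mid + right) / 3);
}

// Baseline: one pixel per loop iteration.
std::vector<std::uint8_t> run_baseline(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = kernel_op(in, i);
    return out;
}

// Loop tiling (software model): the image is split into `tiles` contiguous
// regions. Each iteration of the outer loop models one replicated accelerator;
// in hardware these would run in parallel. Reading neighbours from the shared
// input array stands in for the data distribution logic.
std::vector<std::uint8_t> run_tiled(const std::vector<std::uint8_t>& in,
                                    std::size_t tiles) {
    assert(tiles > 0);
    std::vector<std::uint8_t> out(in.size());
    const std::size_t tile_len = (in.size() + tiles - 1) / tiles;  // ceil
    for (std::size_t t = 0; t < tiles; ++t) {
        const std::size_t begin = t * tile_len;
        const std::size_t end = std::min(begin + tile_len, in.size());
        for (std::size_t i = begin; i < end; ++i) out[i] = kernel_op(in, i);
    }
    return out;
}

// Loop coarsening (software model): a single loop advances V pixels per
// iteration. The inner loop has a compile-time trip count, so an HLS tool
// could unroll it into V copies of the kernel operator.
template <std::size_t V>
std::vector<std::uint8_t> run_coarsened(const std::vector<std::uint8_t>& in) {
    static_assert(V > 0, "coarsening factor must be positive");
    std::vector<std::uint8_t> out(in.size());
    const std::size_t full = in.size() / V * V;  // largest multiple of V
    for (std::size_t i = 0; i < full; i += V) {
        for (std::size_t v = 0; v < V; ++v) out[i + v] = kernel_op(in, i + v);
    }
    // Remainder pixels when the image size is not a multiple of V.
    for (std::size_t i = full; i < in.size(); ++i) out[i] = kernel_op(in, i);
    return out;
}
```

Both transformed variants are meant to produce exactly the same output as the baseline for any tile count or coarsening factor; they differ only in how the iteration space is partitioned, which is what determines how much hardware has to be replicated.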


Keywords: Altera OpenCL, Vivado HLS, Vectorization, Loop coarsening, Loop tiling



Abbreviations

  • Adaptive Look-Up Table
  • Altera Offline Compiler
  • Altera SDK for OpenCL
  • Application-Specific Integrated Circuit
  • Block Random Access Memory
  • Compute Unified Device Architecture
  • Data-Level Parallelism
  • Domain-Specific Language
  • Digital Signal Processor
  • Electronic Design Automation
  • embedded GPU
  • First In First Out
  • Field Programmable Gate Array
  • Graphics Processing Unit
  • half-Adaptive Logic Module
  • Hardware Description Language
  • Heterogeneous Image Processing Acceleration
  • High-Level Synthesis
  • Integrated Development Environment
  • Initiation Interval
  • Instruction-Level Parallelism
  • Logic Utilization
  • Look-Up Table
  • Open Computing Language
  • Post Place and Route
  • Red Green Blue Alpha
  • Register Transfer Level
  • Single Instruction Multiple Data
  • Single Program Multiple Data
  • Theoretical Speedup

This work is partly supported by the German Research Foundation (DFG), as part of the Research Training Group 1773 “Heterogeneous Image Systems”, and as part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89). The Tesla K20 used for this research was donated by the Nvidia Corporation.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Oliver Reiche (1)
  • M. Akif Özkan (1)
  • Frank Hannig (1)
  • Jürgen Teich (1)
  • Moritz Schmid (2)

  1. Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander University, Erlangen-Nürnberg (FAU), Erlangen, Germany
  2. Siemens Healthcare GmbH, Forchheim, Germany
