Abstract
With the increasing demand for high-performance computing in application domains with stringent power budgets, coarse-grained reconfigurable array (CGRA) architectures have become a popular choice among researchers and manufacturers. Loops are the hot spots of kernels running on CGRAs, and several techniques have therefore been devised to optimize loop execution. However, work in this direction consists predominantly of software-based solutions. This paper addresses the optimization opportunities at a deeper level and introduces a hardware-based loop control mechanism that supports arbitrarily nested loops of up to four levels. The major contributions of this work are a lightweight Hardware Loop Block (HLB) for CGRAs that eliminates the control instruction overhead of loops, and an acyclic graph transformation that removes loop branches from the application control data flow graph (CDFG). When tested on a set of kernels chosen from various application domains, the design achieved a maximum speed-up of 1.9\(\times \) and an average speed-up of 1.5\(\times \) over the conventional approach. The total number of instructions executed is halved for almost all the kernels, with area and power consumption overheads of 2.6% and 0.8%, respectively.
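To make the idea of control instruction overhead concrete, the following is a minimal Python sketch of a toy model, not the HLB design described in the paper. It assumes that a conventional software loop spends three control instructions per iteration at each nesting level (increment, compare, branch), and it models hardware loop counters as an odometer that advances without any instructions in the kernel. The names `HardwareLoopModel` and `software_control_instructions`, and the per-iteration cost of three, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

MAX_LEVELS = 4  # the abstract states support for up to four nesting levels


@dataclass
class HardwareLoopModel:
    """Toy model of nested loop counters (hypothetical, not the paper's HLB)."""

    trip_counts: List[int]  # trip count per level, outermost first

    def __post_init__(self) -> None:
        if not 1 <= len(self.trip_counts) <= MAX_LEVELS:
            raise ValueError("between 1 and 4 loop levels are supported")
        if any(n <= 0 for n in self.trip_counts):
            raise ValueError("trip counts must be positive")

    def indices(self) -> Iterator[Tuple[int, ...]]:
        """Yield the index tuple of every innermost iteration, in program order."""
        counters = [0] * len(self.trip_counts)
        while True:
            yield tuple(counters)
            # Advance like an odometer: innermost counter first, carry outward.
            level = len(counters) - 1
            while level >= 0:
                counters[level] += 1
                if counters[level] < self.trip_counts[level]:
                    break
                counters[level] = 0
                level -= 1
            if level < 0:
                return  # the outermost counter wrapped: the nest is finished


def software_control_instructions(trip_counts: List[int], per_iteration: int = 3) -> int:
    """Count control instructions for a software-managed loop nest.

    Assumption: every iteration of every level costs `per_iteration`
    control instructions (increment, compare, branch).
    """
    total = 0
    iterations = 1
    for n in trip_counts:
        iterations *= n  # how many times this level's body runs in total
        total += per_iteration * iterations
    return total
```

Under these assumptions, a two-level nest with trip counts 2 and 3 costs `software_control_instructions([2, 3])` control instructions in software, while `HardwareLoopModel([2, 3]).indices()` produces the same iteration order with the counters maintained outside the instruction stream. The real savings depend on the target architecture and the kernel.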
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Sunny, C., Das, S., Martin, K.J.M., Coussy, P. (2021). Hardware Based Loop Optimization for CGRA Architectures. In: Derrien, S., Hannig, F., Diniz, P.C., Chillet, D. (eds) Applied Reconfigurable Computing. Architectures, Tools, and Applications. ARC 2021. Lecture Notes in Computer Science, vol 12700. Springer, Cham. https://doi.org/10.1007/978-3-030-79025-7_5
DOI: https://doi.org/10.1007/978-3-030-79025-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79024-0
Online ISBN: 978-3-030-79025-7
eBook Packages: Computer Science (R0)