Journal of Signal Processing Systems, Volume 77, Issue 1–2, pp 5–29

Compact Code Generation for Tightly-Coupled Processor Arrays

  • Srinivas Boppu
  • Frank Hannig
  • Jürgen Teich


In this paper, we consider programmable tightly-coupled processor arrays consisting of interconnected small light-weight VLIW cores, which can exploit both loop-level parallelism and instruction-level parallelism. These arrays are well suited for compute-intensive nested loop applications, often providing higher power and area efficiency than commercial off-the-shelf processors. They are ideal candidates for accelerating the computation of nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important design goals for overall system-on-chip design. In this context, we present a novel design methodology for the mapping of nested loop programs onto such processor arrays. Key features of our approach are: (1) design entry in the form of a functional programming language and loop parallelization in the polyhedron model, and (2) support of zero-overhead looping not only for innermost loops but also for arbitrarily nested loops. Processors of such arrays are often limited in instruction memory size to reduce area and power consumption. Hence, (3) we present methods for code compaction and code generation, and integrate these methods into a design tool. Finally, (4) we evaluate selected benchmarks by comparing our code generator with the Trimaran and VEX compiler frameworks. As the results show, our approach can reduce the size of the generated processor code by up to 64 % (Trimaran) and 55 % (VEX) while at the same time achieving significantly higher throughput.


Massively parallel processor arrays; Code generation; Coarse-grained reconfigurable architectures; Compilers; Accelerators



This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89).



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Erlangen, Germany
