Abstract
Heterogeneous systems including power-efficient hardware accelerators are dominating the design of nowadays and future embedded computer architectures—as a requirement for energy-efficient system design. In this context, we discuss the main principles of invasive computing, then, we subsequently present the concept and structure of invasive tightly coupled processor arrays (TCPAs), which form the basis for our experiments throughout the book. For the efficient utilization of an invasive TCPA, through the concrete invasive language InvadeX10, compiler support is paramount. Without such support, programming that leverages the abundant parallelism in such architectures is very difficult, tedious, and error-prone. Unfortunately, even nowadays, there is a lack of compiler frameworks for generating efficient parallel code for massively parallel architectures. In this chapter, we therefore present LoopInvader, the first compiler for mapping nested loop programs onto invasive TCPAs. We furthermore discuss the fundamentals and background of the underlying models for algorithm and application specification.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
infect is implemented in terms of X10 places; an i-let is represented by an activity in X10, which is a lightweight thread.
- 2.
For the sake of better visibility, control registers and control I/O ports are not shown in Figure 2.3.
- 3.
In case of arrays, that each array element (indexed variable) is assigned only once.
- 4.
- 5.
We define the rectangular hull \(\mathrm {rectHull}(\bigcup _{i=1}^G \mathcal {I}_i)\) as the space containing all iterations of all equations S i , with 1 ≤ i ≤ G. For the sake of simplicity, we assume that the rectangular hull origins at 0. This can be always achieved by a simple translation (i.e., lower bound is equal to zero).
- 6.
Including single assignment conversion see Section 2.3.
- 7.
For the rest of this book, we assume this functionality.
- 8.
Invasive X10 loops will be automatically transformed by the LoopInvader’s front end into PAULA, as described in Section 2.3.
- 9.
For example, map computations onto a fixed number of processors, local memory/register sizes, and communication bandwidth.
- 10.
For this example, we assume an Locally Sequential Globally Parallel (LSGP) (see Section 2.3.5.4) mapping technique, where each tile—with the tile sizes described by a static tiling matrix P = diag(T, 3)— corresponds to one processor, which executes the iterations within the tile in a sequential manner.
- 11.
Such dimensions (with zero iterations) are automatically removed in PARO through a source-to-source transformation.
- 12.
It is assumed in the following that each \(\mathcal {F}_i\) can be mapped to a functional unit of a TCPA as a basic instruction. If \(\mathcal {F}_i\) is a more complex mathematical expression, the corresponding equation must be split into equations of this granularity [Tei93].
- 13.
The formula is exact if the iteration space \(\mathcal {I}\) is dense, i.e., does not contain any iteration vectors where no equation is defined.
References
Braun, M., Buchwald, S., Hack, S., Leißa, R., Mallon, C., & Zwinkau A. (2013). Simple and efficient construction of static single assignment form. In R. Jhala & K. Bosschere (Eds.), Compiler construction. Lecture notes in computer science (Vol. 7791, pp. 102–122). Berlin: Springer.
Braun, M., Buchwald, S., Mohr, M., & Zwinkau, A. (2012). An X10 Compiler for Invasive Architectures. Technical Report 9, Karlsruhe Institute of Technology.
Bastoul, C., Cohen, A., Girbal, S., Sharma, S., & Temam, O. (2003). Putting polyhedral loop transformations to work. In Workshop on Languages and Compilers for Parallel Computing (LCPC), College Station, TX, USA, October 2003. Lecture notes in computer science (Vol. 2958, pp. 23–30). Berlin: Springer.
Boppu, S., Hannig, F., & Teich, J. (2013). Loop program mapping and compact code generation for programmable hardware accelerators. In Proceedings of the 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 10–17). New york: IEEE.
Boppu, S. (2015). Code Generation for Tightly Coupled Processor Arrays. Dissertation, Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.
Bhadouria, V. S., Tanase, A., Schmid, M., Hannig, F., Teich, J., & Ghoshal, D. (2016). A novel image impulse noise removal algorithm optimized for hardware accelerators. Journal of Signal Processing Systems, 89(2), 225–245.
Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Notices, 40(10), 519–538.
Feautrier, P. (1991). Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1), 23–53.
Feautrier, P., & Lengauer, C. (2011). Polyhedron model. In Encyclopedia of parallel computing (pp. 1581–1592).
Grudnitsky, A., Bauer, L., & Henkel, J. (2017). Efficient partial online synthesis of special instructions for reconfigurable processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(2), 594–607.
Gangadharan, D., Sousa, E., Lari, V., Hannig, F., & Teich, J. (2015). Application-driven reconfiguration of shared resources for timing predictability of MPSoC platforms. In Proceedings of Asilomar Conference on Signals, Systems, and Computers (ASILOMAR) (pp. 398–403). Washington, DC, USA: IEEE Computer Society.
Gangadharan, D., Tanase, A., Hannig, F., & Teich, J. (2014). Timing analysis of a heterogeneous architecture with massively parallel processor arrays. In DATE Friday Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES). ECSI.
Hannig, F. (2009). Scheduling Techniques for High-throughput Loop Accelerators. Dissertation, University of Erlangen-Nuremberg, Germany, Verlag Dr. Hut, Munich, Germany. ISBN: 978-3-86853-220-3.
Henkel, J., Herkersdorf, A., Bauer, L., Wild, T., Hübner, M., Pujari, R. K., et al. (2012). Invasive manycore architectures. In 17th Asia and South Pacific Design Automation Conference (ASP-DAC) (pp. 193–200). New York: IEEE.
Hannig, F., Lari, V., Boppu, S., Tanase, A., & Reiche, O. (2014). Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Transactions on Embedded Computing Systems (TECS), 13(4s), 133:1–133:29.
Hannig, F., Ruckdeschel, H., Dutta, H., & Teich, J. (2008). PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC). Lecture notes in computer science, March 2008 (Vol. 4943, pp. 287–293). London, UK: Springer.
Hannig, F., Roloff, S., Snelting, G., Teich, J., & Zwinkau, A. (2011). Resource-aware programming and simulation of MPSoC architectures through extension of X10. In Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems (pp. 48–55). New York: ACM.
Hannig, F., Ruckdeschel, H., & Teich, J. (2008). The PAULA language for designing multi-dimensional dataflow-intensive applications. In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (pp. 129–138). Freiburg, Germany: Shaker.
Hannig, F., Schmid, M., Lari, V., Boppu, S., & Teich, J. (2013). System integration of tightly-coupled processor arrays using reconfigurable buffer structures. In Proceedings of the ACM International Conference on Computing Frontiers (CF) (pp. 2:1–2:4). New York: ACM.
Hannig, F., & Teich, J. (2004). Dynamic piecewise linear/regular algorithms. In International Conference on Parallel Computing in Electrical Engineering. PARELEC’04 (pp. 79–84). New York: IEEE.
Heisswolf, J., Zaib, A., Weichslgartner, A., Karle, M., Singh, M., Wild, T., et al. (2014). The invasive network on chip - a multi-objective many-core communication infrastructure. In ARCS’14; Workshop Proceedings on Architecture of Computing Systems (pp. 1–8).
Jainandunsing, K. (1986). Optimal partitioning scheme for wavefront/systolic array processors. In Proceedings of IEEE Symposium on Circuits and Systems (pp. 940–943).
Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J. (2006). A dynamically reconfigurable weakly programmable processor array architecture template. In Proceedings of the International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC) (pp. 31–37).
Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J. (2006). A highly parameterizable parallel processor array architecture. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), (pp. 105–112). New York: IEEE.
Karp, R. M., Miller, R. E., & Winograd, S. (1967). The organization of computations for uniform recurrence equations. Journal of the ACM, 14(3), 563–590.
Klues, K., Rhoden, B., Zhu, Y., Waterman, A., & Brewer, E. (2010). Processes and resource management in a scalable many-core OS. In HotPar10, Berkeley, CA, 2010.
Kissler, D, Strawetz, A., Hannig, F., & Teich, J. (2009). Power-efficient reconfiguration control in coarse-grained dynamically reconfigurable architectures. Journal of Low Power Electronics, 5(1), 96–105.
Kupriyanov, O. (2009). Modeling and Efficient Simulation of Complex System-on-a-Chip Architectures. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.
Lari, V. (2016). Invasive tightly coupled processor arrays. In Springer Book Series on Computer Architecture and Design Methodologies. Berlin: Springer. ISBN: 978-981-10-1058-3.
Lindenmaier, G., Beck, M., Boesler, B., & Geiß, R. (2005). FIRM, An Intermediate Language for Compiler Research. Technical Report 2005-8, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany.
Lengauer, C. (1993). Loop parallelization in the polytope model. In CONCUR (Vol. 715, pp. 398–416).
Lindenmaier, G. (2006). libFIRM – A Library for Compiler Optimization Research Implementing FIRM. Technical Report 2002-5, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany.
Lari, V., Narovlyanskyy, A., Hannig, F., & Teich, J. (2011). Decentralized dynamic resource management support for massively parallel processor arrays. In Proceedings of the 22nd IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), Santa Monica, CA, USA, September 2011.
Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2), 39–55.
Lari, V., Weichslgartner, A., Tanase, A., Witterauf, M., Khosravi, F., Teich, J., et al. (2016). Providing fault tolerance through invasive computing. Information Technology, 58(6), 309–328.
Moldovan, D. I., & Fortes, J. A. B. (1986). Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, C-35(1), 1–12.
Mehrara, M., Jablin, T. B., Upton, D., August, D. I., Hazelwood, K., & Mahlke, S. (2009). Compilation strategies and challenges for multicore signal processing. IEEE Signal Processing Magazine, 26(6), 55–63.
Munshi, A. (2012). The OpenCL Specification Version 1.2. Khronos OpenCL Working Group.
Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., et al. (2011). OctoPOS: A parallel operating system for invasive computing. In R. McIlroy, J. Sventek, T. Harris, & T. Roscoe (Eds.), Proceedings of the International Workshop on Systems for Future Multi-Core Architectures (SFMA). USB Proceedings of Sixth International ACM/EuroSys European Conference on Computer Systems (EuroSys), EuroSys, 2011 (pp. 9–14).
Rao, S. K. (1985). Regular Iterative Algorithms and Their Implementations on Processor Arrays. PhD thesis, Stanford University.
Rau, B. R. (1994). Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO), San Jose, CA, USA, November 1994 (pp. 63–74).
Rosen, B. K., Wegman, M. N., & Zadeck, F. K. (1988). Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL’88, New York, NY, USA (pp. 12–27).
Schmid, M., Hannig, F., Tanase, A., & Teich, J. (2014). High-level synthesis revised – Generation of FPGA accelerators from a domain-specific language using the polyhedral model. In Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing (Vol. 25, pp. 497–506). Amsterdam, The Netherlands: IOS Press.
Schmid, M., Tanase, A., Bhadouria, V. S., Hannig, F., Teich, J., & Ghoshal, D. (2014). Domain-specific augmentations for high-level synthesis. In Proceedings of the 25th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 173–177). New York: IEEE.
Sousa, E. R., Tanase, A., Hannig, F., & Teich, J. (2013). A prototype of an adaptive computer vision algorithm on MPSoC architecture. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), October 2013 (pp. 361–362). ECSI Media.
Sousa, E. R., Tanase, A., Hannig, F., & Teich, J. (2013). Accuracy and performance analysis of Harris corner computation on tightly-coupled processor arrays. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP) (pp. 88–95). New York: IEEE.
Sousa, E. R., Tanase, A., Lari, V., Hannig, F., Teich, J., Paul, J., et al. (2013). Acceleration of optical flow computations on tightly-coupled processor arrays. In Proceedings of the 25th Workshop on Parallel Systems and Algorithms (PARS), Mitteilungen – Gesellschaft für Informatik e. V., Parallel-Algorithmen und Rechnerstrukturen (Vol. 30, pp. 80–89). Gesellschaft für Informatik e. V.
Teich, J. (1993). A compiler for application specific processor arrays. Reihe Elektrotechnik. Freiburg, Germany: Shaker. ISBN: 9783861117018.
Teich, J. (2008). Invasive algorithms and architectures. Information Technology, 50(5), 300–310.
Teich, J., Glaß, M., Roloff, S., Schröder-Preikschat, W., Snelting, G., Weichslgartner, A., et al. (2016). Language and compilation of parallel programs for *-predictable MPSoC execution using invasive computing. In 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) (pp. 313–320).
Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., & Snelting, G. (2011). Multiprocessor System-on-Chip: Hardware Design and Tool Integration. Invasive computing: An overview (Chap. 11, pp. 241–268). Berlin: Springer.
Thiele, L. (1988). On the hierarchical design of vlsi processor arrays. In IEEE International Symposium on Circuits and Systems, 1988 (pp. 2517–2520). New York: IEEE.
Thiele, L. (1989). On the design of piecewise regular processor arrays. In IEEE International Symposium on Circuits and Systems (Vol. 3, pp. 2239–2242).
Tanase, A., Lari, V., Hannig, F., & Teich, J. (2012). Exploitation of quality/throughput tradeoffs in image processing through invasive computing. In Proceedings of the International Conference on Parallel Computing (ParCo) (pp. 53–62).
Thiele, L., & Roychowdhury, V. P. (1991). Systematic design of local processor arrays for numerical algorithms. In Proceedings of the International Workshop on Algorithms and Parallel VLSI Architectures, Amsterdam, The Netherlands, 1991 (Vol. A: Tutorials, pp. 329–339).
Teich, J., & Thiele, L. (1991). Control generation in the design of processor arrays. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 3(1), 77–92.
Teich, J., & Thiele, L. (1993). Partitioning of processor arrays: A piecewise regular approach. Integration-The Vlsi Journal,14(3), 297–332.
Teich, J., & Thiele, L. (1996). A new approach to solving resource-constrained scheduling problems based on a flow-model. Technical Report 17, TIK, Swiss Federal Institute of Technology (ETH) Zürich.
Teich, J., & Thiele, L. (2002). Exact partitioning of affine dependence algorithms. In Embedded Processor Design Challenges. Lecture notes in computer science (Vol. 2268, pp. 135–151). Berlin, Germany: Springer.
Teich, J., Thiele, L., & Zhang, L. (1996). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, ASAP’96 (p. 131). Washington, DC, USA: IEEE Computer Society.
Teich, J., Thiele, L., & Zhang, L. (1997). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. Journal of VLSI Signal Processing, 17(1), 5–20.
Teich, J., Thiele, L., & Zhang, L. (1997). Partitioning processor arrays under resource constraints. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 17, 5–20.
Teich, J., Weichslgartner, A., Oechslein, B., & Schröder-Preikschat, W. (2012). Invasive computing - concepts and overheads. In Proceeding of the 2012 Forum on Specification and Design Languages (pp. 217–224).
Tanase, A., Witterauf, M., Sousa, É. R., Lari, V., Hannig, F., & Teich, J. (2016). LoopInvader: A Compiler for Tightly Coupled Processor Arrays. Tool Presentation at the University Booth at Design, Automation and Test in Europe (DATE), Dresden, Germany.
Verdoolaege, S. (2010). ISL: An integer set library for the polyhedral model. In Proceedings of the Third International Congress Conference on Mathematical Software (ICMS), Kobe, Japan, 2010 (pp. 299–302). Berlin: Springer.
Verdoolaege, S., & Grosser, T. (2012). Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT’12), Paris, France.
Wildermann, S., Bader, M., Bauer, L., Damschen, M., Gabriel, D., Gerndt, M., et al. (2016). Invasive computing for timing-predictable stream processing on MPSoCs. Information Technology, 58(6), 267–280.
Wolfe, M. J. (1996). High performance compilers for parallel computing. Boston, MA, USA: Addison-Wesley.
Xue, J. (1997). On tiling as a loop transformation. Parallel Processing Letters, 7(4), 409–424.
Xue, J. (2000). Loop tiling for parallelism. Norwell, MA, USA: Kluwer Academic Publishers.
Author information
Authors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Tanase, AP., Hannig, F., Teich, J. (2018). Fundamentals and Compiler Framework. In: Symbolic Parallelization of Nested Loop Programs. Springer, Cham. https://doi.org/10.1007/978-3-319-73909-0_2
Download citation
DOI: https://doi.org/10.1007/978-3-319-73909-0_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73908-3
Online ISBN: 978-3-319-73909-0
eBook Packages: EngineeringEngineering (R0)