Fundamentals and Compiler Framework

Tanase, Alexandru-Petru; Hannig, Frank; Teich, Jürgen

doi:10.1007/978-3-319-73909-0_2

Fundamentals and Compiler Framework

Alexandru-Petru Tanase⁴,
Frank Hannig⁴ &
Jürgen Teich⁴

Chapter
First Online: 23 February 2018

310 Accesses

Abstract

Heterogeneous systems including power-efficient hardware accelerators are dominating the design of nowadays and future embedded computer architectures—as a requirement for energy-efficient system design. In this context, we discuss the main principles of invasive computing, then, we subsequently present the concept and structure of invasive tightly coupled processor arrays (TCPAs), which form the basis for our experiments throughout the book. For the efficient utilization of an invasive TCPA, through the concrete invasive language InvadeX10, compiler support is paramount. Without such support, programming that leverages the abundant parallelism in such architectures is very difficult, tedious, and error-prone. Unfortunately, even nowadays, there is a lack of compiler frameworks for generating efficient parallel code for massively parallel architectures. In this chapter, we therefore present LoopInvader, the first compiler for mapping nested loop programs onto invasive TCPAs. We furthermore discuss the fundamentals and background of the underlying models for algorithm and application specification.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
infect is implemented in terms of X10 places; an i-let is represented by an activity in X10, which is a lightweight thread.
2.
For the sake of better visibility, control registers and control I/O ports are not shown in Figure 2.3.
3.
In case of arrays, that each array element (indexed variable) is assigned only once.
4.
Throughout this book we assume w.l.o.g. that we start from a UDA, as any PLA may be systematically transformed into a UDA using localization [Thi89, TR91] (see Section 2.3.5.2) which is automatically performed in PARO.
5.
We define the rectangular hull \(\mathrm {rectHull}(\bigcup _{i=1}^G \mathcal {I}_i)\) as the space containing all iterations of all equations S _i, with 1 ≤ i ≤ G. For the sake of simplicity, we assume that the rectangular hull origins at 0. This can be always achieved by a simple translation (i.e., lower bound is equal to zero).
6.
Including single assignment conversion see Section 2.3.
7.
For the rest of this book, we assume this functionality.
8.
Invasive X10 loops will be automatically transformed by the LoopInvader’s front end into PAULA, as described in Section 2.3.
9.
For example, map computations onto a fixed number of processors, local memory/register sizes, and communication bandwidth.
10.
For this example, we assume an Locally Sequential Globally Parallel (LSGP) (see Section 2.3.5.4) mapping technique, where each tile—with the tile sizes described by a static tiling matrix P = diag(T, 3)— corresponds to one processor, which executes the iterations within the tile in a sequential manner.
11.
Such dimensions (with zero iterations) are automatically removed in PARO through a source-to-source transformation.
12.
It is assumed in the following that each \(\mathcal {F}_i\) can be mapped to a functional unit of a TCPA as a basic instruction. If \(\mathcal {F}_i\) is a more complex mathematical expression, the corresponding equation must be split into equations of this granularity [Tei93].
13.
The formula is exact if the iteration space \(\mathcal {I}\) is dense, i.e., does not contain any iteration vectors where no equation is defined.

References

Braun, M., Buchwald, S., Hack, S., Leißa, R., Mallon, C., & Zwinkau A. (2013). Simple and efficient construction of static single assignment form. In R. Jhala & K. Bosschere (Eds.), Compiler construction. Lecture notes in computer science (Vol. 7791, pp. 102–122). Berlin: Springer.
Google Scholar
Braun, M., Buchwald, S., Mohr, M., & Zwinkau, A. (2012). An X10 Compiler for Invasive Architectures. Technical Report 9, Karlsruhe Institute of Technology.
Google Scholar
Bastoul, C., Cohen, A., Girbal, S., Sharma, S., & Temam, O. (2003). Putting polyhedral loop transformations to work. In Workshop on Languages and Compilers for Parallel Computing (LCPC), College Station, TX, USA, October 2003. Lecture notes in computer science (Vol. 2958, pp. 23–30). Berlin: Springer.
Google Scholar
Boppu, S., Hannig, F., & Teich, J. (2013). Loop program mapping and compact code generation for programmable hardware accelerators. In Proceedings of the 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 10–17). New york: IEEE.
Google Scholar
Boppu, S. (2015). Code Generation for Tightly Coupled Processor Arrays. Dissertation, Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.
Google Scholar
Bhadouria, V. S., Tanase, A., Schmid, M., Hannig, F., Teich, J., & Ghoshal, D. (2016). A novel image impulse noise removal algorithm optimized for hardware accelerators. Journal of Signal Processing Systems, 89(2), 225–245.
Article Google Scholar
Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Notices, 40(10), 519–538.
Article Google Scholar
Feautrier, P. (1991). Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1), 23–53.
Article MATH Google Scholar
Feautrier, P., & Lengauer, C. (2011). Polyhedron model. In Encyclopedia of parallel computing (pp. 1581–1592).
Google Scholar
Grudnitsky, A., Bauer, L., & Henkel, J. (2017). Efficient partial online synthesis of special instructions for reconfigurable processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(2), 594–607.
Article Google Scholar
Gangadharan, D., Sousa, E., Lari, V., Hannig, F., & Teich, J. (2015). Application-driven reconfiguration of shared resources for timing predictability of MPSoC platforms. In Proceedings of Asilomar Conference on Signals, Systems, and Computers (ASILOMAR) (pp. 398–403). Washington, DC, USA: IEEE Computer Society.
Google Scholar
Gangadharan, D., Tanase, A., Hannig, F., & Teich, J. (2014). Timing analysis of a heterogeneous architecture with massively parallel processor arrays. In DATE Friday Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES). ECSI.
Google Scholar
Hannig, F. (2009). Scheduling Techniques for High-throughput Loop Accelerators. Dissertation, University of Erlangen-Nuremberg, Germany, Verlag Dr. Hut, Munich, Germany. ISBN: 978-3-86853-220-3.
Google Scholar
Henkel, J., Herkersdorf, A., Bauer, L., Wild, T., Hübner, M., Pujari, R. K., et al. (2012). Invasive manycore architectures. In 17th Asia and South Pacific Design Automation Conference (ASP-DAC) (pp. 193–200). New York: IEEE.
Chapter Google Scholar
Hannig, F., Lari, V., Boppu, S., Tanase, A., & Reiche, O. (2014). Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Transactions on Embedded Computing Systems (TECS), 13(4s), 133:1–133:29.
Google Scholar
Hannig, F., Ruckdeschel, H., Dutta, H., & Teich, J. (2008). PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC). Lecture notes in computer science, March 2008 (Vol. 4943, pp. 287–293). London, UK: Springer.
Google Scholar
Hannig, F., Roloff, S., Snelting, G., Teich, J., & Zwinkau, A. (2011). Resource-aware programming and simulation of MPSoC architectures through extension of X10. In Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems (pp. 48–55). New York: ACM.
Google Scholar
Hannig, F., Ruckdeschel, H., & Teich, J. (2008). The PAULA language for designing multi-dimensional dataflow-intensive applications. In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (pp. 129–138). Freiburg, Germany: Shaker.
Google Scholar
Hannig, F., Schmid, M., Lari, V., Boppu, S., & Teich, J. (2013). System integration of tightly-coupled processor arrays using reconfigurable buffer structures. In Proceedings of the ACM International Conference on Computing Frontiers (CF) (pp. 2:1–2:4). New York: ACM.
Google Scholar
Hannig, F., & Teich, J. (2004). Dynamic piecewise linear/regular algorithms. In International Conference on Parallel Computing in Electrical Engineering. PARELEC’04 (pp. 79–84). New York: IEEE.
Google Scholar
Heisswolf, J., Zaib, A., Weichslgartner, A., Karle, M., Singh, M., Wild, T., et al. (2014). The invasive network on chip - a multi-objective many-core communication infrastructure. In ARCS’14; Workshop Proceedings on Architecture of Computing Systems (pp. 1–8).
Google Scholar
Jainandunsing, K. (1986). Optimal partitioning scheme for wavefront/systolic array processors. In Proceedings of IEEE Symposium on Circuits and Systems (pp. 940–943).
Google Scholar
Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J. (2006). A dynamically reconfigurable weakly programmable processor array architecture template. In Proceedings of the International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC) (pp. 31–37).
Google Scholar
Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J. (2006). A highly parameterizable parallel processor array architecture. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), (pp. 105–112). New York: IEEE.
Google Scholar
Karp, R. M., Miller, R. E., & Winograd, S. (1967). The organization of computations for uniform recurrence equations. Journal of the ACM, 14(3), 563–590.
Article MathSciNet MATH Google Scholar
Klues, K., Rhoden, B., Zhu, Y., Waterman, A., & Brewer, E. (2010). Processes and resource management in a scalable many-core OS. In HotPar10, Berkeley, CA, 2010.
Google Scholar
Kissler, D, Strawetz, A., Hannig, F., & Teich, J. (2009). Power-efficient reconfiguration control in coarse-grained dynamically reconfigurable architectures. Journal of Low Power Electronics, 5(1), 96–105.
Article Google Scholar
Kupriyanov, O. (2009). Modeling and Efficient Simulation of Complex System-on-a-Chip Architectures. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.
Google Scholar
Lari, V. (2016). Invasive tightly coupled processor arrays. In Springer Book Series on Computer Architecture and Design Methodologies. Berlin: Springer. ISBN: 978-981-10-1058-3.
Google Scholar
Lindenmaier, G., Beck, M., Boesler, B., & Geiß, R. (2005). FIRM, An Intermediate Language for Compiler Research. Technical Report 2005-8, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany.
Google Scholar
Lengauer, C. (1993). Loop parallelization in the polytope model. In CONCUR (Vol. 715, pp. 398–416).
Google Scholar
Lindenmaier, G. (2006). libFIRM – A Library for Compiler Optimization Research Implementing FIRM. Technical Report 2002-5, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany.
Google Scholar
Lari, V., Narovlyanskyy, A., Hannig, F., & Teich, J. (2011). Decentralized dynamic resource management support for massively parallel processor arrays. In Proceedings of the 22nd IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), Santa Monica, CA, USA, September 2011.
Google Scholar
Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2), 39–55.
Article Google Scholar
Lari, V., Weichslgartner, A., Tanase, A., Witterauf, M., Khosravi, F., Teich, J., et al. (2016). Providing fault tolerance through invasive computing. Information Technology, 58(6), 309–328.
Google Scholar
Moldovan, D. I., & Fortes, J. A. B. (1986). Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, C-35(1), 1–12.
Article MATH Google Scholar
Mehrara, M., Jablin, T. B., Upton, D., August, D. I., Hazelwood, K., & Mahlke, S. (2009). Compilation strategies and challenges for multicore signal processing. IEEE Signal Processing Magazine, 26(6), 55–63.
Article Google Scholar
Munshi, A. (2012). The OpenCL Specification Version 1.2. Khronos OpenCL Working Group.
Google Scholar
Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., et al. (2011). OctoPOS: A parallel operating system for invasive computing. In R. McIlroy, J. Sventek, T. Harris, & T. Roscoe (Eds.), Proceedings of the International Workshop on Systems for Future Multi-Core Architectures (SFMA). USB Proceedings of Sixth International ACM/EuroSys European Conference on Computer Systems (EuroSys), EuroSys, 2011 (pp. 9–14).
Google Scholar
Rao, S. K. (1985). Regular Iterative Algorithms and Their Implementations on Processor Arrays. PhD thesis, Stanford University.
Google Scholar
Rau, B. R. (1994). Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO), San Jose, CA, USA, November 1994 (pp. 63–74).
Google Scholar
Rosen, B. K., Wegman, M. N., & Zadeck, F. K. (1988). Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL’88, New York, NY, USA (pp. 12–27).
Google Scholar
Schmid, M., Hannig, F., Tanase, A., & Teich, J. (2014). High-level synthesis revised – Generation of FPGA accelerators from a domain-specific language using the polyhedral model. In Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing (Vol. 25, pp. 497–506). Amsterdam, The Netherlands: IOS Press.
Google Scholar
Schmid, M., Tanase, A., Bhadouria, V. S., Hannig, F., Teich, J., & Ghoshal, D. (2014). Domain-specific augmentations for high-level synthesis. In Proceedings of the 25th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 173–177). New York: IEEE.
Google Scholar
Sousa, E. R., Tanase, A., Hannig, F., & Teich, J. (2013). A prototype of an adaptive computer vision algorithm on MPSoC architecture. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), October 2013 (pp. 361–362). ECSI Media.
Google Scholar
Sousa, E. R., Tanase, A., Hannig, F., & Teich, J. (2013). Accuracy and performance analysis of Harris corner computation on tightly-coupled processor arrays. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP) (pp. 88–95). New York: IEEE.
Google Scholar
Sousa, E. R., Tanase, A., Lari, V., Hannig, F., Teich, J., Paul, J., et al. (2013). Acceleration of optical flow computations on tightly-coupled processor arrays. In Proceedings of the 25th Workshop on Parallel Systems and Algorithms (PARS), Mitteilungen – Gesellschaft für Informatik e. V., Parallel-Algorithmen und Rechnerstrukturen (Vol. 30, pp. 80–89). Gesellschaft für Informatik e. V.
Google Scholar
Teich, J. (1993). A compiler for application specific processor arrays. Reihe Elektrotechnik. Freiburg, Germany: Shaker. ISBN: 9783861117018.
Google Scholar
Teich, J. (2008). Invasive algorithms and architectures. Information Technology, 50(5), 300–310.
Google Scholar
Teich, J., Glaß, M., Roloff, S., Schröder-Preikschat, W., Snelting, G., Weichslgartner, A., et al. (2016). Language and compilation of parallel programs for *-predictable MPSoC execution using invasive computing. In 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) (pp. 313–320).
Google Scholar
Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., & Snelting, G. (2011). Multiprocessor System-on-Chip: Hardware Design and Tool Integration. Invasive computing: An overview (Chap. 11, pp. 241–268). Berlin: Springer.
Google Scholar
Thiele, L. (1988). On the hierarchical design of vlsi processor arrays. In IEEE International Symposium on Circuits and Systems, 1988 (pp. 2517–2520). New York: IEEE.
Chapter Google Scholar
Thiele, L. (1989). On the design of piecewise regular processor arrays. In IEEE International Symposium on Circuits and Systems (Vol. 3, pp. 2239–2242).
Google Scholar
Tanase, A., Lari, V., Hannig, F., & Teich, J. (2012). Exploitation of quality/throughput tradeoffs in image processing through invasive computing. In Proceedings of the International Conference on Parallel Computing (ParCo) (pp. 53–62).
Google Scholar
Thiele, L., & Roychowdhury, V. P. (1991). Systematic design of local processor arrays for numerical algorithms. In Proceedings of the International Workshop on Algorithms and Parallel VLSI Architectures, Amsterdam, The Netherlands, 1991 (Vol. A: Tutorials, pp. 329–339).
Google Scholar
Teich, J., & Thiele, L. (1991). Control generation in the design of processor arrays. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 3(1), 77–92.
Article Google Scholar
Teich, J., & Thiele, L. (1993). Partitioning of processor arrays: A piecewise regular approach. Integration-The Vlsi Journal,14(3), 297–332.
Google Scholar
Teich, J., & Thiele, L. (1996). A new approach to solving resource-constrained scheduling problems based on a flow-model. Technical Report 17, TIK, Swiss Federal Institute of Technology (ETH) Zürich.
Google Scholar
Teich, J., & Thiele, L. (2002). Exact partitioning of affine dependence algorithms. In Embedded Processor Design Challenges. Lecture notes in computer science (Vol. 2268, pp. 135–151). Berlin, Germany: Springer.
Google Scholar
Teich, J., Thiele, L., & Zhang, L. (1996). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, ASAP’96 (p. 131). Washington, DC, USA: IEEE Computer Society.
Google Scholar
Teich, J., Thiele, L., & Zhang, L. (1997). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. Journal of VLSI Signal Processing, 17(1), 5–20.
Article Google Scholar
Teich, J., Thiele, L., & Zhang, L. (1997). Partitioning processor arrays under resource constraints. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 17, 5–20.
Article MATH Google Scholar
Teich, J., Weichslgartner, A., Oechslein, B., & Schröder-Preikschat, W. (2012). Invasive computing - concepts and overheads. In Proceeding of the 2012 Forum on Specification and Design Languages (pp. 217–224).
Google Scholar
Tanase, A., Witterauf, M., Sousa, É. R., Lari, V., Hannig, F., & Teich, J. (2016). LoopInvader: A Compiler for Tightly Coupled Processor Arrays. Tool Presentation at the University Booth at Design, Automation and Test in Europe (DATE), Dresden, Germany.
Google Scholar
Verdoolaege, S. (2010). ISL: An integer set library for the polyhedral model. In Proceedings of the Third International Congress Conference on Mathematical Software (ICMS), Kobe, Japan, 2010 (pp. 299–302). Berlin: Springer.
Google Scholar
Verdoolaege, S., & Grosser, T. (2012). Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT’12), Paris, France.
Google Scholar
Wildermann, S., Bader, M., Bauer, L., Damschen, M., Gabriel, D., Gerndt, M., et al. (2016). Invasive computing for timing-predictable stream processing on MPSoCs. Information Technology, 58(6), 267–280.
Google Scholar
Wolfe, M. J. (1996). High performance compilers for parallel computing. Boston, MA, USA: Addison-Wesley.
MATH Google Scholar
Xue, J. (1997). On tiling as a loop transformation. Parallel Processing Letters, 7(4), 409–424.
Article MathSciNet Google Scholar
Xue, J. (2000). Loop tiling for parallelism. Norwell, MA, USA: Kluwer Academic Publishers.
Book MATH Google Scholar

Download references

Author information

Authors and Affiliations

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Alexandru-Petru Tanase, Frank Hannig & Jürgen Teich

Authors

Alexandru-Petru Tanase
View author publications
You can also search for this author in PubMed Google Scholar
Frank Hannig
View author publications
You can also search for this author in PubMed Google Scholar
Jürgen Teich
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Tanase, AP., Hannig, F., Teich, J. (2018). Fundamentals and Compiler Framework. In: Symbolic Parallelization of Nested Loop Programs. Springer, Cham. https://doi.org/10.1007/978-3-319-73909-0_2

Download citation

DOI: https://doi.org/10.1007/978-3-319-73909-0_2
Published: 23 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73908-3
Online ISBN: 978-3-319-73909-0
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics