Fundamentals and Compiler Framework

Abstract

Heterogeneous systems that include power-efficient hardware accelerators dominate the design of current and future embedded computer architectures, driven by the requirement for energy-efficient system design. In this context, we discuss the main principles of invasive computing and then present the concept and structure of invasive tightly coupled processor arrays (TCPAs), which form the basis for the experiments throughout the book. Compiler support is paramount for utilizing an invasive TCPA efficiently through the concrete invasive language InvadeX10. Without such support, programming that leverages the abundant parallelism of such architectures is difficult, tedious, and error-prone. Even today, compiler frameworks that generate efficient parallel code for massively parallel architectures are lacking. In this chapter, we therefore present LoopInvader, the first compiler for mapping nested loop programs onto invasive TCPAs. We furthermore discuss the fundamentals and background of the underlying models for algorithm and application specification.

Notes

  1. infect is implemented in terms of X10 places; an i-let is represented by an activity in X10, which is a lightweight thread.

  2. For the sake of better visibility, control registers and control I/O ports are not shown in Figure 2.3.

  3. In the case of arrays, this means that each array element (indexed variable) is assigned only once.

  4. Throughout this book, we assume w.l.o.g. that we start from a UDA, as any PLA may be systematically transformed into a UDA using localization [Thi89, TR91] (see Section 2.3.5.2), which is automatically performed in PARO.

  5. We define the rectangular hull \(\mathrm{rectHull}(\bigcup_{i=1}^{G} \mathcal{I}_i)\) as the space containing all iterations of all equations \(S_i\), with \(1 \le i \le G\). For the sake of simplicity, we assume that the rectangular hull has its origin at 0 (i.e., its lower bound is equal to zero). This can always be achieved by a simple translation.

  6. Including single assignment conversion (see Section 2.3).

  7. For the rest of this book, we assume this functionality.

  8. Invasive X10 loops are automatically transformed by LoopInvader’s front end into PAULA, as described in Section 2.3.

  9. For example, map computations onto a fixed number of processors, local memory/register sizes, and communication bandwidth.

  10. For this example, we assume a Locally Sequential Globally Parallel (LSGP) mapping technique (see Section 2.3.5.4), where each tile (with the tile sizes described by a static tiling matrix \(P = \mathrm{diag}(T, 3)\)) corresponds to one processor, which executes the iterations within the tile in a sequential manner.

  11. Such dimensions (with zero iterations) are automatically removed in PARO through a source-to-source transformation.

  12. It is assumed in the following that each \(\mathcal{F}_i\) can be mapped to a functional unit of a TCPA as a basic instruction. If \(\mathcal{F}_i\) is a more complex mathematical expression, the corresponding equation must be split into equations of this granularity [Tei93].

  13. The formula is exact if the iteration space \(\mathcal{I}\) is dense, i.e., does not contain any iteration vectors where no equation is defined.
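
Notes 5 and 10 can be made concrete with a small sketch. The Python below is an illustration only and not code from PARO or LoopInvader: the helper names (`rect_hull`, `translate`, `lsgp_split`), the two iteration spaces, and the choice T = 4 are all assumptions made here. It computes a rectangular hull, shifts it to the origin, and splits a shifted iteration vector into a tile index (the processor under LSGP) and an intra-tile index, assuming the usual tiling decomposition I = J + P·K with P = diag(T, 3).

```python
def rect_hull(iteration_spaces):
    """Return (lower, upper) corners of the smallest axis-aligned box
    containing every iteration vector of every given iteration space."""
    points = [p for space in iteration_spaces for p in space]
    dims = range(len(points[0]))
    lower = tuple(min(p[d] for p in points) for d in dims)
    upper = tuple(max(p[d] for p in points) for d in dims)
    return lower, upper


def translate(point, lower):
    """Shift a point so that the hull's lower corner lands on the origin."""
    return tuple(c - l for c, l in zip(point, lower))


def lsgp_split(iteration, tile_sizes):
    """Split a non-negative iteration vector I into (K, J) with
    I = J + P*K, where P = diag(tile_sizes). K selects the tile (one
    processor per tile); J is the position inside the tile, whose
    iterations that processor executes sequentially."""
    k = tuple(i // p for i, p in zip(iteration, tile_sizes))
    j = tuple(i % p for i, p in zip(iteration, tile_sizes))
    return k, j


# Two hypothetical iteration spaces (made-up bounds, not from the chapter).
space_1 = [(i1, i2) for i1 in range(2, 6) for i2 in range(1, 4)]
space_2 = [(i1, i2) for i1 in range(4, 10) for i2 in range(3, 7)]

lower, upper = rect_hull([space_1, space_2])        # ((2, 1), (9, 6))
tile_sizes = (4, 3)                                 # P = diag(T, 3), T = 4 assumed

shifted = translate((7, 5), lower)                  # (5, 4)
processor, local = lsgp_split(shifted, tile_sizes)  # (1, 1) and (1, 1)
```

In the chapter, T is a symbolic tile size; the sketch fixes it to a constant only so that the arithmetic can be followed by hand.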

References

  1. Braun, M., Buchwald, S., Hack, S., Leißa, R., Mallon, C., & Zwinkau, A. (2013). Simple and efficient construction of static single assignment form. In R. Jhala & K. Bosschere (Eds.), Compiler construction. Lecture notes in computer science (Vol. 7791, pp. 102–122). Berlin: Springer.

  2. Braun, M., Buchwald, S., Mohr, M., & Zwinkau, A. (2012). An X10 Compiler for Invasive Architectures. Technical Report 9, Karlsruhe Institute of Technology.

  3. Bastoul, C., Cohen, A., Girbal, S., Sharma, S., & Temam, O. (2003). Putting polyhedral loop transformations to work. In Workshop on Languages and Compilers for Parallel Computing (LCPC), College Station, TX, USA, October 2003. Lecture notes in computer science (Vol. 2958, pp. 23–30). Berlin: Springer.

  4. Boppu, S., Hannig, F., & Teich, J. (2013). Loop program mapping and compact code generation for programmable hardware accelerators. In Proceedings of the 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 10–17). New York: IEEE.

  5. Boppu, S. (2015). Code Generation for Tightly Coupled Processor Arrays. Dissertation, Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

  6. Bhadouria, V. S., Tanase, A., Schmid, M., Hannig, F., Teich, J., & Ghoshal, D. (2016). A novel image impulse noise removal algorithm optimized for hardware accelerators. Journal of Signal Processing Systems, 89(2), 225–245.

  7. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., et al. (2005). X10: An object-oriented approach to non-uniform cluster computing. ACM SIGPLAN Notices, 40(10), 519–538.

  8. Feautrier, P. (1991). Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1), 23–53.

  9. Feautrier, P., & Lengauer, C. (2011). Polyhedron model. In Encyclopedia of parallel computing (pp. 1581–1592).

  10. Grudnitsky, A., Bauer, L., & Henkel, J. (2017). Efficient partial online synthesis of special instructions for reconfigurable processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(2), 594–607.

  11. Gangadharan, D., Sousa, E., Lari, V., Hannig, F., & Teich, J. (2015). Application-driven reconfiguration of shared resources for timing predictability of MPSoC platforms. In Proceedings of Asilomar Conference on Signals, Systems, and Computers (ASILOMAR) (pp. 398–403). Washington, DC, USA: IEEE Computer Society.

  12. Gangadharan, D., Tanase, A., Hannig, F., & Teich, J. (2014). Timing analysis of a heterogeneous architecture with massively parallel processor arrays. In DATE Friday Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES). ECSI.

  13. Hannig, F. (2009). Scheduling Techniques for High-throughput Loop Accelerators. Dissertation, University of Erlangen-Nuremberg, Germany, Verlag Dr. Hut, Munich, Germany. ISBN: 978-3-86853-220-3.

  14. Henkel, J., Herkersdorf, A., Bauer, L., Wild, T., Hübner, M., Pujari, R. K., et al. (2012). Invasive manycore architectures. In 17th Asia and South Pacific Design Automation Conference (ASP-DAC) (pp. 193–200). New York: IEEE.

  15. Hannig, F., Lari, V., Boppu, S., Tanase, A., & Reiche, O. (2014). Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Transactions on Embedded Computing Systems (TECS), 13(4s), 133:1–133:29.

  16. Hannig, F., Ruckdeschel, H., Dutta, H., & Teich, J. (2008). PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC). Lecture notes in computer science, March 2008 (Vol. 4943, pp. 287–293). London, UK: Springer.

  17. Hannig, F., Roloff, S., Snelting, G., Teich, J., & Zwinkau, A. (2011). Resource-aware programming and simulation of MPSoC architectures through extension of X10. In Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems (pp. 48–55). New York: ACM.

  18. Hannig, F., Ruckdeschel, H., & Teich, J. (2008). The PAULA language for designing multi-dimensional dataflow-intensive applications. In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (pp. 129–138). Freiburg, Germany: Shaker.

  19. Hannig, F., Schmid, M., Lari, V., Boppu, S., & Teich, J. (2013). System integration of tightly-coupled processor arrays using reconfigurable buffer structures. In Proceedings of the ACM International Conference on Computing Frontiers (CF) (pp. 2:1–2:4). New York: ACM.

  20. Hannig, F., & Teich, J. (2004). Dynamic piecewise linear/regular algorithms. In International Conference on Parallel Computing in Electrical Engineering. PARELEC’04 (pp. 79–84). New York: IEEE.

  21. Heisswolf, J., Zaib, A., Weichslgartner, A., Karle, M., Singh, M., Wild, T., et al. (2014). The invasive network on chip - a multi-objective many-core communication infrastructure. In ARCS’14; Workshop Proceedings on Architecture of Computing Systems (pp. 1–8).

  22. Jainandunsing, K. (1986). Optimal partitioning scheme for wavefront/systolic array processors. In Proceedings of IEEE Symposium on Circuits and Systems (pp. 940–943).

  23. Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J. (2006). A dynamically reconfigurable weakly programmable processor array architecture template. In Proceedings of the International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC) (pp. 31–37).

  24. Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J. (2006). A highly parameterizable parallel processor array architecture. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), (pp. 105–112). New York: IEEE.

  25. Karp, R. M., Miller, R. E., & Winograd, S. (1967). The organization of computations for uniform recurrence equations. Journal of the ACM, 14(3), 563–590.

  26. Klues, K., Rhoden, B., Zhu, Y., Waterman, A., & Brewer, E. (2010). Processes and resource management in a scalable many-core OS. In HotPar10, Berkeley, CA, 2010.

  27. Kissler, D., Strawetz, A., Hannig, F., & Teich, J. (2009). Power-efficient reconfiguration control in coarse-grained dynamically reconfigurable architectures. Journal of Low Power Electronics, 5(1), 96–105.

  28. Kupriyanov, O. (2009). Modeling and Efficient Simulation of Complex System-on-a-Chip Architectures. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

  29. Lari, V. (2016). Invasive tightly coupled processor arrays. In Springer Book Series on Computer Architecture and Design Methodologies. Berlin: Springer. ISBN: 978-981-10-1058-3.

  30. Lindenmaier, G., Beck, M., Boesler, B., & Geiß, R. (2005). FIRM, An Intermediate Language for Compiler Research. Technical Report 2005-8, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany.

  31. Lengauer, C. (1993). Loop parallelization in the polytope model. In CONCUR (Vol. 715, pp. 398–416).

  32. Lindenmaier, G. (2006). libFIRM – A Library for Compiler Optimization Research Implementing FIRM. Technical Report 2002-5, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany.

  33. Lari, V., Narovlyanskyy, A., Hannig, F., & Teich, J. (2011). Decentralized dynamic resource management support for massively parallel processor arrays. In Proceedings of the 22nd IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), Santa Monica, CA, USA, September 2011.

  34. Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2), 39–55.

  35. Lari, V., Weichslgartner, A., Tanase, A., Witterauf, M., Khosravi, F., Teich, J., et al. (2016). Providing fault tolerance through invasive computing. Information Technology, 58(6), 309–328.

  36. Moldovan, D. I., & Fortes, J. A. B. (1986). Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, C-35(1), 1–12.

  37. Mehrara, M., Jablin, T. B., Upton, D., August, D. I., Hazelwood, K., & Mahlke, S. (2009). Compilation strategies and challenges for multicore signal processing. IEEE Signal Processing Magazine, 26(6), 55–63.

  38. Munshi, A. (2012). The OpenCL Specification Version 1.2. Khronos OpenCL Working Group.

  39. Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., et al. (2011). OctoPOS: A parallel operating system for invasive computing. In R. McIlroy, J. Sventek, T. Harris, & T. Roscoe (Eds.), Proceedings of the International Workshop on Systems for Future Multi-Core Architectures (SFMA). USB Proceedings of Sixth International ACM/EuroSys European Conference on Computer Systems (EuroSys), EuroSys, 2011 (pp. 9–14).

  40. Rao, S. K. (1985). Regular Iterative Algorithms and Their Implementations on Processor Arrays. PhD thesis, Stanford University.

  41. Rau, B. R. (1994). Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO), San Jose, CA, USA, November 1994 (pp. 63–74).

  42. Rosen, B. K., Wegman, M. N., & Zadeck, F. K. (1988). Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL’88, New York, NY, USA (pp. 12–27).

  43. Schmid, M., Hannig, F., Tanase, A., & Teich, J. (2014). High-level synthesis revised – Generation of FPGA accelerators from a domain-specific language using the polyhedral model. In Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing (Vol. 25, pp. 497–506). Amsterdam, The Netherlands: IOS Press.

  44. Schmid, M., Tanase, A., Bhadouria, V. S., Hannig, F., Teich, J., & Ghoshal, D. (2014). Domain-specific augmentations for high-level synthesis. In Proceedings of the 25th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 173–177). New York: IEEE.

  45. Sousa, E. R., Tanase, A., Hannig, F., & Teich, J. (2013). A prototype of an adaptive computer vision algorithm on MPSoC architecture. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), October 2013 (pp. 361–362). ECSI Media.

  46. Sousa, E. R., Tanase, A., Hannig, F., & Teich, J. (2013). Accuracy and performance analysis of Harris corner computation on tightly-coupled processor arrays. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP) (pp. 88–95). New York: IEEE.

  47. Sousa, E. R., Tanase, A., Lari, V., Hannig, F., Teich, J., Paul, J., et al. (2013). Acceleration of optical flow computations on tightly-coupled processor arrays. In Proceedings of the 25th Workshop on Parallel Systems and Algorithms (PARS), Mitteilungen – Gesellschaft für Informatik e. V., Parallel-Algorithmen und Rechnerstrukturen (Vol. 30, pp. 80–89). Gesellschaft für Informatik e. V.

  48. Teich, J. (1993). A compiler for application specific processor arrays. Reihe Elektrotechnik. Freiburg, Germany: Shaker. ISBN: 9783861117018.

  49. Teich, J. (2008). Invasive algorithms and architectures. Information Technology, 50(5), 300–310.

  50. Teich, J., Glaß, M., Roloff, S., Schröder-Preikschat, W., Snelting, G., Weichslgartner, A., et al. (2016). Language and compilation of parallel programs for *-predictable MPSoC execution using invasive computing. In 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) (pp. 313–320).

  51. Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., & Snelting, G. (2011). Multiprocessor System-on-Chip: Hardware Design and Tool Integration. Invasive computing: An overview (Chap. 11, pp. 241–268). Berlin: Springer.

  52. Thiele, L. (1988). On the hierarchical design of VLSI processor arrays. In IEEE International Symposium on Circuits and Systems, 1988 (pp. 2517–2520). New York: IEEE.

  53. Thiele, L. (1989). On the design of piecewise regular processor arrays. In IEEE International Symposium on Circuits and Systems (Vol. 3, pp. 2239–2242).

  54. Tanase, A., Lari, V., Hannig, F., & Teich, J. (2012). Exploitation of quality/throughput tradeoffs in image processing through invasive computing. In Proceedings of the International Conference on Parallel Computing (ParCo) (pp. 53–62).

  55. Thiele, L., & Roychowdhury, V. P. (1991). Systematic design of local processor arrays for numerical algorithms. In Proceedings of the International Workshop on Algorithms and Parallel VLSI Architectures, Amsterdam, The Netherlands, 1991 (Vol. A: Tutorials, pp. 329–339).

  56. Teich, J., & Thiele, L. (1991). Control generation in the design of processor arrays. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 3(1), 77–92.

  57. Teich, J., & Thiele, L. (1993). Partitioning of processor arrays: A piecewise regular approach. Integration, the VLSI Journal, 14(3), 297–332.

  58. Teich, J., & Thiele, L. (1996). A new approach to solving resource-constrained scheduling problems based on a flow-model. Technical Report 17, TIK, Swiss Federal Institute of Technology (ETH) Zürich.

  59. Teich, J., & Thiele, L. (2002). Exact partitioning of affine dependence algorithms. In Embedded Processor Design Challenges. Lecture notes in computer science (Vol. 2268, pp. 135–151). Berlin, Germany: Springer.

  60. Teich, J., Thiele, L., & Zhang, L. (1996). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, ASAP’96 (p. 131). Washington, DC, USA: IEEE Computer Society.

  61. Teich, J., Thiele, L., & Zhang, L. (1997). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. Journal of VLSI Signal Processing, 17(1), 5–20.

  62. Teich, J., Thiele, L., & Zhang, L. (1997). Partitioning processor arrays under resource constraints. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 17, 5–20.

  63. Teich, J., Weichslgartner, A., Oechslein, B., & Schröder-Preikschat, W. (2012). Invasive computing - concepts and overheads. In Proceedings of the 2012 Forum on Specification and Design Languages (pp. 217–224).

  64. Tanase, A., Witterauf, M., Sousa, É. R., Lari, V., Hannig, F., & Teich, J. (2016). LoopInvader: A Compiler for Tightly Coupled Processor Arrays. Tool Presentation at the University Booth at Design, Automation and Test in Europe (DATE), Dresden, Germany.

  65. Verdoolaege, S. (2010). ISL: An integer set library for the polyhedral model. In Proceedings of the Third International Congress on Mathematical Software (ICMS), Kobe, Japan, 2010 (pp. 299–302). Berlin: Springer.

  66. Verdoolaege, S., & Grosser, T. (2012). Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT’12), Paris, France.

  67. Wildermann, S., Bader, M., Bauer, L., Damschen, M., Gabriel, D., Gerndt, M., et al. (2016). Invasive computing for timing-predictable stream processing on MPSoCs. Information Technology, 58(6), 267–280.

  68. Wolfe, M. J. (1996). High performance compilers for parallel computing. Boston, MA, USA: Addison-Wesley.

  69. Xue, J. (1997). On tiling as a loop transformation. Parallel Processing Letters, 7(4), 409–424.

  70. Xue, J. (2000). Loop tiling for parallelism. Norwell, MA, USA: Kluwer Academic Publishers.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Tanase, AP., Hannig, F., Teich, J. (2018). Fundamentals and Compiler Framework. In: Symbolic Parallelization of Nested Loop Programs. Springer, Cham. https://doi.org/10.1007/978-3-319-73909-0_2

  • DOI: https://doi.org/10.1007/978-3-319-73909-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73908-3

  • Online ISBN: 978-3-319-73909-0

  • eBook Packages: Engineering, Engineering (R0)
