Journal of Signal Processing Systems, Volume 77, Issue 1–2, pp 31–59

Symbolic Mapping of Loop Programs onto Processor Arrays

Abstract

In this paper, we present a solution to the problem of jointly tiling and scheduling a given loop nest with uniform data dependencies symbolically. This challenge arises when the size and number of processors available for parallel loop execution are not known at compile time. Still, in order to avoid the overhead of dynamic (run-time) recompilation, a schedule of loop iterations must be computed and optimized statically. We show that latency-optimal parameterized schedules can be derived statically by a two-step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parameterized extents. Subsequently, the resulting tiled program is also scheduled symbolically, yielding a set of latency-optimal parameterized schedule candidates. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions decide which of these schedules is selected and which program configuration is executed on the resulting processor array, so that any further run-time optimization or expensive recompilation is avoided. Our theory of symbolic loop parallelization is applied to a number of loop programs from the domains of signal processing and linear algebra. Finally, as a proof of concept, we demonstrate the proposed methodology on a massively parallel processor array architecture called tightly coupled processor array (TCPA), on which applications may dynamically claim regions of processors in the context of invasive computing.
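The run-time selection step described above can be illustrated with a small sketch: each statically derived schedule candidate carries a symbolic latency expression, and once the processor array size becomes known, the candidate with the smallest latency is picked by simple evaluation and comparison. The candidate names and latency formulas below are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch of run-time schedule selection among statically
# precomputed, parameterized schedule candidates. Each candidate's
# latency is a function of the array size (p1 x p2) and problem size n.
# The concrete formulas here are illustrative assumptions only.

def select_schedule(candidates, p1, p2, n):
    """Return the candidate with minimal latency for the given
    processor array size (p1 x p2) and problem size n."""
    return min(candidates, key=lambda c: c["latency"](p1, p2, n))

# Hypothetical schedule candidates with made-up latency expressions.
candidates = [
    {"name": "row-major", "latency": lambda p1, p2, n: n * n // (p1 * p2) + p1},
    {"name": "col-major", "latency": lambda p1, p2, n: n * n // (p1 * p2) + p2},
]

best = select_schedule(candidates, p1=4, p2=8, n=64)
print(best["name"])  # prints "row-major" (latency 132 vs. 136)
```

The point of the approach is that only these cheap comparisons happen at run time; all schedule candidates and their latency expressions are computed and optimized at compile time.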

Keywords

Symbolic loop parallelization


Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

University of Erlangen-Nuremberg (FAU), Erlangen, Germany
