Loop parallelization in the polytope model

  • Christian Lengauer
Invited Talk
Part of the Lecture Notes in Computer Science book series (LNCS, volume 715)

Abstract

During the course of the last decade, a mathematical model for the parallelization of FOR-loops has become increasingly popular. In this model, a (perfect) nest of r FOR-loops is represented by a convex polytope in ℤ^r. The boundaries of each loop specify the extent of the polytope in a distinct dimension. Various ways of slicing and segmenting the polytope yield a multitude of guaranteed correct mappings of the loops' operations in space-time. These transformations have a very intuitive interpretation and can be easily quantified and automated due to their mathematical foundation in linear programming and linear algebra. With the recent availability of massively parallel computers, the idea of loop parallelization is gaining significance, since it promises execution speed-ups of orders of magnitude. The polytope model for loop parallelization has its origin in systolic design, but it applies in more general settings and methods based on it will become a part of future parallelizing compilers. This paper provides an overview and future perspective of the polytope model and methods based on it.
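
As an illustration of the idea described in the abstract, the following minimal sketch treats a two-deep loop nest as the set of integer points of a rectangle and maps each point to a (time, processor) pair with two linear functions. The example nest, the dependence vectors, the schedule `(1, 1)`, the allocation `(1, 0)`, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from itertools import product

# Hypothetical loop nest (an assumption, not from the paper):
#   for i in 0..n:
#     for j in 0..m:
#       a[i][j] = a[i-1][j] + a[i][j-1]
# Each iteration (i, j) depends on (i-1, j) and (i, j-1).
DEPENDENCES = [(1, 0), (0, 1)]   # uniform dependence vectors
SCHEDULE = (1, 1)                # assumed schedule: time = i + j
ALLOCATION = (1, 0)              # assumed allocation: processor = i


def index_space(n, m):
    """Integer points of the polytope 0 <= i <= n, 0 <= j <= m."""
    return list(product(range(n + 1), range(m + 1)))


def dot(u, v):
    """Inner product of two integer vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def schedule_is_valid(schedule, dependences):
    """A linear schedule is valid if every dependence advances time."""
    return all(dot(schedule, d) >= 1 for d in dependences)


def space_time_map(points, schedule, allocation):
    """Group iterations by time step; entries are (processor, point)."""
    steps = defaultdict(list)
    for p in points:
        steps[dot(schedule, p)].append((dot(allocation, p), p))
    return dict(steps)


if __name__ == "__main__":
    points = index_space(2, 3)
    assert schedule_is_valid(SCHEDULE, DEPENDENCES)
    for t, work in sorted(space_time_map(points, SCHEDULE, ALLOCATION).items()):
        print(t, work)
```

In this sketch the 12 iterations of the 3 × 4 index space are spread over 6 time steps, and iterations that share a time step can run on different processors. The choice of schedule and allocation here is only one of many valid mappings.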

Keywords

Dependence Graph, Systolic Array, Processor Array, Index Space, VLSI Signal Processing
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 1993

Authors and Affiliations

  • Christian Lengauer, Fakultät für Mathematik und Informatik, Universität Passau, Passau, Germany
