
A scalable method for run-time loop parallelization

Published in: International Journal of Parallel Programming

Abstract

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well-behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful, since it eliminates the data dependences involving these variables and thus increases the parallelism available in the loop.
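
To make the inspector/scheduler division concrete, below is a minimal sketch in C of the general idea: an inspector partitions the iterations of a loop of the form A[w[i]] = f(A[r[i]]), whose index arrays w[] and r[] are known only at run time, into wavefronts, and a scheduler then executes the wavefronts in order. The loop shape, the example access pattern, and all identifiers here are illustrative assumptions, not the paper's code; in particular, this sketch runs sequentially and conservatively serializes on every dependence, whereas the paper's inspector builds its schedule fully in parallel and without synchronization.

```c
/*
 * Illustrative sketch of inspector/scheduler run-time loop parallelization
 * (identifiers and access pattern are assumptions, not the paper's code).
 *
 * Target loop shape:  for i = 0..N-1:  A[w[i]] = f(A[r[i]])
 * where the index arrays w[] and r[] are unknown until run time.
 */
#include <stdio.h>

#define N 8   /* number of loop iterations                  */
#define M 4   /* number of elements of the shared array A   */

static int max2(int a, int b) { return a > b ? a : b; }

int main(void) {
    /* Run-time access pattern: iteration i reads A[r[i]], writes A[w[i]]. */
    int r[N] = {0, 0, 0, 1, 2, 3, 1, 0};
    int w[N] = {1, 2, 3, 1, 2, 3, 0, 3};

    /* ---- Inspector: assign each iteration to a wavefront. ----
     * last_rd[x]/last_wr[x] = latest wavefront that read/wrote A[x].
     * Iteration i must follow: the last write to what it reads (flow),
     * the last write to what it writes (output), and the last read of
     * what it writes (anti).  Reads of the same element do not conflict. */
    int last_rd[M] = {0}, last_wr[M] = {0};
    int wf[N], n_wf = 0;
    for (int i = 0; i < N; i++) {
        int level = max2(last_wr[r[i]], max2(last_wr[w[i]], last_rd[w[i]]));
        wf[i] = level + 1;
        last_rd[r[i]] = max2(last_rd[r[i]], wf[i]);
        last_wr[w[i]] = wf[i];
        n_wf = max2(n_wf, wf[i]);
    }

    /* ---- Scheduler: execute the wavefronts in order. ----
     * All iterations within one wavefront are mutually independent, so
     * each inner loop below could run as a synchronization-free parallel
     * loop (a doall); only the wavefront boundaries need a barrier. */
    for (int level = 1; level <= n_wf; level++) {
        printf("wavefront %d:", level);
        for (int i = 0; i < N; i++)
            if (wf[i] == level)
                printf("  iter %d", i);  /* body of iteration i runs here */
        printf("\n");
    }
    return 0;
}
```

With this access pattern the sketch prints four wavefronts, {0, 1, 2}, {3, 4, 5}, {6}, {7}: the three reads of A[0] proceed concurrently, while iteration 6 must wait because it reads A[1], which iteration 3 overwrites. The paper's inspector additionally detects privatizable and reduction variables, removing many of the cross-iteration dependences this conservative sketch would enforce.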


References

  1. L. Rauchwerger, N. Amato, and D. Padua, Run-Time Methods for Parallelizing Partially Parallel Loops, Proc. of the Int'l. Conf. on Supercomputing, Barcelona, Spain, pp. 137–146 (July 1995).

  2. D. A. Padua and M. J. Wolfe, Advanced Compiler Optimizations for Supercomputers, Comm. ACM, 29:1184–1201 (December 1986).


  3. M. Wolfe, Optimizing Compilers for Supercomputers, The MIT Press, Boston, Massachusetts (1989).


  4. W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland, Massively Parallel Methods for Engineering and Science Problems, Comm. ACM, 37(4):31–41 (April 1994).


  5. W. Blume and R. Eigenmann, Performance Analysis of Parallelizing Compilers on the Perfect Benchmark Programs, IEEE Trans. on Parallel and Distr. Syst., 3(6):643–656 (November 1992).


  6. R. Eigenmann and W. Blume, An Effectiveness Study of Parallelizing Compiler Techniques, Proc. of the Int'l. Conf. on Parallel Processing, pp. 17–25 (August 1991).

  7. S. Leung and J. Zahorjan, Improving the Performance of Runtime Parallelization, Proc. of the 4th PPOPP, pp. 83–91 (May 1993).

  8. J. Saltz, R. Mirchandaney, and K. Crowley, Run-Time Parallelization and Scheduling of Loops, IEEE Trans. Comput., 40(5) (May 1991).

  9. C. Zhu and P. C. Yew, A Scheme to Enforce Data Dependence on Large Multiprocessor Systems, IEEE Trans. Softw. Eng., 13(6):726–739 (June 1987).


  10. J. E. Thornton, Design of a Computer: The Control Data 6600, Scott, Foresman, Glenview, Illinois (1971).


  11. R. M. Tomasulo, An Efficient Algorithm for Exploiting Multiple Arithmetic Units, IBM Journal of Res. and Dev., 11:25–33 (January 1967).


  12. B. J. Smith, A Pipelined, Shared Resource MIMD Computer, Proc. of the Int'l. Conf. on Parallel Processing (1978).

  13. D. Gajski, D. Kuck, D. Lawrie, and A. Sameh, CEDAR—A Large Scale Multiprocessor, Proc. of the Int'l. Conf. on Parallel Processing, pp. 524–529 (August 1983).

  14. J.-K. Peir and D. D. Gajski, Data Flow Execution of Fortran Loops, Proc. First Int'l. Conf. on Supercomputing Systems [SCS 85], pp. 129–139 (December 1985).

  15. A. K. Jones and P. Schwartz, Using Multiprocessor Systems—A Status Report, ACM Computing Surveys, 12(2):121–166 (1980).


  16. S. Midkiff and D. Padua, Compiler Algorithms for Synchronization, IEEE Trans. Comput., C-36(12):1485–1495 (1987).


  17. H. Berryman and J. Saltz, A Manual for PARTI Runtime Primitives, Interim Report 90-13, ICASE (1990).

  18. D. K. Chen, P. C. Yew, and J. Torrellas, An Efficient Algorithm for the Run-Time Parallelization of doacross Loops, Proc. of Supercomputing, pp. 518–527 (November 1994).

  19. V. Krothapalli and P. Sadayappan, An Approach to Synchronization of Parallel Computing, Proc. of the Int'l. Conf. on Supercomputing, pp. 573–581 (June 1988).

  20. C. Polychronopoulos, Compiler Optimizations for Enhancing Parallelism and Their Impact on Architecture Design, IEEE Trans. Comput., C-37(8):991–1004 (August 1988).


  21. J. Saltz and R. Mirchandaney, The Preprocessed doacross Loop, in H. D. Schwetman (ed.), Proc. of the Int'l. Conf. on Parallel Processing, Vol. II—Software, CRC Press Inc., pp. 174–178 (1991).

  22. J. Saltz, R. Mirchandaney, and K. Crowley, The doconsider Loop, Proc. of the Int'l. Conf. on Supercomputing, pp. 29–40 (June 1989).

  23. L. Rauchwerger and D. Padua, The Privatizing doall Test: A Run-Time Technique for doall Loop Identification and Array Privatization, Proc. of the Int'l. Conf. on Supercomputing, pp. 33–43 (July 1994).

  24. L. Rauchwerger and D. A. Padua, The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, Proc. of the SIGPLAN Conf. on Progr. Lang. Design and Implementation, La Jolla, California, pp. 218–232 (June 1995).

  25. L. Rauchwerger and D. Padua, Parallelizing WHILE Loops for Multiprocessor Systems, Proc. of the 9th Int'l Parallel Processing Symp. (April 1995).

  26. J. Wu, J. Saltz, S. Hiranandani, and H. Berryman, Runtime Compilation Methods for Multicomputers, in H. D. Schwetman (ed.), Proc. of the Int'l. Conf. on Parallel Processing, Vol. II—Software, CRC Press Inc., pp. 26–30 (1991).

  27. D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe, Dependence Graphs and Compiler Optimizations, Proc. of the 8th ACM Symp. on Principles of Programming Languages, pp. 207–218 (January 1981).

  28. U. Banerjee, Dependence Analysis for Supercomputing, Kluwer, Boston, Massachusetts (1988).


  29. H. Zima, Supercompilers for Parallel and Vector Computers, ACM Press, New York (1991).


  30. M. Burke, R. Cytron, J. Ferrante, and W. Hsieh, Automatic Generation of Nested, Fork-Join Parallelism, Journal of Supercomputing, pp. 71–88 (1989).

  31. Z. Li, Array Privatization for Parallel Execution of Loops, Proc. of the 19th Int'l Symp. on Computer Architecture, pp. 313–322 (1992).

  32. D. E. Maydan, S. P. Amarasinghe, and M. S. Lam, Data Dependence and Data-Flow Analysis of Arrays, Proc. 5th Workshop on Progr. Lang. and Compilers for Parallel Computing (August 1992).

  33. P. Tu and D. Padua, Array Privatization for Shared and Distributed Memory Machines, Proc. 2nd Workshop on Languages, Compilers, and Run-Time Environment for Distributed Memory Machines (September 1992).

  34. P. Tu and D. Padua, Automatic Array Privatization, Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon (1993).

  35. R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua, Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs, Lecture Notes in Computer Science 589, Proc. of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, California, pp. 65–83 (August 1991).

  36. C. Kruskal, Efficient Parallel Algorithms for Graph Problems, Proc. of the Int'l. Conf. on Parallel Processing, pp. 869–876 (August 1986).

  37. F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann (1992).

  38. J. E. Moreira and C. D. Polychronopoulos, Autoscheduling in a Distributed Shared-Memory Environment, Technical Report 1373, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development (June 1994).

  39. C. Polychronopoulos, N. Bitar, and S. Kleiman, Nano Threads: A User-Level Threads Architecture, Proc. of the Int'l. Conf. on Parallel Computing Technologies, Moscow, Russia (September 1993).

  40. Alliant Computer Systems Corporation, FX/Series Architecture Manual (1986).

  41. Alliant Computer Systems Corporation, Alliant FX/2800 Series System Description (1991).

  42. M. Berry, D. Chen, P. Koss, D. Kuck, S. Lo, Y. Pang, R. Roloff, A. Sameh, E. Clementi, S. Chin, D. Schneider, G. Fox, P. Messina, D. Walker, C. Hsiung, J. Schwarzmeier, K. Lue, S. Orszag, F. Seidl, O. Johnson, G. Swanson, R. Goodrum, and J. Martin, The PERFECT Club Benchmarks: Effective Performance Evaluation of Supercomputers, Technical Report CSRD-827, Center for Supercomputing Research and Development, University of Illinois, Urbana, Illinois (May 1989).


  43. M. Guzzi, D. Padua, J. Hoeflinger, and D. Lawrie, Cedar Fortran and Other Vector and Parallel Fortran Dialects, J. Supercomput., 4(1):37–62 (March 1990).


  44. I. S. Duff, MA28: A Set of Fortran Subroutines for Sparse Unsymmetric Linear Equations, Technical Report AERE R8730, HMSO, London (1977).




Additional information

An abstract of this paper has been published in Ref. 1.

Research supported in part by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government.

Research supported in part by Intel and NASA Graduate Fellowships.

Research supported in part by an AT&T Bell Laboratories Graduate Fellowship and by the International Computer Science Institute, Berkeley, California.



Cite this article

Rauchwerger, L., Amato, N.M. & Padua, D.A. A scalable method for run-time loop parallelization. Int J Parallel Prog 23, 537–576 (1995). https://doi.org/10.1007/BF02577866
