A Practical Approach to DOACROSS Parallelization

  • Priya Unnikrishnan
  • Jun Shirako
  • Kit Barton
  • Sanjay Chatterjee
  • Raul Silvera
  • Vivek Sarkar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7484)


Abstract

Loops with cross-iteration dependences (doacross loops) often contain significant amounts of parallelism that can potentially be exploited on modern manycore processors. However, most production-strength compilers focus their automatic parallelization efforts on doall loops, and consider doacross parallelism to be impractical due to the space inefficiencies and the synchronization overheads of past approaches. This paper presents a novel and practical approach to automatically parallelizing doacross loops for execution on manycore-SMP systems. We introduce a compiler-and-runtime optimization called dependence folding that bounds the number of synchronization variables allocated per worker thread (processor core) to be at most the maximum depth of a loop nest being considered for automatic parallelization. Our approach has been implemented in a development version of the IBM XL Fortran V13.1 commercial parallelizing compiler and runtime system. For four benchmarks where automatic doall parallelization was largely ineffective (speedups of under 2×), our implementation delivered speedups of 6.5×, 9.0×, 17.3×, and 17.5× on a 32-core IBM Power7 SMP system, thereby showing that doacross parallelization can be a valuable technique to complement doall parallelization.
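To make the doacross idea concrete, the following is a minimal illustrative sketch (in Python, not the paper's XL Fortran implementation) of pipelined doacross execution of a 2D wavefront recurrence. In the spirit of dependence folding, each worker maintains a single synchronization counter recording its last completed column, rather than one flag per iteration; the function name `doacross_wavefront` and the blocking scheme are assumptions for illustration only.

```python
import threading

def doacross_wavefront(n, m, num_workers=4):
    # Recurrence with a cross-iteration dependence in both dimensions:
    #   a[i][j] = a[i-1][j] + a[i][j-1],  with a[0][*] = a[*][0] = 1.
    # (For this boundary, a[i][j] equals the binomial coefficient C(i+j, i).)
    a = [[1] * m for _ in range(n)]

    # One folded synchronization counter per worker: the highest column
    # that worker has fully finished. Sync state is bounded by the number
    # of workers, not by the number of iterations.
    done_col = [0] * num_workers
    cv = threading.Condition()
    rows_per = (n + num_workers - 1) // num_workers

    def worker(t):
        lo, hi = t * rows_per, min((t + 1) * rows_per, n)
        for j in range(1, m):
            if t > 0:
                with cv:
                    # Wait until the previous worker has produced column j
                    # of its rows (we read a[lo-1][j] below).
                    cv.wait_for(lambda: done_col[t - 1] >= j)
            for i in range(max(lo, 1), hi):
                a[i][j] = a[i - 1][j] + a[i][j - 1]
            with cv:
                done_col[t] = j  # post: column j of this block is done
                cv.notify_all()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a
```

Workers form a pipeline: worker t can start column j as soon as worker t−1 has finished it, so distinct workers compute different columns concurrently instead of serializing the whole loop nest. This captures the flavor of doacross point-to-point synchronization, but not the paper's actual algorithm, which folds dependence vectors at compile time and bounds synchronization variables by loop-nest depth.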


Keywords: Loop Nest · Runtime System · Iteration Vector · Synchronization Overhead · Automatic Parallelization



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Priya Unnikrishnan (1)
  • Jun Shirako (2)
  • Kit Barton (1)
  • Sanjay Chatterjee (2)
  • Raul Silvera (1)
  • Vivek Sarkar (2)

  1. IBM Toronto Laboratory, Canada
  2. Department of Computer Science, Rice University, USA
