Efficient program partitioning based on compiler controlled communication

  • Ram Subramanian
  • Santosh Pande
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1586)


In this paper, we present an efficient framework for intraprocedural, performance-based program partitioning of sequential loop nests. Due to the limitations of static dependence analysis, especially in the interprocedural sense, many loop nests are identified as sequential even though the task parallelism available among them could potentially be exploited. Since this available parallelism is quite limited, performance-based program analysis and partitioning, which carefully analyzes the interaction between the loop nests and the underlying architectural characteristics, must be undertaken to use it effectively.

We propose a compiler-driven approach that configures the underlying architecture to support a given communication mechanism. We then devise an iterative program partitioning algorithm that generates efficient partitions by analyzing the interaction between the effective cost of communication and the corresponding partitions. We model this problem as one of partitioning a directed acyclic task graph (DAG) in which each node corresponds to a sequential loop nest and each edge denotes both a precedence and the communication arising from data transfer between loop nests. We introduce the concept of behavioral edges between edges and nodes of the task graph to capture the interaction between computation and communication through parametric functions. We present an efficient iterative partitioning algorithm that uses the behavioral-edge-augmented program dependence graph (PDG) to incrementally compute and improve the schedule. A significant performance improvement (a factor of 10 in many cases) is demonstrated by applying our framework to applications that exhibit this type of parallelism.
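The interplay between communication cost and partition choice that the abstract describes can be illustrated with a toy clustering heuristic. The sketch below is a hypothetical example, not the paper's algorithm: it greedily zeroes communication edges of a small task DAG (all node and edge weights are made up), merging two clusters whenever doing so shortens the schedule length, in the spirit of Sarkar-style edge-zeroing clustering.

```python
# Illustrative sketch only: a greedy edge-zeroing heuristic, not the
# paper's algorithm. Task graph, compute costs, and communication
# costs are invented for the example.
comp = {"a": 4, "b": 2, "c": 3, "d": 5}                               # node -> compute cost
edges = {("a", "b"): 6, ("a", "c"): 1, ("b", "d"): 6, ("c", "d"): 2}  # edge -> comm cost
order = ["a", "b", "c", "d"]                                          # a topological order

def makespan(cluster):
    """Schedule length: each cluster is one processor; a task starts once
    its processor is free and all predecessor data has arrived (an edge
    inside a single cluster incurs no communication cost)."""
    finish, proc_free = {}, {}
    for v in order:
        ready = max((finish[u] + (0 if cluster[u] == cluster[v] else c)
                     for (u, w), c in edges.items() if w == v), default=0)
        start = max(ready, proc_free.get(cluster[v], 0))
        finish[v] = start + comp[v]
        proc_free[cluster[v]] = finish[v]
    return max(finish.values())

# Start with one task per cluster, then repeatedly merge the endpoint
# clusters of a heavy edge whenever the merge shortens the schedule.
cluster = {v: i for i, v in enumerate(comp)}
improved = True
while improved:
    improved = False
    for (u, v) in sorted(edges, key=edges.get, reverse=True):
        if cluster[u] == cluster[v]:
            continue
        trial = {w: cluster[u] if cluster[w] == cluster[v] else cluster[w]
                 for w in cluster}
        if makespan(trial) < makespan(cluster):
            cluster, improved = trial, True

print(makespan(cluster))  # 23 before clustering, 14 after
```

For these particular weights the heuristic ends up fusing all four tasks onto one processor, since the communication costs outweigh the limited parallelism; this mirrors the abstract's point that the effective cost of communication, not dependence structure alone, must drive the partitioning.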


Keywords: Scheduling Algorithm · Communication Cost · Loop Nest · Task Graph · Schedule Length





Copyright information

© Springer-Verlag 1999

Authors and Affiliations

  1. Dept. of ECECS, Univ. of Cincinnati, Cincinnati
