Efficient program partitioning based on compiler-controlled communication
In this paper, we present an efficient framework for intraprocedural, performance-based program partitioning of sequential loop nests. Owing to the limitations of static dependence analysis, especially in the interprocedural setting, many loop nests are classified as sequential even though the task parallelism available among them could potentially be exploited. Because this available parallelism is quite limited, performance-based program analysis and partitioning, which carefully analyzes the interaction between the loop nests and the underlying architectural characteristics, must be undertaken to use it effectively.
We propose a compiler-driven approach that configures the underlying architecture to support a given communication mechanism. We then devise an iterative program-partitioning algorithm that generates efficient partitions by analyzing the interaction between the effective cost of communication and the corresponding partitions. We model this problem as one of partitioning a directed acyclic task graph (DAG) in which each node corresponds to a sequential loop nest and each edge denotes both a precedence constraint and the communication arising from data transfer between loop nests. We introduce the concept of behavioral edges between edges and nodes in the task graph to capture the interactions between computation and communication through parametric functions. We present an efficient iterative partitioning algorithm that uses the behavioral-edge-augmented PDG to incrementally compute and improve the schedule. Applying our framework to several applications that exhibit this type of parallelism demonstrates a significant performance improvement (a factor of 10 in many cases).
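The core idea above can be illustrated with a small sketch. The following is not the paper's algorithm, but a minimal, hypothetical rendition of the modeling it describes: loop nests become nodes of a DAG with estimated computation costs, edges carry communication volumes whose effective cost is given by a parametric function (standing in for the behavioral-edge machinery), and an iterative pass greedily merges partitions whenever the estimated schedule length improves. All names, costs, and the makespan estimate are illustrative assumptions.

```python
from collections import defaultdict

class TaskGraph:
    """DAG of sequential loop nests (illustrative stand-in for the paper's task graph)."""
    def __init__(self):
        self.cost = {}                 # node -> estimated computation time
        self.succ = defaultdict(list)  # node -> [(successor, communication volume)]

    def add_task(self, node, comp_cost):
        self.cost[node] = comp_cost

    def add_edge(self, src, dst, volume):
        self.succ[src].append((dst, volume))

def schedule_length(g, partition, comm_fn):
    """Crude makespan estimate: the larger of (a) the DAG critical path, where a
    cross-partition edge adds comm_fn(volume) of communication time, and (b) the
    heaviest per-partition computational load (same-partition tasks serialize)."""
    memo = {}
    def downstream(n):
        if n not in memo:
            tail = 0
            for m, vol in g.succ[n]:
                comm = comm_fn(vol) if partition[n] != partition[m] else 0
                tail = max(tail, comm + downstream(m))
            memo[n] = g.cost[n] + tail
        return memo[n]
    load = defaultdict(int)
    for n, c in g.cost.items():
        load[partition[n]] += c
    return max(max(downstream(n) for n in g.cost), max(load.values()))

def iterative_partition(g, comm_fn):
    """Start with one loop nest per partition; merge the endpoints of a
    cross-partition edge whenever that strictly shortens the schedule."""
    partition = {n: n for n in g.cost}
    improved = True
    while improved:
        improved = False
        for src in g.cost:
            for dst, _ in g.succ[src]:
                if partition[src] == partition[dst]:
                    continue
                trial = {n: (partition[src] if p == partition[dst] else p)
                         for n, p in partition.items()}
                if schedule_length(g, trial, comm_fn) < schedule_length(g, partition, comm_fn):
                    partition, improved = trial, True
    return partition
```

With an expensive communication function the sketch collapses a chain of loop nests onto one processor (trading parallelism for zero communication), while with cheap communication it keeps independent nests apart; this is the kind of cost/partition interaction the iterative algorithm in the paper exploits, though the real framework derives the cost functions from the configured communication mechanism rather than from a hand-supplied lambda.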
Keywords: Schedule Algorithm, Communication Cost, Loop Nest, Task Graph, Schedule Length