The performance impact of granularity control and functional parallelism

  • José E. Moreira
  • Dale Schouten
  • Constantine Polychronopoulos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1033)


Task granularity and functional parallelism are fundamental issues in the optimization of parallel programs. Appropriate granularity for exploitation of parallelism is affected by characteristics of both the program and the execution environment. In this paper we demonstrate the efficacy of dynamic granularity control. The scheme we propose uses dynamic runtime information to select the task size of exploited parallelism at various stages of the execution of a program. We also demonstrate that functional parallelism can be an important factor in improving the performance of parallel programs, both in the presence and absence of loop-level parallelism. Functional parallelism can increase the amount of large-grain parallelism as well as provide finer-grain parallelism that leads to better load balance. Analytical models and benchmark results quantify the impact of granularity control and functional parallelism. The underlying implementation for this research is a low-overhead threads model based on user-level scheduling.
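The abstract's central idea, selecting the task size of exploited parallelism from runtime information, can be sketched minimally. The function below is our own illustration, not the authors' implementation: it spawns parallel tasks only when each task's share of the work exceeds a minimum grain size, and falls back to serial execution when the problem is too small to amortize threading overhead.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers, min_grain=1024):
    """Sum `data`, exploiting parallelism only when each task's
    share of the work exceeds a minimum grain size (element count).
    Illustrative sketch of dynamic granularity control."""
    # Pick a task count so that each task does at least `min_grain`
    # work; a task count of 1 means the grain is too small to pay
    # for thread management, so we run serially.
    ntasks = min(workers, max(1, len(data) // min_grain))
    if ntasks <= 1:
        return sum(data)  # grain too small: serial execution
    chunk = (len(data) + ntasks - 1) // ntasks
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=ntasks) as pool:
        return sum(pool.map(sum, pieces))
```

In a real runtime the threshold would itself be derived from measured overheads and current load rather than a fixed constant; the point is only that the serial/parallel decision is made at run time, per invocation.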


Keywords: dynamic scheduling, functional parallelism, task granularity, parallel processing, threads
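The claim that functional parallelism helps even in the absence of loop-level parallelism can be illustrated with two data-independent program sections run concurrently. The section bodies here (a row sum and a 1-norm) are hypothetical stand-ins chosen for brevity:

```python
from concurrent.futures import ThreadPoolExecutor

def independent_sections(a_rows, vec):
    """Run two data-independent program sections concurrently.
    Illustrative sketch of functional (task) parallelism."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(lambda: [sum(r) for r in a_rows])  # section 1
        f2 = pool.submit(lambda: sum(abs(x) for x in vec))  # section 2
        # No data dependence between the sections, so they may
        # overlap in time; join before using either result.
        return f1.result(), f2.result()
```

Neither section need contain a parallel loop; the speedup comes purely from overlapping distinct functions, which is the large-grain parallelism the paper exploits.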


Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • José E. Moreira (1)
  • Dale Schouten (2)
  • Constantine Polychronopoulos (2)
  1. IBM T. J. Watson Research Center, Yorktown Heights
  2. Center for Supercomputing Research and Development, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana