
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

Published in: International Journal of Parallel Programming

Abstract

This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.




About this article

Cite this article

Nikolopoulos, D.S., Ayguadé, E. & Polychronopoulos, C.D. Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models. International Journal of Parallel Programming 30, 225–255 (2002). https://doi.org/10.1023/A:1019899812171
