
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

Published in: International Journal of Parallel Programming

Abstract

This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.




About this article

Cite this article

Nikolopoulos, D.S., Ayguadé, E. & Polychronopoulos, C.D. Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models. International Journal of Parallel Programming 30, 225–255 (2002). https://doi.org/10.1023/A:1019899812171
