Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport

  • Joseph Antony
  • Pete P. Janes
  • Alistair P. Rendell
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4297)


Modern shared memory multiprocessor systems commonly have non-uniform memory access (NUMA) with asymmetric memory bandwidth and latency characteristics. Operating systems now provide application programmer interfaces allowing the user to perform specific thread and memory placement. To date, however, there have been relatively few detailed assessments of the importance of memory/thread placement for complex applications.

This paper outlines a framework for performing memory and thread placement experiments on Solaris and Linux. Thread binding, location-specific memory allocation, and the verification of both are discussed and contrasted for the two operating systems.
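As a rough illustration of what "thread binding and its verification" can look like on the Linux side, the sketch below pins the caller to a single CPU and then reads the affinity mask back to confirm the binding took effect. It is not the paper's framework: the helper names `bind_to_cpu` and `bind_and_verify` are hypothetical, and the example relies on Python's `os.sched_setaffinity` and `os.sched_getaffinity`, which are available only on platforms that support them (Linux in practice). Location-specific memory allocation and the Solaris equivalents are not covered here.

```python
import os


def bind_to_cpu(cpu):
    """Restrict the caller to a single CPU and return the resulting affinity set.

    Hypothetical helper for illustration. With pid 0 the call applies to the
    caller; on Linux the underlying system call affects the calling thread.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, {cpu})
    return set(os.sched_getaffinity(0))


def bind_and_verify():
    """Bind to the lowest-numbered permitted CPU and verify the binding.

    Returns (target_cpu, verified), where verified is True only if the
    affinity mask read back from the kernel contains exactly the target CPU.
    """
    allowed = sorted(os.sched_getaffinity(0))
    target = allowed[0]
    resulting = bind_to_cpu(target)
    return target, resulting == {target}
```

Reading the mask back, rather than trusting the set call, mirrors the emphasis on verification: a placement experiment is only meaningful if the binding actually requested is the one in force.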

Using the framework, the performance characteristics of serial versions of lmbench, Stream and various BLAS libraries (ATLAS, GOTO, ACML on Opteron/Linux and Sunperf on Opteron, UltraSPARC/Solaris) are measured on two different hardware platforms (UltraSPARC/FirePlane and Opteron/HyperTransport). A simple model describing performance as a function of memory distribution is proposed and assessed for both the Opteron and UltraSPARC.
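The abstract does not state the form of the proposed model, so the following is only an assumed illustration of what "performance as a function of memory distribution" could mean, not the authors' model. It supposes that a fraction of the data resides on each memory node, that each node delivers a fixed bandwidth to the executing thread, and that accesses are serialized, so the time per byte is the fraction-weighted sum of the per-node costs. The function name, parameters, and bandwidth figures are all hypothetical.

```python
def effective_bandwidth(page_fractions, node_bandwidth):
    """Estimate effective bandwidth for data spread across memory nodes.

    Assumed model (illustrative only): time per byte is the sum over nodes of
    (fraction of data on that node) / (bandwidth from that node), and the
    effective bandwidth is the reciprocal of that time.

    page_fractions: dict mapping node id -> fraction of data on that node
                    (fractions must sum to 1).
    node_bandwidth: dict mapping node id -> bandwidth seen by the thread when
                    reading from that node (any consistent unit, e.g. GB/s).
    """
    total = sum(page_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("page fractions must sum to 1")
    time_per_byte = sum(
        fraction / node_bandwidth[node]
        for node, fraction in page_fractions.items()
    )
    return 1.0 / time_per_byte


# Hypothetical numbers: half the data on a local node at 4.0 GB/s,
# half on a remote node at 2.0 GB/s.
mixed = effective_bandwidth({0: 0.5, 1: 0.5}, {0: 4.0, 1: 2.0})
```

Under these assumptions the mixed placement yields 8/3 (about 2.67) GB/s, below the arithmetic mean of the two node bandwidths, because the slower remote accesses dominate the total time.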


Keywords (machine-generated, not supplied by the authors): Application Program Interface, Memory Bandwidth, Memory Allocation, Data Quantity, Virtual Address





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Joseph Antony (1)
  • Pete P. Janes (1)
  • Alistair P. Rendell (1)
  1. Department of Computer Science, Australian National University, Canberra, Australia
