autopin – Automated Optimization of Thread-to-Core Pinning on Multicore Systems

  • Tobias Klug
  • Michael Ott
  • Josef Weidendorfer
  • Carsten Trinitis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6590)


In this paper we present a framework that uses hardware performance counters to automatically detect and apply the best binding between the threads of a running parallel application and the processor cores of a shared-memory system. This is especially important on multicore architectures with shared cache levels. We demonstrate that the runtime of many applications from the SPEC OMP benchmark suite is sensitive to the thread-to-core binding used. In our tests, the proposed framework finds the best binding in nearly all cases. It is intended to supplement job scheduling systems so that they automatically make better use of systems with multicore processors, and to make programmers aware of this issue by providing measurement logs.
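The abstract does not spell out how the best binding is searched for. As a minimal sketch, assuming that each candidate binding is applied in turn, a counter-derived rate is sampled for it, and the binding with the highest rate is kept, the idea could look like the following. The names `Pinning`, `pin_calling_thread`, `pick_best_pinning`, `apply_pinning` and `measure_rate` are hypothetical and not taken from the paper; only `os.sched_setaffinity` is a real call, and it is available on Linux only.

```python
import os
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical representation: Pinning[i] is the core that thread i is bound to.
Pinning = Tuple[int, ...]


def pin_calling_thread(core: int) -> None:
    """Bind the calling thread to a single core (Linux only)."""
    # On Linux the underlying system call treats pid 0 as the calling thread.
    os.sched_setaffinity(0, {core})


def pick_best_pinning(
    candidates: Sequence[Pinning],
    apply_pinning: Callable[[Pinning], None],
    measure_rate: Callable[[], float],
) -> Tuple[Pinning, Dict[Pinning, float]]:
    """Try every candidate pinning, log its measured rate, keep the best.

    apply_pinning and measure_rate are supplied by the caller. This sketch
    assumes measure_rate returns a "higher is better" value, for example
    instructions retired per second read from a hardware counter.
    """
    if not candidates:
        raise ValueError("need at least one candidate pinning")

    log: Dict[Pinning, float] = {}
    for pinning in candidates:
        apply_pinning(pinning)          # rebind the running threads
        log[pinning] = measure_rate()   # sample the counter under this binding

    # Ties resolve to the candidate that was tried first.
    best = max(log, key=log.get)
    apply_pinning(best)                 # leave the winner in place
    return best, log
```

The returned `log` plays the role of the measurement log mentioned above: it records one rate per binding tried, so a programmer can see how strongly the application reacts to placement. A real tool would also need a warm-up phase after each rebinding before sampling, which this sketch leaves out.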


Multicore, CMP, automatic performance optimization, hardware performance counters, CPU binding, thread placement





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Tobias Klug (1)
  • Michael Ott (1)
  • Josef Weidendorfer (1)
  • Carsten Trinitis (1)
  1. Lehrstuhl für Rechnertechnik und Rechnerorganisation / Parallelrechnerarchitektur (LRR/TUM), Technische Universität München, Garching bei München, Germany
