A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors

  • Benjamín Sahelices
  • Pablo Ibáñez
  • Víctor Viñals
  • J. M. Llabería
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5704)


Understanding and optimizing the synchronization operations of parallel programs in distributed shared memory (DSM) multiprocessors is one of the most important factors leading to significant reductions in execution time.

This paper introduces a new methodology for tuning the performance of parallel programs. We focus on the critical sections used to ensure exclusive access to critical resources and data structures, and propose a specific dynamic characterization of every critical section in order to (a) measure the lock contention, (b) measure the degree of data sharing in consecutive executions, and (c) break down the execution time, reflecting the different overheads that can appear. All the required measurements are taken using a multiprocessor simulator with a detailed timing model of the processor and memory system.
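As a rough illustration of what a per-critical-section characterization might record, the sketch below wraps a lock and counts contended acquisitions (a) and splits time into waiting and holding (c). It is a software-level analogue using Python threads and wall-clock timers, not the simulator-based measurement the paper describes, and it does not cover the data-sharing measurement (b). The names `ProfiledLock`, `CriticalSectionStats`, and `run_demo` are hypothetical.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class CriticalSectionStats:
    """Per-lock counters (hypothetical names, not taken from the paper)."""
    acquisitions: int = 0
    contended: int = 0      # acquisitions that found the lock already held
    wait_time: float = 0.0  # seconds spent waiting to enter
    hold_time: float = 0.0  # seconds spent inside the critical section

    @property
    def contention_ratio(self) -> float:
        return self.contended / self.acquisitions if self.acquisitions else 0.0


class ProfiledLock:
    """A lock wrapper that records contention and a wait/hold time breakdown."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.stats = CriticalSectionStats()
        self._entered_at = 0.0

    def __enter__(self) -> "ProfiledLock":
        start = time.perf_counter()
        # Try without blocking first: failure means another thread holds the lock.
        contended = not self._lock.acquire(blocking=False)
        if contended:
            self._lock.acquire()
        now = time.perf_counter()
        # Counters are updated while the lock is held, so they need no extra lock.
        self.stats.acquisitions += 1
        self.stats.contended += int(contended)
        self.stats.wait_time += now - start
        self._entered_at = now
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.stats.hold_time += time.perf_counter() - self._entered_at
        self._lock.release()
        return False


def run_demo(n_threads: int = 4, iterations: int = 1000):
    """Run several threads through one critical section and return its stats."""
    lock = ProfiledLock()
    shared = {"counter": 0}

    def worker() -> None:
        for _ in range(iterations):
            with lock:
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"], lock.stats


if __name__ == "__main__":
    total, stats = run_demo()
    print(total, stats.acquisitions, f"{stats.contention_ratio:.2%}")
```

The measured contention ratio and times depend on the thread scheduler, so they vary between runs; a simulator with a detailed timing model avoids that perturbation.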

We also propose a static classification of critical sections that takes into account how locks are associated with their protected data. The dynamic characterization and the static classification are correlated to identify key critical sections and infer code optimization opportunities (e.g. data layout), which, when applied, can lead to significant reductions in execution time (up to 33% in the SPLASH-2 scientific benchmark suite). By using the simulator we can also evaluate whether the performance of the applied code optimizations is sensitive to common hardware optimizations.
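The abstract names data layout only as an example of an optimization opportunity and does not say which layout changes were applied. As one common illustration, the sketch below compares an array of lock-plus-data records stored back to back with a version padded so that each record occupies its own cache line. The 64-byte line size, the struct names, and the helper `nodes_on_first_line` are assumptions made for this example.

```python
import ctypes

CACHE_LINE = 64  # assumed line size in bytes; the real value is machine-specific


class PackedNode(ctypes.Structure):
    """A lock stored next to the data it protects, with no padding."""
    _fields_ = [("lock", ctypes.c_int32), ("value", ctypes.c_int64)]


class PaddedNode(ctypes.Structure):
    """The same fields, padded so that each node fills one whole cache line."""
    _fields_ = [
        ("lock", ctypes.c_int32),
        ("value", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(PackedNode))),
    ]


def nodes_on_first_line(node_type, count: int = 8) -> int:
    """Count array elements that overlap cache line 0, assuming a line-aligned array."""
    size = ctypes.sizeof(node_type)
    return sum(1 for i in range(count) if i * size < CACHE_LINE)


if __name__ == "__main__":
    for node_type in (PackedNode, PaddedNode):
        print(node_type.__name__, ctypes.sizeof(node_type), nodes_on_first_line(node_type))
```

With the packed layout several independent locks land on the same line, so processors contending for different locks still invalidate each other's copies. The padded layout trades memory for one lock and its data per line.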


Keywords: Execution Time; Critical Section; Baseline System; Normalize Execution Time; Data Layout
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Benjamín Sahelices (1)
  • Pablo Ibáñez (2)
  • Víctor Viñals (2)
  • J. M. Llabería (3)
  1. Depto. de Informática, Univ. de Valladolid, Spain
  2. Depto. de Informática e Ing. de Sistemas, I3A and HiPEAC, Univ. de Zaragoza, Spain
  3. Depto. de Arquitectura de Computadores, Univ. Polit. de Cataluña, Spain
