Performance Tuning of x86 OpenMP Codes with MAQAO

  • Denis Barthou
  • Andres Charif Rubial
  • William Jalby
  • Souad Koliai
  • Cédric Valensi
Conference paper


Failing to find the best optimization sequence for a given application can lead the compiler to generate code that performs poorly or is otherwise inappropriate. Improving on the compilation process therefore requires analyzing performance at the level of the generated assembly code. This paper presents MAQAO, a tool for the performance analysis of multithreaded codes (only OpenMP programs are supported at the moment). MAQAO relies on static performance evaluation to identify compiler optimizations and assess the performance of loops. It uses static binary rewriting to read and instrument object files or executables. Static binary instrumentation allows probes to be inserted at the instruction level. Memory accesses can be captured to help tune the code, but such traces need to be compressed. MAQAO can analyze the results and provide hints for tuning the code. We show on several examples how this can help users improve their OpenMP applications.
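To illustrate why captured memory traces lend themselves to compression, the following minimal sketch folds an address trace into constant-stride runs, as a loop walking an array would produce. It is not MAQAO's actual compression scheme, which the abstract does not describe; the names `compress_trace`, `expand_runs`, and the `(base, stride, count)` run format are assumptions made for this example only.

```python
from typing import Iterable, List, Tuple

# Hypothetical run format (an assumption for this sketch, not MAQAO's):
# (base_address, stride_in_bytes, number_of_accesses).
Run = Tuple[int, int, int]


def compress_trace(addresses: Iterable[int]) -> List[Run]:
    """Greedily fold an address trace into constant-stride runs (lossless)."""
    runs: List[Run] = []
    for addr in addresses:
        if runs:
            base, stride, count = runs[-1]
            if count == 1:
                # A run holding one address absorbs the next one,
                # which fixes the stride of that run.
                runs[-1] = (base, addr - base, 2)
                continue
            if addr == base + stride * count:
                # The address continues the current stride pattern.
                runs[-1] = (base, stride, count + 1)
                continue
        # Pattern broken (or first address): start a new run.
        runs.append((addr, 0, 1))
    return runs


def expand_runs(runs: Iterable[Run]) -> List[int]:
    """Rebuild the original address trace from its runs."""
    return [base + stride * i for base, stride, count in runs for i in range(count)]


if __name__ == "__main__":
    # 1000 accesses at an 8-byte stride, then 10 accesses at a 64-byte stride.
    trace = [0x1000 + 8 * i for i in range(1000)] + [0x8000 + 64 * i for i in range(10)]
    runs = compress_trace(trace)
    print(len(trace), "addresses folded into", len(runs), "runs")
```

Under these assumptions, a loop issuing 1000 accesses at a fixed 8-byte stride collapses to a single triple, while an irregular access pattern degrades to short runs and compresses poorly.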


Memory Access, Cache Line, Memory Hierarchy, Performance Tune, Assembly Code





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Denis Barthou (1)
  • Andres Charif Rubial
  • William Jalby
  • Souad Koliai
  • Cédric Valensi
  1. LaBRI/INRIA, University of Bordeaux, Bordeaux, France
