Piecewise Holistic Autotuning of Compiler and Runtime Parameters

  • Mihail Popov
  • Chadi Akel
  • William Jalby
  • Pablo de Oliveira Castro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


The complexity of current architectures requires fine-tuning of compiler and runtime parameters to reach full performance potential. Autotuning substantially improves on default parameters in many scenarios, but it is a costly process that requires a long iterative evaluation.

We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program.

Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop.
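The two ideas in this paragraph can be illustrated with a minimal sketch. It is not CERE's implementation or API: the function names, the greedy tolerance-based grouping (a simple stand-in for whatever clustering CERE actually uses), the configuration labels, and all timings are hypothetical and invented for illustration. The sketch also assumes that region timings are independent of one another and add up to the program time, which is a simplification.

```python
# Illustrative sketch only: every name and number here is hypothetical,
# not CERE's API and not a measurement from the paper.

def group_invocations(runtimes, rel_tol=0.10):
    """Greedily group invocation indices whose runtimes lie within
    rel_tol of the fastest member of the current group."""
    order = sorted(range(len(runtimes)), key=lambda i: runtimes[i])
    groups = []
    for i in order:
        if groups and runtimes[i] <= runtimes[groups[-1][0]] * (1 + rel_tol):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def estimate_total(runtimes, groups):
    """Estimate total time from one representative per group,
    weighted by the number of invocations in that group."""
    return sum(runtimes[g[0]] * len(g) for g in groups)

def best_monolithic(timings):
    """Pick the single configuration that minimizes the summed time
    of all regions (whole-program tuning)."""
    configs = list(next(iter(timings.values())))
    totals = {c: sum(region[c] for region in timings.values()) for c in configs}
    best = min(totals, key=totals.get)
    return best, totals[best]

def best_piecewise(timings):
    """Pick the best configuration separately for each region."""
    choice = {name: min(region, key=region.get) for name, region in timings.items()}
    total = sum(timings[name][cfg] for name, cfg in choice.items())
    return choice, total

# Made-up runtimes (seconds) of six invocations of one region.
invocation_runtimes = [1.0, 1.05, 5.0, 0.98, 5.2, 1.02]
groups = group_invocations(invocation_runtimes)
estimated = estimate_total(invocation_runtimes, groups)

# Made-up per-region timings (seconds) under three configurations.
timings = {
    "loop_a":   {"O2_16t": 4.0, "O3_16t": 3.0, "O3_8t": 3.5},
    "region_b": {"O2_16t": 2.0, "O3_16t": 2.5, "O3_8t": 1.5},
    "loop_c":   {"O2_16t": 1.0, "O3_16t": 1.2, "O3_8t": 1.4},
}
mono_config, mono_total = best_monolithic(timings)
choice, piece_total = best_piecewise(timings)

if __name__ == "__main__":
    print("groups:", groups, "estimated total:", estimated)
    print("monolithic:", mono_config, mono_total)
    print("piecewise:", choice, piece_total)
```

With these invented numbers, only two of the six invocations need to be replayed, and the per-region choice beats the best single configuration because each region prefers a different one.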

We demonstrate CERE tuning of compiler optimizations, number of threads, and thread affinity on a NUMA architecture. On average over the NAS 3.0 benchmarks, we achieve a speedup of 1.08× after tuning. Tuning a single codelet is 13× cheaper than whole-program evaluation and estimates the tuning impact on the original region with 94.7% accuracy. On a Reverse Time Migration (RTM) proto-application, we achieve a 1.11× speedup with a 200× cheaper exploration.


Keywords: Compiler Optimization, NUMA Node, Reverse Time Migration, Stencil Computation, Monolithic Approach



The research leading to these results has received funding under the Mont-Blanc project from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 671697.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Mihail Popov (1)
  • Chadi Akel (2)
  • William Jalby (1)
  • Pablo de Oliveira Castro (1)

  1. Université de Versailles Saint-Quentin-en-Yvelines, Université Paris-Saclay, Versailles, France
  2. Exascale Computing Research, Versailles, France
