A multi-aspect online tuning framework for HPC applications

Abstract

Developing software applications for high-performance computing (HPC) requires careful optimization targeting a myriad of increasingly complex, highly interrelated software, hardware and system components. The demand to minimize energy consumption on extreme-scale HPC systems, and the associated shift towards heterogeneous architectures, add yet another level of complexity to program development and optimization. As a result, software developers wishing to fully exploit HPC resources often see the optimization process as daunting, cumbersome and time-consuming. To address these challenges, we have developed the Periscope Tuning Framework (PTF), an online, integrated automatic tuning framework that combines performance analysis and performance tuning with respect to the many tuning parameters available to today’s software developer on modern HPC systems. This work introduces the architecture, tuning model and main infrastructure components of PTF, as well as its main tuning plugins and their evaluation.

Notes

  1. PTF web site: http://periscope.in.tum.de.

  2. The software environment on SuperMUC comprised the Intel Compiler 14, Parallel Environment 1.3, and OS SLE11 SP3. Details on SuperMUC can be found at: https://www.lrz.de/services/compute/supermuc/systemdescription

  3. Due to the large number of flags, exhaustive search has not been used. It would have required over 27000 experiments.

  4. Governors are processor policies to change frequency.

Acknowledgments

This work was supported by the European Commission FP7 project AutoTune under grant no. 288038.

Author information

Corresponding author

Correspondence to Eduardo César.

About this article

Cite this article

Gerndt, M., Benkner, S., César, E. et al. A multi-aspect online tuning framework for HPC applications. Software Qual J 26, 1063–1096 (2018). https://doi.org/10.1007/s11219-017-9370-x

Keywords

  • Automatic performance tuning
  • High-performance computing
  • Performance optimization
  • Parallel architectures
  • Energy tuning
  • OpenCL