
Automatic Performance Modeling of HPC Applications

  • Felix Wolf
  • Christian Bischof
  • Alexandru Calotoiu (corresponding author)
  • Torsten Hoefler
  • Christian Iwainsky
  • Grzegorz Kwasniewski
  • Bernd Mohr
  • Sergei Shudler
  • Alexandre Strube
  • Andreas Vogel
  • Gabriel Wittum
Conference paper
Part of the Lecture Notes in Computational Science and Engineering book series (LNCSE, volume 113)

Abstract

Many existing applications suffer from inherent scalability limitations that will prevent them from running at exascale. Current tuning practices, which rely on diagnostic experiments, have drawbacks because (i) they detect scalability problems relatively late in the development process when major effort has already been invested into an inadequate solution and (ii) they incur the extra cost of potentially numerous full-scale experiments. Analytical performance models, in contrast, allow application developers to address performance issues already during the design or prototyping phase. Unfortunately, the difficulties of creating such models combined with the lack of appropriate tool support still render performance modeling an esoteric discipline mastered only by a relatively small community of experts. This article summarizes the results of the Catwalk project, which aimed to create tools that automate key activities of the performance modeling process, making this powerful methodology accessible to a wider audience of HPC application developers.
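To illustrate the kind of activity such tools automate, the following is a minimal sketch (not the paper's actual method or the Catwalk implementation) of empirical model fitting: given runtimes measured at a few small process counts, it searches a simplified space of human-readable scaling terms of the form p^i * log2(p)^j and selects the best least-squares fit. The measurement values, the candidate exponents, and the single-term model form are illustrative assumptions only.

```python
import itertools
import numpy as np

# Illustrative mean runtimes (seconds) at increasing process counts; in practice
# these would come from a small number of benchmark runs at modest scale.
procs = np.array([64, 128, 256, 512, 1024], dtype=float)
runtime = np.array([1.02, 1.11, 1.23, 1.38, 1.55])

# Candidate single-term models t(p) = c0 + c1 * p^i * log2(p)^j, a simplified
# stand-in for the richer model normal forms used by automated modeling tools.
exponents = [0.0, 0.25, 0.5, 1.0, 2.0]
log_powers = [0, 1, 2]

best = None
for i, j in itertools.product(exponents, log_powers):
    if i == 0.0 and j == 0:
        continue  # a constant-only model carries no scaling information
    term = procs**i * np.log2(procs)**j
    X = np.column_stack([np.ones_like(procs), term])
    coeffs, *_ = np.linalg.lstsq(X, runtime, rcond=None)
    rss = float(np.sum((X @ coeffs - runtime) ** 2))  # residual sum of squares
    if best is None or rss < best[0]:
        best = (rss, i, j, coeffs)

rss, i, j, (c0, c1) = best
print(f"best fit: t(p) = {c0:.3f} + {c1:.3g} * p^{i} * log2(p)^{j}  (RSS={rss:.2e})")
```

A real workflow would additionally guard against overfitting (e.g., by cross-validation across the measured scales) and consider multi-term models, but the basic idea of fitting and ranking candidate analytical scaling functions is the same.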

Keywords

Performance Model, Scalability Issue, Match Classification, Analytical Performance Model, Scalability Framework


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Felix Wolf (1)
  • Christian Bischof (1)
  • Alexandru Calotoiu (1, corresponding author)
  • Torsten Hoefler (2)
  • Christian Iwainsky (1)
  • Grzegorz Kwasniewski (2)
  • Bernd Mohr (3)
  • Sergei Shudler (1)
  • Alexandre Strube (3)
  • Andreas Vogel (4)
  • Gabriel Wittum (4)

  1. Technische Universität Darmstadt, Darmstadt, Germany
  2. ETH Zurich, Zurich, Switzerland
  3. Jülich Supercomputing Center, Jülich, Germany
  4. Goethe Universität Frankfurt, Frankfurt, Germany
