Performance Modeling of Gyrokinetic Toroidal Simulations for a Many-Tasking Runtime System

  • Matthew Anderson
  • Maciej Brodowicz
  • Abhishek Kulkarni
  • Thomas Sterling
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8551)

Abstract

Conventional programming practices on multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific computing. One possible solution for improving efficiency and scalability in applications on this class of machines is a many-tasking runtime system employing many lightweight, concurrent threads. Yet the potential performance and scalability impact of such runtime systems on existing applications developed around the bulk synchronous parallel (BSP) model is difficult to estimate a priori. In this work, we present a case study of a BSP particle-in-cell benchmark code which has been ported to a many-tasking runtime system. The 3-D Gyrokinetic Toroidal Code (GTC) is examined in its original MPI form and compared with a port to the High Performance ParalleX 3 (HPX-3) runtime system. Phase overlap, oversubscription behavior, and work rebalancing in the implementation are explored. Results for GTC using the SST/macro simulator complement the implementation results. Finally, an analytic performance model for GTC is presented to guide future implementation efforts.
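The contrast the abstract draws between a BSP decomposition and an oversubscribed many-task decomposition can be sketched in a toy form. The snippet below is illustrative only and is not the paper's GTC or HPX-3 code; the `push` kernel, the worker and task counts, and the particle list are all hypothetical stand-ins. A BSP-style run gives each worker one coarse chunk with an implicit barrier at the end of the phase, while the many-tasking style creates far more small tasks than workers so the scheduler can rebalance uneven work.

```python
# Illustrative sketch (hypothetical names, not the paper's code):
# BSP-style coarse partition vs. oversubscribed many-task partition
# of a toy particle "push" phase.
from concurrent.futures import ThreadPoolExecutor

def push(chunk):
    # Toy stand-in for a particle-in-cell push kernel: advance positions.
    return [x + 0.1 for x in chunk]

particles = [float(i) for i in range(1000)]

# BSP style: one coarse chunk per worker; pool.map() completes all
# chunks before the next phase begins, acting as the phase barrier.
WORKERS = 4
size = len(particles) // WORKERS
coarse = [particles[i * size:(i + 1) * size] for i in range(WORKERS)]

# Many-tasking style: oversubscribe with many small tasks so idle
# workers can pick up remaining work when chunk costs are uneven.
TASKS = 64
fine = [particles[i::TASKS] for i in range(TASKS)]

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    bsp_result = [p for chunk in pool.map(push, coarse) for p in chunk]
    mtt_result = [p for chunk in pool.map(push, fine) for p in chunk]

# Both decompositions compute the same physics; only scheduling differs.
assert sorted(bsp_result) == sorted(mtt_result)
```

In a real many-tasking runtime such as HPX-3 the tasks are lightweight user-level threads rather than pool jobs, which is what makes oversubscription by large factors affordable; the sketch only shows the decomposition pattern, not that cost difference.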

Keywords

Performance modeling · ParalleX · Many-tasking runtime systems



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Matthew Anderson¹
  • Maciej Brodowicz¹
  • Abhishek Kulkarni¹
  • Thomas Sterling¹

  1. School of Informatics and Computing, Center for Research in Extreme Scale Technologies, Indiana University, Bloomington, Indiana
