Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-core Architectures

  • Luka Stanisic
  • Samuel Thibault
  • Arnaud Legrand
  • Brice Videau
  • Jean-François Méhaut
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)


Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far rely on task-based runtimes that abstract the machine and use opportunistic scheduling algorithms. As a consequence, the problem shifts to choosing the task granularity, the task graph structure, and the scheduling strategies to optimize. Exploring the many combinations of these alternatives is itself a challenge. Indeed, getting accurate measurements requires reserving the target system for the whole duration of experiments. Furthermore, observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. This approach yields performance predictions accurate to within a few percent on classical dense linear algebra kernels in a matter of seconds, allowing both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs.


Keywords: Computation Kernel · Multicore Architecture · Multiple GPUs · Dense Linear Algebra · Task Granularity



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Luka Stanisic (1)
  • Samuel Thibault (2)
  • Arnaud Legrand (1)
  • Brice Videau (1)
  • Jean-François Méhaut (1)
  1. CNRS, Inria, University of Grenoble, France
  2. University of Bordeaux, Inria, France
