Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-core Architectures
Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging as it requires to carefully offload computations and manage data movements between the different processing units. The most promising and successful approaches so far rely on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity, task graph structure, and optimizing the scheduling strategies. Trying different combinations of these different alternatives is also itself a challenge. Indeed, getting accurate measurements requires reserving the target system for the whole duration of experiments. Furthermore, observations are limited to the few available systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, over SimGrid, a versatile simulator for distributed systems. This approach allows to obtain performance predictions accurate within a few percents on classical dense linear algebra kernels in a matter of seconds, which allows both runtime and application designers to quickly decide which optimization to enable or whether it is worth investing in higher-end GPUs or not.
KeywordsComputation Kernel Multicore Architecture Multiple GPUs Dense Linear Algebra Task Granularity
Unable to display preview. Download preview PDF.
- 2.Augonnet, C., Thibault, S., Namyst, R.: Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures. In: Lin, H.-X., Alexander, M., Forsell, M., Knüpfer, A., Prodan, R., Sousa, L., Streit, A. (eds.) Euro-Par 2009 Workshops. LNCS, vol. 6043, pp. 56–65. Springer, Heidelberg (2010)Google Scholar
- 5.Bakhoda, A., Yuan, G.L., Fung, W.W.L., Wong, H., Aamodt, T.M.: Analyzing CUDA workloads using a detailed GPU simulator. In: ISPASS, pp. 163–174 (2009)Google Scholar
- 6.Bedaride, P., Degomme, A., Genaud, S., Legrand, A., Markomanolis, G., Quinson, M., Stillwell, L.M., Suter, F., Videau, B.: Toward better simulation of mpi applications on ethernet/tcp networks. In: 4th International Workshop on Performance Modeling, Benchmarking and Simulation of HPC Systems (PMBS) (November 2013)Google Scholar
- 7.Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J.: DAGuE: A Generic Distributed DAG Engine for High Performance Computing. In: IEEE International Symposium on Parallel and Distributed Processing, pp. 1151–1158. IEEE Computer Society (2011)Google Scholar
- 8.Casanova, H., Legrand, A., Quinson, M.: SimGrid: A Generic Framework for Large-Scale Distributed Experiments. In: Proceedings of the 10th IEEE International Conference on Computer Modeling and Simulation (UKSim) (April 2008)Google Scholar
- 9.Collange, S., Daumas, M., Defour, D., Parello, D.: Barra: A Parallel Functional Simulator for GPGPU. In: IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication, pp. 351–360 (2010)Google Scholar
- 11.Ubal, R., Jang, B., Mistry, P., Schaa, D., Kaeli, D.: Multi2Sim: A Simulation Framework for CPU-GPU Computing. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT 2012, pp. 335–344. ACM, New York (2012)Google Scholar
- 12.Velho, P., Schnorr, L., Casanova, H., Legrand, A.: On the validity of flow-level TCP network models for grid and cloud simulations. ACM Transactions on Modeling and Computer Simulation 23(3) (October 2013)Google Scholar
- 13.Companion of the StarPU+SimGrid article. Hosted on Figshare (2014), http://dx.doi.org/10.6084/m9.figshare.928095, online version of this article with access to the experimental data and scripts (in the org source)