Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures

  • Raphaël Bleuse
  • Thierry Gautier
  • João V. F. Lima
  • Grégory Mounié
  • Denis Trystram
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)


Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad hoc methods to reach that balance, but these are architecture and application dependent. We propose here a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which first groups tasks by affinity and then runs a fast dual approximation. We ran experiments on a heterogeneous parallel machine with twelve CPU cores and eight NVIDIA Fermi GPUs. We ported three standard dense linear algebra kernels from the PLASMA library on top of the XKaapi runtime system and report their performance. Both HEFT and DADA perform well under various experimental conditions, but DADA scales better with problem size and number of GPUs and, in most cases, transfers far less data than HEFT to achieve the same performance.
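The dual approximation idea underlying DADA (guess a makespan λ, run a cheap feasibility test, and binary-search on λ, in the spirit of Hochbaum and Shmoys) can be sketched as follows. This is an illustrative sketch only: the greedy fill rule, task times, and tolerance are assumptions, not the paper's algorithm.

```python
def feasible(tasks, n_cpu, n_gpu, lam):
    """Feasibility test for a makespan guess `lam`.

    `tasks` is a list of (cpu_time, gpu_time) pairs. Tasks that fit only
    on one resource class are forced there; the rest are packed onto the
    GPUs by decreasing acceleration ratio until the GPU budget is spent.
    (Hypothetical fill rule, for illustration only.)
    """
    cpu_load = gpu_load = 0.0
    for cpu_t, gpu_t in sorted(tasks, key=lambda t: t[0] / t[1], reverse=True):
        if cpu_t > lam and gpu_t > lam:
            return False            # the task fits nowhere: lam is too small
        if cpu_t > lam:
            gpu_load += gpu_t       # forced onto a GPU
        elif gpu_t > lam:
            cpu_load += cpu_t       # forced onto a CPU
        elif gpu_load + gpu_t <= n_gpu * lam:
            gpu_load += gpu_t       # high-ratio tasks fill GPUs first
        else:
            cpu_load += cpu_t
    return cpu_load <= n_cpu * lam and gpu_load <= n_gpu * lam


def dual_approx(tasks, n_cpu, n_gpu, tol=1e-3):
    """Binary search on the makespan guess, as in dual approximation schemes."""
    lo, hi = 0.0, sum(min(c, g) for c, g in tasks)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(tasks, n_cpu, n_gpu, mid):
            hi = mid
        else:
            lo = mid
    return hi                       # smallest guess the greedy test accepts
```

For instance, with two GPU-friendly tasks of times `(4.0, 1.0)` and one neutral task `(2.0, 2.0)` on one CPU and one GPU, the search settles near a makespan of 2. DADA additionally groups tasks by affinity before this step so that the assignment also reduces data transfers, which the sketch above does not model.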


Keywords: heterogeneous architectures, scheduling, cost models, dual approximation scheme, programming tools, affinity



References
  1. Agullo, E., Augonnet, C., Dongarra, J., Faverge, M., Langou, J., Ltaief, H., Tomov, S.: LU factorization for accelerator-based systems. In: IEEE/ACS AICCSA 2011, pp. 217–224. IEEE Computer Society, Washington, DC (2011)
  2. Agullo, E., Augonnet, C., Dongarra, J., Faverge, M., Ltaief, H., Thibault, S., Tomov, S.: QR factorization on a multicore node enhanced with multiple GPU accelerators. In: IEEE IPDPS, USA (2011)
  3. Augonnet, C., Thibault, S., Namyst, R.: Automatic calibration of performance models on heterogeneous multicore architectures. In: Lin, H.-X., Alexander, M., Forsell, M., Knüpfer, A., Prodan, R., Sousa, L., Streit, A. (eds.) Euro-Par 2009 Workshops. LNCS, vol. 6043, pp. 56–65. Springer, Heidelberg (2010)
  4. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23(2), 187–198 (2011)
  5. Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J.: DAGuE: a generic distributed DAG engine for high performance computing. Parallel Computing 38(1–2), 37–51 (2012)
  6. Bueno, J., Planas, J., Duran, A., Badia, R.M., Martorell, X., Ayguadé, E., Labarta, J.: Productive programming of GPU clusters with OmpSs. In: IEEE IPDPS (2012)
  7. Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35(1), 38–53 (2009)
  8. Gautier, T., Besseron, X., Pigeon, L.: KAAPI: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: PASCO 2007. ACM, London (2007)
  9. Gautier, T., Lima, J.V., Maillard, N., Raffin, B.: XKaapi: a runtime system for data-flow task programming on heterogeneous architectures. In: IEEE IPDPS, pp. 1299–1308 (2013)
  10. Hermann, E., Raffin, B., Faure, F., Gautier, T., Allard, J.: Multi-GPU and multi-CPU parallelization for interactive physics simulations. In: D'Ambra, P., Guarracino, M., Talia, D. (eds.) Euro-Par 2010, Part II. LNCS, vol. 6272, pp. 235–246. Springer, Heidelberg (2010)
  11. Hochbaum, D.S., Shmoys, D.B.: Using dual approximation algorithms for scheduling problems: theoretical and practical results. J. ACM 34(1), 144–162 (1987)
  12. Kedad-Sidhoum, S., Monna, F., Mounié, G., Trystram, D.: Scheduling independent tasks on multi-cores with GPU accelerators. In: an Mey, D., et al. (eds.) Euro-Par 2013. LNCS, vol. 8374, pp. 228–237. Springer, Heidelberg (2014)
  13. Lima, J.V.F., Gautier, T., Maillard, N., Danjean, V.: Exploiting concurrent GPU operations for efficient work stealing on multi-GPUs. In: 24th SBAC-PAD, pp. 75–82. IEEE, New York (2012)
  14. Song, F., Dongarra, J.: A scalable framework for heterogeneous GPU-based clusters. In: ACM SPAA, pp. 91–100. ACM, New York (2012)
  15. Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing 36(5–6), 232–240 (2010)
  16. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst. 13(3), 260–274 (2002)
  17. YarKhan, A., Kurzak, J., Dongarra, J.: QUARK users' guide: QUeueing And Runtime for Kernels. Tech. Rep. ICL-UT-11-02, University of Tennessee (2011)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Raphaël Bleuse (1)
  • Thierry Gautier (2)
  • João V. F. Lima (4)
  • Grégory Mounié (1)
  • Denis Trystram (1, 3)

  1. Univ. Grenoble Alpes, France
  2. Inria Rhône-Alpes, France
  3. Institut universitaire de France, France
  4. Universidade Federal de Santa Maria (UFSM), Brazil
