The Journal of Supercomputing, Volume 71, Issue 12, pp 4646–4662

Performance-aware composition framework for GPU-based systems

  • Usman Dastgeer
  • Christoph Kessler


User-level application components can be made performance-aware by annotating them with performance models and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, each of which may expose multiple implementation variants. The framework addresses the composition problem in an integrated manner, with the ability to perform global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection both when performance characteristics are known (or learned) beforehand and when they are learned at runtime. We also demonstrate the hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls together with data-flow information extracted from the source program by static analysis. Bulk composition improves over the traditional greedy performance-aware policy, which considers only the current call for optimization.
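The core idea of performance-aware implementation selection can be sketched as follows. This is a minimal illustration, not the framework's actual API: the variant names, cost-model shape (a fixed overhead plus a per-element cost), and coefficients are all hypothetical, chosen only to show how a greedy composer picks the variant with the lowest predicted cost for each call.

```python
# Sketch of greedy performance-aware implementation selection: each
# component exposes several implementation variants, each annotated with
# a (hypothetical) cost model mapping problem size to predicted runtime.

class Variant:
    def __init__(self, name, fixed_overhead, cost_per_element):
        self.name = name
        self.fixed_overhead = fixed_overhead      # e.g. kernel launch / data transfer
        self.cost_per_element = cost_per_element  # per-element compute cost

    def predicted_cost(self, n):
        return self.fixed_overhead + self.cost_per_element * n

def select_variant(variants, n):
    """Greedy per-call selection: cheapest predicted variant for size n."""
    return min(variants, key=lambda v: v.predicted_cost(n))

# Illustrative models: the GPU variant pays a large fixed overhead but is
# much cheaper per element, so it wins only for large problem sizes.
variants = [
    Variant("cpu_seq", fixed_overhead=0.0,   cost_per_element=1.0),
    Variant("gpu",     fixed_overhead=500.0, cost_per_element=0.1),
]

print(select_variant(variants, 100).name)    # → cpu_seq (100 < 510)
print(select_variant(variants, 10000).name)  # → gpu (1500 < 10000)
```

The bulk composition technique described above goes beyond this per-call greedy rule: by looking at a window of upcoming calls and the data flow between them, it can, for example, keep operands resident on the GPU across calls instead of paying the transfer overhead on every invocation.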


Keywords: Global composition · Implementation selection · Hybrid execution · GPU-based systems · Performance portability



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer and Information Science, Linköping University, Linköping, Sweden
