The Journal of Supercomputing, Volume 71, Issue 12, pp 4646–4662

Performance-aware composition framework for GPU-based systems

Authors

  • Usman Dastgeer, Department of Computer and Information Science, Linköping University
  • Christoph Kessler, Department of Computer and Information Science, Linköping University

DOI: 10.1007/s11227-014-1105-1

Cite this article as:
Dastgeer, U. & Kessler, C. J Supercomput (2015) 71: 4646. doi:10.1007/s11227-014-1105-1

Abstract

User-level components of applications can be made performance-aware by annotating them with performance models and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition, including implementation selection both when performance characteristics are known (or learned) beforehand and when they are learned at runtime. We also demonstrate the hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. Bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.

Keywords

Global composition · Implementation selection · Hybrid execution · GPU-based systems · Performance portability

1 Introduction

In recent years, GPU-based systems with disjoint physical memory have become popular in mainstream computing. The GPU compute device(s) present in these systems can be used for many computations that were otherwise performed on CPU devices. For example, several implementations for sorting numbers are available on both CPU and GPU devices. With GPUs getting more general-purpose and better programmable every day, more and more computations are facing this choice. Choosing which implementation to use for a given execution context is known as the implementation selection problem.1

There exist multiple aspects of this implementation selection problem. Some applications can rely on performance models calibrated via earlier executions [1] (or have analytical models [2]) while others require online learning with no previously calibrated performance models [3]. Also, an application can have more than one type of (componentized) computation, with arbitrary data dependency and resource state flow that can result in more complex scheduling and selection tradeoffs. A greedy, locally optimal implementation selection, in this case, may result in a sub-optimal global solution. Furthermore, instead of selecting which implementation to use on which computing device, some computations may benefit from the possibility of simultaneously using multiple (or all) computing devices (CPUs and GPUs) by dividing the computation work into parts and processing them in parallel on different devices. Data parallel computations can especially benefit from this hybrid execution capability.

In this work, we propose an integrated Global Composition Framework (GCF) that aims at addressing the implementation selection problem for heterogeneous systems in a variety of contexts. To the best of our knowledge, our framework is the first implementation selection framework for modern GPU-based systems that targets both local and global composition while relying on existing well-established programming models (OpenMP, PThreads, CUDA, OpenCL etc.) rather than proposing any new programming model (like [4–6]). Specifically, we make the following contributions:
  • An integrated Global Composition Framework (GCF) based on the ROSE source-to-source compiler that can do performance-aware composition for systems in an automated and portable manner.

  • A component model that defines the notion of component, interface and implementation variants along with specifications of annotations for extra meta-data.

  • A generic performance modeling API that enables usage of both analytical and empirical performance models while supporting online feedback about runtime execution information to the composition framework.

  • A global bulk scheduling heuristic for composing multiple component calls constrained in a data dependency chain.

  • Evaluation of different composition capabilities of our GCF framework with real applications taken from RODINIA benchmarks and other sources.

Although discussed for GPU-based systems in this paper, both the designed framework and the component model are generic and can be used wherever implementation selection is needed, e.g. on homogeneous CMP systems, and can easily be extended to other heterogeneous systems in the future.

This paper is structured as follows: Sect. 2 describes central concepts of the GCF component model. Section 3 describes the proposed framework in detail. Section 4 presents an evaluation of the developed framework with different applications. Section 5 describes bulk composition; some related work is listed in Sect. 6. Section 7 concludes the work.

2 GCF component model

2.1 Components, interfaces and implementation variants

A GCF component consists of an interface that describes a functionality and multiple implementations or implementation variants of that interface that actually implement the functionality. Both interface and implementations have attached meta-data which provides extra information. We use pragmas and comments to represent meta-data, because pragmas are flexible and maintainable during code evolution. The component model is currently realized in C/C++.

In C/C++, an interface is represented by a function declaration that describes the functionality. If not specified otherwise, the function name becomes the name of the interface. The meta-information for an interface includes:
  • Access mode of each parameter (read/written inside the component).

  • Relationship between different parameters (e.g., one parameter may describe the size of another parameter).

  • Performance model associated with the interface.

A component implementation constitutes an implementation of functionality promised by the component interface. Several implementations may exist for the same functionality (interface) by different algorithms or for different execution platforms; also, further component implementation variants may be generated automatically from a common source module, e.g. by special compiler transformations or by instantiating or binding tunable parameters. These variants differ by their resource requirements and performance behavior, and thereby become alternative choices for composition whenever the (interface) function is called. A component implementation targeting a particular execution platform assumes its operand data to be present in the memory address space associated with that particular execution platform. In order to prepare and guide variant selection, component implementations need to expose their relevant meta-data explicitly, such as:
  • The execution platform (e.g. CPU, GPU) it targets.

  • The programming environment it uses (e.g. C++, OpenMP, CUDA, OpenCL).

  • A performance model for this implementation variant.

  • Specific requirements about the target platform and execution context that must be satisfied for its successful execution (selectability conditions).

We use source comments and pragmas to represent interface and implementation meta-data annotations respectively (see Fig. 1). The performance model can be specified at both interface and implementation-variant level. A performance model specified for an interface will be shared by all variants of that interface and will supersede individual performance models of any variant if they exist. As we will see later, the ability to specify performance models at different levels gives us the flexibility to address different application needs.
Fig. 1 Syntax of source code annotations for interface and implementation meta-data
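As a concrete illustration in the spirit of Fig. 1, the sketch below annotates a hypothetical vector_scale component. The annotation keywords and pragma syntax are illustrative assumptions and do not reproduce the framework's actual notation; only the general scheme (comments for interface meta-data, pragmas for implementation meta-data) follows the text above.

    // Hypothetical interface declaration with GCF-style comment annotations.
    // Keywords (access modes, size_of, perfmodel, ...) are illustrative only.
    /* gcf interface vector_scale
       param v : readwrite            // array updated in place
       param n : read, size_of(v)     // n gives the number of elements in v
       param f : read                 // scale factor, passed by value
       perfmodel : exact_entry        // optional interface-level performance model
    */
    void vector_scale(float *v, int n, float f);

    // Hypothetical CUDA implementation variant with pragma meta-data.
    #pragma gcf implementation vector_scale platform(GPU) environment(CUDA) \
            requires(n >= 1024)       // selectability condition (illustrative)
    void vector_scale_cuda(float *v, int n, float f);

    // Hypothetical OpenMP implementation variant.
    #pragma gcf implementation vector_scale platform(CPU) environment(OpenMP)
    void vector_scale_openmp(float *v, int n, float f);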

2.2 Composition technique

Composition is the selection of a specific implementation variant for a call to component-provided functionality and the allocation of resources for its execution. It is made context-aware for performance optimization and can be done in one or several stages. For example, by static composition, preselection of a subset of appropriate implementation variants can be made depending on the statically available information about target system and call context. The final selection among the remaining component variants is then done at runtime (dynamic composition).

Composition points are restricted to calls on general-purpose execution units (with access to main memory) only. Consequently, all component implementations using hardware accelerators such as GPUs must be wrapped in CPU code containing a platform-specific call to the accelerator. A GCF component call is non-preemptive and may be translated to a task for a runtime system. Components are stateless; however, the parameter data that they operate on do have state. The composition points can also be annotated with extra information relevant to the call context. This includes information about constraints on the execution (e.g., execution must be carried out on a GPU) or on the call properties (e.g., limit on problem size). Figure 2(left) shows annotation syntax for a component call.
Fig. 2 Syntax of source code annotations for component calls and data accesses respectively

As mentioned earlier, a component implementation in the GCF component model assumes operand data to be present in the default memory associated with its execution platform. Whether and when to perform the possibly required operand data transfer is not hardcoded but exposed to the composition framework. Factoring out the GPU data transfers from a GPU implementation yields two important benefits. First, it avoids the overhead of data transfers for each component invocation, which can have a serious impact on performance [7]. The data transfers can thus be optimized across multiple component invocations. Secondly, it enables further optimizations of data transfers (e.g., overlapping communication with computation) without requiring any changes in the implementation code. This however requires runtime support for (in-time) data transfers to ensure that data is available in the right address space when an implementation gets called. Our framework can automatically identify places in the source code to insert calls to the runtime data handling API so that data is only communicated when really needed, while ensuring data consistency and correctness of execution (more on this in the next section). As a fallback option, we provide annotations [Fig. 2(right)] to assist the data access analyzer in cases where data accesses cannot be accurately determined by our framework.

3 Global composition framework

Our framework (see Fig. 3) consists of five major parts:
  1. A component repository manager (gcfmanager) with support both for registering new components (interfaces, variants, performance models) and for managing already registered components.

  2. A component tree builder (builder) that, based on the ROSE compiler [8], parses the application source code Abstract Syntax Tree (AST) with one or more component calls and analyses data accesses for component operand data to build a component tree.

  3. A composer which parses the component tree along with the program control flow obtained via the ROSE compiler to generate the composition code. As a component call is made by interface name, a wrapper (proxy) function is generated to intercept the call and internally delegate it to a certain implementation variant. The composer can generate code both for the GCF runtime library, which we have developed in this work, and for the StarPU runtime system [9].

  4. A performance modeling API that specifies a generic and extensible interaction interface between the performance models and the runtime library.

  5. The GCF runtime library that can do performance-aware composition by using the performance modeling API along with data handling for component operand data.

Fig. 3 The GCF framework, its different parts and their interconnections are shown inside the blue box, together with the process of composition for an input program and the corresponding modified output source file
In the following we explain each part in more detail.

3.1 Componentization and component repository

For practical usage, a component model must be complemented with a mechanism for managing component information in an intuitive and scalable manner. We provide the tool gcfmanager that can be used to manage the component repository. It has a simple interface and can be used to register and manage both interfaces and implementation variants. It internally stores information about registered interfaces and implementation variants at a specific location specified at installation time.

Registering a new interface is done by specifying a path to a file containing the interface declaration with annotations marked as comments. For example, the following command registers an interface of a component from a smoothed particle hydrodynamics (SPH) application:

[gcfmanager interface registration command; shown as an image in the original]

where the interface declaration in the header file looks as follows:

[annotated interface declaration; shown as an image in the original]

Similarly, an implementation of the interface is registered in the following way:

[gcfmanager implementation registration command; shown as an image in the original]

where the implementation definition in the file looks as follows:

[annotated implementation definition; shown as an image in the original]

3.2 Component tree builder

As shown in Fig. 3, the input program AST contains one or more calls to GCF components which operate on certain operand data. As implementation variants assume that operand data reside in their memory address space, certain data handling must be carried out to ensure that data is available at the right place at the right time. As data handling is carried out by the runtime system, we need to place data handling calls for the runtime system at appropriate places in the source code. We achieve this by analyzing the program control and data flow using the ROSE compiler. We have written a custom pass that provides this functionality; it works in the following manner.

First, all component calls in the application code are collected and a list is maintained of all variables that are used (as operands) inside one or more component calls. The only exception is scalar variables that are passed by value (e.g., the size variable in the input code in Fig. 3). The data access analysis is then carried out for each variable in the list. We record data creation, accesses (read and/or written in non-component code) and usage inside a component call. Currently, data usage for array variables is recorded at the variable level. This means that, for example, a read access to a certain array element is considered a read operation on the complete array. This simplifies the data flow analysis and the data handling code that is inserted [10]. We analyze data accesses in a recursive manner2 for variables that are passed by reference or pointer to non-component function calls by tracing their usage inside that function body. The analysis carried out is conservative to ensure program correctness. However, as the output of our framework is a modified source file with extra C/C++ source code added, the user can verify and detect cases of imprecise analysis. If required, the user can assist data analysis via annotations [see Fig. 2(right)].3 As of now, data usages are correctly tracked for regular C data types including pointers as well as any combination of C composite types using structures (e.g., structures with array member variables, arrays of structures etc.). For composite types, if a specific member variable is used inside a component call, the access can be tracked separately for that member variable only. For an array member variable, data usage is tracked at the array level, as said previously.

At this stage, all data access operations and component calls are recorded in a tree representation, which makes data flow analysis and optimizations simple to carry out. Using the ROSE compiler, data accesses recorded in the tree can be traced back to their location in the program control flow graph. In the next stage, this tree is complemented by the program control flow graph to insert data handling code at appropriate places in the program control flow.

3.3 Composer

The composer takes as input the component tree generated by the component tree builder along with the program control flow graph. By mapping operations listed in the component tree to their location in the program control flow graph, it generates the actual composition code for a runtime system. Currently, we support composition using either our own GCF runtime library or the StarPU runtime system [9]. The choice of runtime system is specified by the user as a simple switch4 when doing the composition. Recursive components are not supported yet, partly because StarPU does not support them.
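To illustrate the kind of glue code the composer emits, the following is a hypothetical sketch of a generated wrapper (proxy) for the vector_scale interface used as an example earlier. All gcf:: runtime calls and names are assumptions for illustration and do not reproduce the actual generated code.

    // Sketch of a generated proxy that intercepts a call made by interface name
    // and delegates it to a selected implementation variant (names hypothetical).
    void vector_scale(float *v, int n, float f)
    {
        int ctx[1] = { n };                          // performance-relevant call context
        gcf::Variant *impl = gcf::select_variant("vector_scale", ctx);

        if (impl->platform() == gcf::GPU)
            gcf::ensure_on_device(v, n * sizeof(float));  // in-time transfer if needed
        else
            gcf::ensure_on_host(v, n * sizeof(float));

        double t0 = gcf::now();
        impl->call(v, n, f);                         // invoke e.g. CUDA or OpenMP variant
        gcf::report(impl, ctx, gcf::now() - t0);     // feed measured time back (addEntry)
    }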

At this stage, certain optimizations are carried out on the component tree (i.e., not in the actual source code) with the help of the program control flow graph to simplify the process of placing data handling code. For example, if two write accesses to the same variable occur in the same (or nested) scope without any component call in between, they are considered as one write access. Similarly, if a read access is followed by a write access to the same variable with no component call in between, we can consider them a single read-write access instead.

Finding appropriate places for the data handling code becomes quite tricky in a program with complex control flow. For instance, the data accesses and component calls can be inside conditional and/or loop constructs. The code must be placed in a way that preserves program correctness. The composer applies the following rules when placing the data handling code:
  • Calls for registering5 and unregistering a variable to the runtime system are placed at the same scope level. In case component calls and data accesses span over multiple non-nested scopes, calls can be placed at their immediate common parent scope.

  • Placing data handling code inside a conditional statement is avoided, when possible, to simplify execution flow. When not possible, the code may need to be replicated across different control flows (e.g., if-else statement).

  • Placing code inside loop constructs is avoided, when possible, to optimize data transfers. For example, if a variable is read inside a loop and no component call using that data is present inside the same (or nested) loop, the necessary code to ensure that the data is available for reading is placed just before the loop to avoid the overhead of executing that code in each loop iteration (see the sketch after this list).
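As an illustration of the last rule, the following sketch shows a data-availability call hoisted out of a loop that only reads component operand data; gcf_acquire_read is a hypothetical name standing in for the actual data handling API.

    // Before composition (user code): 'result' was last written by a GPU component call.
    // for (int i = 0; i < n; ++i)
    //     sum += result[i];          // CPU-side read in every iteration

    // After composition (sketch): the data-availability call is placed once,
    // just before the loop, instead of inside it.
    gcf_acquire_read(result);         // hypothetical: copy back from GPU memory if needed
    for (int i = 0; i < n; ++i)
        sum += result[i];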

In the following, we briefly describe the two currently supported runtime environments for which the composer can generate the composition code.

StarPU [9] is a C-based unified runtime system for heterogeneous multicore GPU-based systems. It supports the explicit notion of CPU and GPU workers, a data management API as well as multiple scheduling policies to decide which implementation to execute on which computing device for a given execution context. It can use information about the runtime workload of each worker to do dynamic load balancing and can effectively use multiple CPU and GPU devices in parallel. We use this capability of StarPU to support hybrid execution in our framework.

GCF runtime library To get more control over runtime decisions, we have developed a lightweight C++-based runtime library. The library is designed for performance-aware implementation selection where selection is mainly concerned with which implementation variant to use on which execution platform. The main design goals of this runtime library are to be:
  • light-weight: we want to reduce the runtime overhead to the minimum possible extent. Our library considers all CPUs as one combined worker, instead of creating one CPU worker for each CPU as StarPU does.

  • configurable: The library uses a configurable performance modeling API which allows it to be used for applications with different kinds of performance models. The user can specify performance models (empirical, analytical etc.) to control implementation selection in an effective manner.

There is naturally some overlap in the functionality offered by the two runtimes; however, there are several major differences. The GCF runtime library supports a configurable performance modeling API with multiple types of performance models available; it also allows a user to easily plug in any new performance model with his/her components. Also, the GCF library mainly targets performance-aware implementation selection. This is more of an offloading decision, i.e., making a choice between using the CPU (e.g. OpenMP) or the GPU (e.g. CUDA) rather than simultaneous usage of both as employed by StarPU.

3.4 Performance modelling API

Considering that no single performance model can suit every application need, we have designed a generic API to handle the interaction between a performance model and the runtime library, using C++ classes and polymorphism. Listing 1 shows the abstract performance model (PerfModel) interface from which all concrete performance models must inherit. The performance-relevant call context properties are represented by an array of integers. This allows modeling any number and kind of performance-relevant properties (e.g., problem size(s), sparsity structure, degree of sortedness of data etc.). Careful selection of performance-relevant properties is important to achieve good prediction accuracy. The PerfModel interface can be used to specify either of the following two kinds of performance models:
  1. A per-variant performance model targets a single implementation variant. In this kind of performance model, the prediction function predict returns a floating-point cost value. A common example of such a performance model is the historical-execution-based performance model used by the StarPU runtime system [1]. For now, we are concerned with performance in terms of execution time, so the return value is the predicted execution time. However, this could easily be used for other performance objectives such as energy consumption or code size.

  2. A per-interface performance model is common to all variants implementing an interface. This kind of model is at a higher granularity level as the performance model internally handles the choice between different implementations. In this case, the prediction function predict returns the name of the expected best implementation, managing the complete selection process internally. Many performance models fit here, including models based upon decision trees, SVMs, and the one presented in [11].
When comparing different variants for implementation selection, the potential data transfer cost must also be considered besides the computation cost predicted for each variant. For a CUDA implementation variant, the potential data transfer cost is added to its computation cost if the input data is not already present in the target GPU memory. The same is done for CPU or OpenMP implementations if the most up-to-date data resides in some GPU memory. The implementation selection can be configured in two modes. (1) In pessimistic mode, a component call is considered a standalone call. The data transfer cost for a variant is estimated assuming that data must be made available in the main memory after the component call. For a GPU variant, this could mean that both the cost of transferring input operands from main memory to GPU memory (if not present already) and the cost of transferring modified data back to the main memory must be added to its computation cost. (2) In optimistic mode, the data transfer cost is estimated by assuming that the component call is one of multiple component calls using that data. The data transfer cost, in this case, is only estimated for input operand data. The pessimistic and optimistic modes are suitable for applications with a single or with multiple component calls using a certain data item, respectively. The mode can be determined automatically by the framework via its component tree and control flow analysis.
[Listing 1: the abstract PerfModel interface; shown as an image in the original]
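Since the listing is only available as an image, the following is a minimal sketch of what the PerfModel interface could look like. The method names predict, addEntry and close and the integer-array call context come from the text above; the exact signatures, parameter names and return conventions are assumptions.

    // Sketch of the abstract performance model interface (signatures assumed).
    class PerfModel {
    public:
        // ctx: performance-relevant call context properties (e.g., problem sizes),
        // nProps: number of such properties.
        // Per-variant models return a predicted cost (execution time); per-interface
        // models instead select the expected best variant internally (not shown here).
        virtual double predict(const int *ctx, int nProps) = 0;

        // Feedback of the actually measured cost for runtime (re)calibration.
        virtual void addEntry(const int *ctx, int nProps, double measuredCost) = 0;

        // Called at program termination; allows persisting calibration data.
        virtual void close() = 0;

        virtual ~PerfModel() {}
    };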
We have provided a simple library that can estimate the data transfer cost given the number of bytes and the source and destination memory units. For each architecture, the library builds a simple linear model using the latency (latency) and the average cost of transferring one byte (costPerByte), determined by actual measurements on that architecture. Afterwards, it predicts the data transfer time between two memory units \(a\), \(b\) for \(B\) bytes in direction \(a \rightarrow b\) by:
$$\begin{aligned} \mathrm{latency}[a][b] + \mathrm{costPerByte}[a][b] * B. \end{aligned}$$
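This linear model translates directly into code; the MemUnit enumeration and array names below are illustrative assumptions, not the library's actual API.

    // Linear data transfer cost model: latency plus per-byte cost, both measured
    // once per architecture for each (source, destination) pair of memory units.
    enum MemUnit { MAIN_MEM = 0, GPU0_MEM = 1, NUM_MEM_UNITS = 2 };

    static double latency[NUM_MEM_UNITS][NUM_MEM_UNITS];      // seconds
    static double costPerByte[NUM_MEM_UNITS][NUM_MEM_UNITS];  // seconds per byte

    // Predicted time for transferring B bytes from memory unit a to memory unit b.
    double transferCost(MemUnit a, MemUnit b, size_t B)
    {
        return latency[a][b] + costPerByte[a][b] * (double)B;
    }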
The framework not only uses the performance model to do prediction (using the predict method) but also feeds the actual measured cost value (execution time in our case) back to the performance model using the addEntry method. This enables runtime calibration where the performance model can learn/improve and predict during the same program execution. One example is online learning where the performance model does not have any previous calibration data available and thus learns and predicts at runtime. The performance models can internally control switching between calibration and prediction mode. One key feature of the API is its genericity as it does not require an empirical model and can work for, e.g., analytical models such as [2].

The process works as follows: during the initialization phase, the framework creates references to the performance model for each component interface/variant used inside the application. The performance model can internally load previously calibrated information if any exists. During the actual program execution, the performance model is used to predict the best variant, and the actual execution data is fed back to the performance model. At termination, the framework calls the close method for each performance model. This allows a performance model to persist calibration information for future usage. Currently, the following four performance models are implemented in our framework. All models can learn and predict both within a program execution and across different program executions.

Exact entry model This performance model targets a single variant and is designed for numerical computations and other applications where a few combinations of performance-relevant properties recur over different executions. The performance model is similar to the history-based performance model in StarPU. It predicts the execution time if an exact match is found in the calibration data recorded earlier. Otherwise it switches to calibration mode to gather information for possible future predictions.

Euclidean normalized model This performance model also targets a single variant and is more flexible (but usually less accurate [1]) than the previous model, as it can predict for intermediate points based on measured points using the Euclidean distance, where the Euclidean distance between any two points \(a\) and \(b\) is the length of the line segment connecting them (\(\overline{ab}\)). The performance-relevant properties are normalized to the same scale before calculating the Euclidean distance.
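For illustration only, the following sketch assumes a simple nearest-measured-point lookup over normalized properties; the actual interpolation used by the Euclidean normalized model may differ.

    #include <cmath>
    #include <limits>
    #include <vector>

    // One calibration sample: normalized property values and the measured time.
    struct Sample { std::vector<double> props; double time; };

    // Predict by returning the measurement closest (in Euclidean distance over
    // normalized properties) to the query point. Sketch only; the real model
    // may interpolate between neighbouring samples instead.
    double predictEuclidean(const std::vector<Sample> &samples,
                            const std::vector<double> &query)
    {
        double bestDist = std::numeric_limits<double>::max();
        double bestTime = 0.0;
        for (const Sample &s : samples) {
            double d = 0.0;
            for (size_t i = 0; i < query.size(); ++i)
                d += (s.props[i] - query[i]) * (s.props[i] - query[i]);
            d = std::sqrt(d);
            if (d < bestDist) { bestDist = d; bestTime = s.time; }
        }
        return bestTime;
    }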

Convex model This model is at the interface level and thus internally manages the selection between different variants. It is based on the model from Kicherer et al. [11]. The model targets execution contexts with limited or no repetition in the performance-relevant properties (i.e., almost every execution has a different value for the performance-relevant properties). When no calibration data is found, it starts calibration by trying out different variants in a round-robin fashion and collecting their execution times. When a reasonable amount of performance data is collected, it builds the initial performance model in the following way: considering that the measured points6 for different variants may differ, it takes the union of all recorded points for the different variants. It then interpolates the measured points for each variant to get potential cost values for points missing from the union. The optimal variant at each point in the union is then recorded by comparing the potential cost values (along with data transfer costs) of each variant at that point. This model is further optimized by making a convexity assumption: if a single variant is found best for two neighboring points, then the same variant is predicted to be best for intermediate points [12].

Generic model This model does not have any inherent semantics but rather relies on the user to provide an actual implementation. It can be used to plug in any performance model by specifying function pointers for the predict, addEntry and close methods. Two variations of this model are designed for plugging in per-interface and per-variant performance models, respectively.

4 Evaluation

For evaluation, we implemented seven applications from the RODINIA benchmark suite [13], two scientific kernels (sgemm and spmv), and several other applications from different domains (image processing, fluid dynamics etc.). Two evaluation platforms are used: System A with Xeon® E5520 CPUs running at 2.27 GHz and one NVIDIA® C2050 GPU with L1/L2 cache support, and System B with Xeon® X5550 CPUs running at 2.67 GHz and a lower-end GPU (NVIDIA® C1060); System B is used for showing performance portability.

Implementation selection Figure 4 shows execution of several applications over multiple problem instances on two different GPU-based systems. The main decision here is to choose the implementation that performs better in a given execution context (application, architecture and problem instances etc.). As shown in the figure, different implementations can perform better for the same application on a given architecture but at different problem sizes (e.g. pathfinder, convolution) or the choice can be different between different architectures for the same application and problem sizes (e.g. nw.b). The performance-aware implementation selection effectively adjusts to these differences without requiring any modifications in the user code.
Fig. 4 Execution times for different applications over multiple problem instances with CUDA, OpenMP and our Tool-Generated Performance-Aware (TGPA) code that uses the runtime library to do implementation selection on both platforms (System A, System B). The (exact entry) performance models were calibrated by earlier executions. The baseline (normalized time 1.0) is the faster of the two (OpenMP, CUDA)

Online learning There exist applications where the same operation is applied multiple times over different operand data. Examples of such bulk operations include applying sorting, image filters, or compression/de-compression over multiple files in a loop. In these applications, performance models can be learned at runtime [3, 11]. Our framework provides support for online feedback, which allows performance models to be learned/improved during a program execution. In Table 1, we show online learning results for sorting and image convolution applications over 2,000 files and 500 images, respectively, with random sizes on both systems. The overhead of online learning is included in the measurements. As the problem sizes are different and non-repetitive, we use the convex model. In both applications, we can see that our tool-generated performance-aware code can effectively learn completely online, without any previous performance data being available, on both systems, and can perform up to 20 % better than the best performing single version.
Table 1
Online learning—sorting (2,000 files), image convolution (500 images)

                          System A               System B
                          Time (ms)   Rel. time  Time (ms)   Rel. time
Quick sort CPU             3,444.96   3.45        7,271.36   4.20
Radix sort CUDA            1,134.25   1.20        1,961.77   1.13
TGPA (sorting)               997.24   1.00        1,732.73   1.00
Convolution OpenMP        12,276.79   2.75        6,951.62   1.74
Convolution CUDA           4,997.54   1.12        4,598.67   1.15
TGPA (convolution)         4,467.44   1.00        3,990.18   1.00

Hybrid execution An implementation variant of a component can internally exploit parallelism, e.g., by using OpenMP or CUDA. However, the implementation is still bound to either CPUs or GPUs. For certain computations, more parallelism can be spawned from a single component invocation by partitioning the work into several chunks that can all be processed concurrently, possibly on different devices by different implementation variants. This feature is implemented in our framework for data-parallel computations where the final result can be produced either by simple concatenation or by a simple reduction (using plus, max etc. operators) of the intermediate output results produced by each sub-part (e.g. blocked matrix multiplication, dotproduct). When applicable, this feature can be used in a transparent manner (controlled using the partition option in the interface annotations, see Fig. 1), as it does not require any modification in the user code and/or component implementations. Figure 5 shows how this feature can help in achieving better performance, for different applications, than using either of the OpenMP and CUDA backends alone. More than 100 % performance gain over the best variant is achieved for some applications on both systems. This speedup is achievable because sharing the computation work of a GPU with the multicore CPUs not only divides the computation but also reduces the communication overhead associated with GPU execution [7].
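For illustration, the sketch below shows the idea of hybrid execution for a dot product: the call is split into a GPU chunk and a CPU chunk whose partial results are combined by a plus reduction. The split ratio and the variant entry points are hypothetical, and in the framework the partitioning and scheduling are handled by the StarPU runtime rather than written by hand.

    // Hypothetical implementation variants of the dotproduct component.
    double dotproduct_cuda(const double *x, const double *y, int n);    // GPU variant
    double dotproduct_openmp(const double *x, const double *y, int n);  // CPU variant

    // Conceptual sketch: one data-parallel call split into a GPU chunk and a CPU
    // chunk that execute concurrently; the partial results are combined by a plus
    // reduction. The 0.7 split ratio is illustrative only.
    double dotproduct_hybrid(const double *x, const double *y, int n)
    {
        int nGpu = (int)(0.7 * n);      // portion given to the GPU variant
        int nCpu = n - nGpu;            // remainder processed on the CPUs

        double partGpu = 0.0, partCpu = 0.0;
        #pragma omp parallel sections
        {
            #pragma omp section
            partGpu = dotproduct_cuda(x, y, nGpu);
            #pragma omp section
            partCpu = dotproduct_openmp(x + nGpu, y + nGpu, nCpu);
        }
        return partGpu + partCpu;       // simple plus reduction of the sub-results
    }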
Fig. 5 Performance benefit of hybrid execution by using our TGPA code (hybrid) with respect to OpenMP and CUDA implementation variants on System A and B respectively. This feature is enabled using the StarPU runtime system. The baseline is hybrid execution

5 Bulk composition

The heterogeneous earliest finish time (HEFT) scheduler [14] used by both runtime systems considers one component call at a time and selects an implementation that would reduce the overall program time span. In practice, this can prove sub-optimal in the presence of multiple component calls operating on the same operand data in a sequence (e.g., read after write dependency). As shown in Fig. 6(left), when considering just the first call alone, it appears better to execute it on the CPU (OpenMP) because of the data transfer cost to the GPU. However, considering the upcoming component call operating on the same data, going to the GPU from the start, although locally sub-optimal, is the better decision overall.

In our framework, we propose and implement a bulk heuristic scheduler for such sequences of component calls that are constrained in a data dependency chain. The bulk scheduler considers all component calls inside a sequence as one scheduling unit and schedules all of them on the device that results in the shortest overall execution time. A sketch of this decision rule is given below, followed by two example scenarios.
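The following is a minimal sketch of the bulk decision rule for a two-device system; the Call type and the cost helpers are illustrative assumptions standing in for the framework's performance models and data transfer cost estimates.

    #include <vector>

    struct Call { /* component call descriptor; contents are illustrative */ };

    enum Device { CPU_OPENMP = 0, GPU_CUDA = 1, NUM_DEVICES = 2 };

    // Illustrative helpers standing in for the performance models and the data
    // transfer cost library described in Sect. 3.4.
    double predictedCost(const Call &c, Device d);
    double initialTransferCost(const std::vector<Call> &chain, Device d);

    // Bulk decision: estimate the cost of running the whole dependence-constrained
    // call chain on each device, including the operand transfers needed to move the
    // data there once, and schedule all calls on the cheaper device.
    Device chooseBulkDevice(const std::vector<Call> &chain)
    {
        double total[NUM_DEVICES];
        for (int d = 0; d < NUM_DEVICES; ++d) {
            total[d] = initialTransferCost(chain, (Device)d);
            for (const Call &c : chain)
                total[d] += predictedCost(c, (Device)d);
        }
        return (total[CPU_OPENMP] <= total[GPU_CUDA]) ? CPU_OPENMP : GPU_CUDA;
    }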

Figure 6(right) shows a code portion of a Runge–Kutta ODE solver from the LibSolve library [15] containing multiple component calls with data dependencies between them. Finding an optimal solution with two implementations (OpenMP, CUDA) would require evaluating \(128\) different possible combinations at runtime, which can result in a major overhead. Bulk composition, for a given execution, considers only the two estimates of executing all calls with either OpenMP or CUDA and selects the better one. By using bulk composition, as shown in Fig. 7(left), we were able to perform better than the pure local scheduling policy that considers a single component call at a time. The figure also shows that the choice between using OpenMP and CUDA changes over different problem sizes, to which the bulk composition effectively adjusted.
Fig. 6 Left an execution scenario showing sub-optimal execution of the HEFT greedy scheduler for data-dependent component calls. Right pseudo-code for the ODE solver calls

Fig. 7 Difference between TGPA bulk (global) composition and TGPA HEFT (local) composition for ODE solver (left) and DAXPY (right) component calls [System A]

Similarly, Fig. 7(right) shows the execution of a BLAS Level 1 DAXPY component call (with an OpenMP and a CUDA implementation) executing inside a loop over the same data. Considering one component call at a time, with operand data initially placed in the main memory, the HEFT scheduler prefers the OpenMP implementation over the CUDA implementation because the data transfer overhead to GPU memory, in this case, outweighs the potential computational advantage of GPU execution. The bulk composition can make a better decision by considering the data transfer cost amortized over multiple executions of the component call. As the decision is made at runtime, the loop iteration count need not be known statically.

6 Related work

Kicherer et al. [11, 16] consider implementation selection (no hybrid execution) for a single call, with GPU data transfers being part of the CUDA variants, which can have serious performance implications. PetaBricks [4], Merge [5] and Elastic Computing [6] propose or rely on a unified programming model/API for programming. In contrast, we do not propose any new programming model but rather support established programming models/APIs (OpenMP, PThreads, CUDA, OpenCL etc.) and can easily support new programming models in the future. In PEPPHER [17], we have also studied implementation selection for a single component call by using XML to represent metadata (i.e. no program analysis) and the StarPU runtime system. The current framework provides much richer capabilities for doing (both local and global) composition by using the program analysis capabilities of the ROSE compiler and a light-weight runtime library coupled with a flexible performance modeling API.

Recently, many directive-based approaches [18, 19] for GPU programming have been introduced, targeting automatic generation of GPU code from annotated source code. This is different from and somewhat complementary to our work, as we consider the composition problem among different implementation variants rather than automatic code generation for GPUs. Context-aware composition [20, 21] is a global optimization technique where each implementation variant specifies its execution constraints (resource requirements etc.) and the composition system tries to efficiently map component calls to the given system resources. It relies on performance models provided by the programmer and statically computes dispatch tables for different execution contexts using interleaved dynamic programming. In the GPU-based systems that we target, such scheduling decisions can be made efficiently with the help of a runtime system (StarPU etc.) by doing asynchronous component executions.

7 Conclusions

We have presented a component model and a global composition framework that address the implementation selection problem in an integrated manner. We have shown their effectiveness for GPU-based systems with the help of several applications, by showing: (1) automatic implementation selection for a given execution context, ranging from cases with pre-calibrated models to pure online learning; (2) hybrid execution capabilities for data-parallel computations, giving up to two times performance improvement for some applications over the best performing variant; and (3) bulk scheduling and its effectiveness with a set of different component calls with data dependencies between them.

Although considered for GPU-based systems in this work, both the proposed component model and the framework (performance modeling API etc.) are generic and can be adapted to other implementation selection contexts.

Footnotes
1

In this article, we use the terms implementation, implementation variant and variant interchangeably.

 
2

We consider a recursive function as a special scenario to avoid combinatorial explosion of the solution space.

 
3

We have not encountered any such scenario yet in any application that we have ported to our framework.

 
4

By default, the system generates composition code for our own GCF runtime library. The user can set the -starpu switch to generate code for the StarPU runtime system.

 
5

Registering a variable to the runtime system creates a unique data handle (with information about size, memory address etc.) for that data in the runtime system which can be used for controlling its state and data transfers.

 
6

A point represents a single execution with certain performance relevant properties.

 

Copyright information

© Springer Science+Business Media New York 2014