Journal of Signal Processing Systems

Volume 70, Issue 2, pp 177–191

Integration of Dataflow-Based Heterogeneous Multiprocessor Scheduling Techniques in GNU Radio

Authors

  • George F. Zaki
    • Department of Electrical and Computer Engineering, University of Maryland
  • William Plishker
    • Department of Electrical and Computer Engineering, University of Maryland
  • Shuvra S. Bhattacharyya
    • Department of Electrical and Computer Engineering, University of Maryland
  • Charles Clancy
    • Bradley Department of Electrical and Computer Engineering, Virginia Tech
  • John Kuykendall
    • Laboratory for Telecommunications Sciences
Article

DOI: 10.1007/s11265-012-0696-0

Cite this article as:
Zaki, G.F., Plishker, W., Bhattacharyya, S.S. et al. J Sign Process Syst (2013) 70: 177. doi:10.1007/s11265-012-0696-0

Abstract

As the variety of off-the-shelf processors expands, traditional implementation methods for digital signal processing and communication systems are no longer adequate to achieve design objectives in a timely manner. Designers need to easily track changes in computing platforms and exploit them efficiently, while reusing legacy code and optimized libraries that target specialized features in individual processing units. In this context, we propose an integration workflow to schedule and implement Software Defined Radio (SDR) protocols that are developed using the GNU Radio environment on heterogeneous multiprocessor platforms. We show how to utilize the Single Instruction Multiple Data (SIMD) units provided in Graphics Processing Units (GPUs) along with vector accelerators implemented in General Purpose Processors (GPPs). We augment a popular SDR framework (i.e., GNU Radio) with a library that seamlessly allows offloading of algorithm kernels mapped to the GPU without changing the original protocol description. Experimental results show how our approach can be used to efficiently explore design spaces for SDR system implementation, and examine the overhead of the integrated backend (software component) library.

Keywords

Design methodology · Software defined radio · Graphics processing unit · Multiprocessor scheduling · GNU Radio

1 Introduction

In recent years, we have witnessed rapid growth in the computational capacity of processors for fixed and floating point arithmetic. This has allowed radio tasks that could once be implemented only in dedicated analog circuits, analog/digital ASICs, or FPGA logic to be realized in software. Additionally, numerous modern wireless communication standards have chosen to exploit the low cost of computational resources and have begun to significantly drive up the complexity of waveforms in order to achieve improved spectral efficiency and coverage. Software defined radio avoids many of these problems by reusing computing resources at a finer-grained level. This reuse is achieved by implementing all of the necessary signal processing routines and flow graphs in software [17, 20, 29].

Classical systems development for SDR has targeted single core processors. Following Moore’s law and the necessity to limit the power dissipated on a single chip, performance gains in these processing platforms have more recently come from increasing the number of cores on a single die rather than from increasing the frequency of a single core. This can effectively increase the computational horsepower on-chip without adversely affecting power consumption.

Because SDR applications exhibit different levels of parallelism and performance demands, many of these nascent multicore architectures are potentially suitable implementation targets. However, design and programming difficulties have inhibited the adoption of specialized multicore off-the-shelf platforms such as Digital Signal Processors (DSPs), General Purpose Processors (GPPs), and graphics processors (GPUs). Numerous design decisions must be made in order to achieve efficient communication protocol realizations that respect the given platform capabilities and constraints. For example, an efficient scheduling (mapping and ordering) of the given algorithm kernels onto the available computing units is required. Depending on the target platform, scheduling objectives address different aspects of the final performance metrics, such as latency, throughput, and power consumption. The scheduling problem is classically known to be NP-complete. In practice, designer experience, as well as various heuristics or time-consuming exact algorithms, is applied to derive scheduling solutions.

Following this historical evolution of programmable platforms, developers of signal processing systems are now often required to migrate libraries of kernels that were originally optimized for single core processors to newer families of multicore processors [9, 10]. To simplify this operation for many multicore architectures, programming models and environments are being introduced to take advantage of particular processor types, along with their associated forms of memory hierarchy and communication facilities. One solution is the Model Based Approach (MBA), which requires refinement of the original algorithm to a formal model in order to identify data dependencies between the kernels and different sources of parallelism. Once this identification process is complete, the resulting application model can be analyzed and implemented, either manually or using automated tools, to take advantage of the target parallel platform and reuse prior investments.

An example of prior investments exists in application specific design frameworks such as GNU Radio [7], which provides SDR developers with a rich library and a customized runtime engine to design and test radio applications. GNU Radio is expressive enough to describe audio radio transceivers, distributed sensor networks, and radar systems, and fast enough to run such systems on off-the-shelf radio hardware. In GNU Radio, fast design flows are facilitated by exploiting common application structures for communication systems and rich libraries of elements tailored to them. When applying the MBA to GNU Radio, the required modifications are limited to kernel interface changes needed to adhere to any new design models, and to adaptations of scheduling techniques needed to handle new target platform characteristics.

Preliminary versions of this work were presented in [31] and [26], where a scheduler and a backend module that perform MBA-based migration of SDR systems developed using GNU Radio from single-processor to heterogeneous multiprocessor platforms were explored. By applying formal models to the application and the target architecture, our integration enabled new target-specific optimizations for performance improvement and provided enhanced retargetability. In this paper, we extend these works by presenting the complete workflow, discussing the design considerations involved in implementing SDR primitive blocks on GPUs, and showing how these considerations affect the construction of the heterogeneous multiprocessor scheduler. We also benchmark the overheads associated with GPU-accelerated GNU Radio actors, and demonstrate the speedups that can be achieved with a GPU on a heavily loaded application benchmark.

This paper is organized as follows. In Section 2, the basic workflow steps are explained, followed by a detailed description of the associated application models, architecture models, and multiprocessor scheduling problem. We also survey related work and emphasize the key contributions of this paper. In Section 3, we explain the implementation of our MBA to migrate, schedule, and implement SDR applications onto heterogeneous multiprocessor platforms. In Section 4, we cover implementation details, the backend package added to GNU Radio, and the design space exploration. Finally, we conclude in Section 5.

2 Background

Figure 1 shows a high level illustration of a model based design approach for GNU Radio. Here, a system designer starts by choosing appropriate models to represent the application and the targeted platform. These descriptions, along with basic objectives and constraints, are captured as a dataflow graph and given as input to one or more multiprocessor schedulers. The scheduling solution is then implemented using the GNU Radio engine. Profiling then takes place, and the design space is explored. A major advantage of this kind of design approach is the separation of the application and platform representations. Such separation helps to preserve the efforts used to model and implement different system kernels across different platforms. In this section, we describe different application and platform models and how they are used during the application-to-architecture mapping process. Then we discuss related work that uses model based approaches to develop digital signal processing systems, and finally, we state the contributions of this paper.
Figure 1

Implemented workflow for SDR applications described in GNU Radio.

2.1 Dataflow Applications Models

Dataflow models are widely used in the design, analysis, and implementation of DSP systems. Different models exist to match various types of applications, such as synchronous dataflow (SDF) [15] and cyclo-static dataflow (CSDF) [6] for static applications, and Boolean dataflow (BDF) [8] and core functional dataflow [25] for dynamic applications. A dataflow model of an application captures important data dependency information between system modules.

A dataflow graph G consists of a set of vertices V and a set of edges E. The vertices, or actors, represent computational functions, and edges represent FIFO buffers that can hold data values, which are encapsulated as tokens. Depending on the application and the required level of model-based decomposition, actors may represent simple arithmetic operations, such as multipliers, or more complex operations, such as turbo decoders.

A directed edge e(v1, v2) in a dataflow graph is an ordered pair of a source actor \(v_1 = \mathit{src}(e)\) and sink actor \(v_2 = \mathit{snk}(e)\), where v1 ∈ V and v2 ∈ V. When a vertex v executes or fires, it consumes zero or more tokens from each input edge and produces zero or more tokens on each output edge. Synchronous data flow is a specialized form of dataflow where for every edge e ∈ E, a fixed number of tokens is produced onto e every time \(\mathit{src}(e)\) is invoked, and similarly, a fixed number of tokens is consumed from e every time \(\mathit{snk}(e)\) is invoked. These fixed numbers are represented, respectively, by \(\mathit{prd}(e)\) and \(\mathit{cns}(e)\). Homogeneous Synchronous Data Flow (HSDF) is a restricted form of SDF where \(\mathit{prd}(e) = \mathit{cns}(e) = 1\) for every edge e.

Given an SDF graph G, a schedule for the graph is a sequence of actor invocations. A valid schedule guarantees that every actor is fired at least once, there is no deadlock due to token underflow on any edge, and there is no net change in the number of tokens on any edge in the graph (i.e., the total number of tokens produced on each edge during the schedule is equal to the total number consumed from the edge). If a valid schedule exists for G, then we say that G is consistent. For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule (i.e., a valid schedule that involves a minimum number of actor firings).

This minimal schedule constitutes a unit of execution that we refer to as one iteration of the given SDF graph. Furthermore, associated with any valid schedule S, there is a unique positive integer B, called the blocking factor of S, such that S invokes each actor v exactly B × q(v) times [16]. This operation is also known as vectorization of S.

In general, a consistent SDF graph can have many different valid schedules, and these schedules can differ widely in the associated trade-offs in terms of metrics such as latency, throughput, code size, and buffer memory requirements [5]. Figure 4 shows a typical SDF graph to model the mp-sched benchmark, which we describe later in Section 4.2. The repetition counts for this example are: (SRC, 1), (A11, 1), (A21, 1), (A12, 2), (A22, 2), (SNK, 6).
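To make the computation of repetition counts concrete, the following minimal Python sketch solves the SDF balance equations q(src(e)) · prd(e) = q(snk(e)) · cns(e) for a connected graph. The production and consumption rates in the example are illustrative choices that reproduce the repetition counts listed above; they are not taken from Fig. 4.

```python
from fractions import Fraction
from functools import reduce
from math import lcm

def repetition_counts(vertices, edges):
    """Solve the balance equations q[src]*prd(e) == q[snk]*cns(e)
    for a connected SDF graph; return the minimal integer solution q."""
    q = {vertices[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate rates until all actors are reached
        changed = False
        for src, snk, prd, cns in edges:
            if src in q and snk not in q:
                q[snk] = q[src] * prd / cns
                changed = True
            elif snk in q and src not in q:
                q[src] = q[snk] * cns / prd
                changed = True
    for src, snk, prd, cns in edges:    # consistency check on every edge
        assert q[src] * prd == q[snk] * cns, "inconsistent SDF graph"
    scale = reduce(lcm, (f.denominator for f in q.values()), 1)
    return {v: int(f * scale) for v, f in q.items()}

# Illustrative rates chosen to reproduce the repetition counts above.
vertices = ["SRC", "A11", "A12", "A21", "A22", "SNK"]
edges = [("SRC", "A11", 1, 1), ("A11", "A12", 2, 1), ("A12", "SNK", 3, 1),
         ("SRC", "A21", 1, 1), ("A21", "A22", 2, 1), ("A22", "SNK", 3, 1)]
print(repetition_counts(vertices, edges))
# {'SRC': 1, 'A11': 1, 'A12': 2, 'A21': 1, 'A22': 2, 'SNK': 6}
```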

2.1.1 Pre-Optimized Kernels: GNU Radio

GNU Radio is an open-source engine that includes a collection of many common radio primitives. It allows users to specify a directed acyclic graph (DAG) by using a Python script to instantiate previously-compiled blocks and interconnect them at run-time. These blocks represent common signal processing operations, ranging from digital filters to modulators to forward error correction. A typical flow graph begins with a data source block, proceeds sequentially through a number of signal processing blocks, and then terminates in a data sink block. Between each pair of connected blocks is a buffer that is managed transparently. Buffers are generally refined into implementations that are appropriate for the targeted architecture and operating system. Blocks contain buffer readers and writers that maintain the appropriate pointers into each of their input and output buffers, all of which is typically transparent to the user.

GNU Radio currently has two automated run-time schedulers. The original scheduler is single-threaded: the blocks are executed in topologically sorted order, with each actor running until its input buffer is exhausted. When the last block completes, execution resumes at the first block. The multithreaded scheduler instead instantiates each block in its own thread, with mutex-protected FIFO queues used to pass data between threads.
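For reference, a minimal flow graph of this style can be expressed through GNU Radio’s Python interface roughly as follows. This is a sketch: block and module names follow GNU Radio 3.x conventions and may differ between versions.

```python
from gnuradio import gr, blocks, filter
from gnuradio.filter import firdes

class SimpleChain(gr.top_block):
    """Source -> two FIR stages -> sink; the buffers between blocks
    are created and managed transparently by the runtime."""
    def __init__(self):
        gr.top_block.__init__(self)
        taps = firdes.low_pass(1.0, 1.0, 0.2, 0.05)        # gain, fs, cutoff, transition
        src = blocks.null_source(gr.sizeof_gr_complex)
        head = blocks.head(gr.sizeof_gr_complex, 1 << 20)  # bound the run length
        fir1 = filter.fir_filter_ccf(1, taps)
        fir2 = filter.fir_filter_ccf(1, taps)
        snk = blocks.null_sink(gr.sizeof_gr_complex)
        self.connect(src, head, fir1, fir2, snk)

if __name__ == "__main__":
    SimpleChain().run()    # executed by GNU Radio's multithreaded scheduler
```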

2.2 Architecture Models

Many platforms have been proposed to run SDR applications. A typical solution consists of an RF front end, units that can perform digital signal processing (processors, IP cores, etc.), and interconnection media. The system input is the digitized signal produced by the analog to digital converter. In typical SDR platforms, the processing is executed by multiple heterogeneous processors that are suitable for different actor operations. Actors that perform control functions require processors with complex pipelines and branch prediction units; GPPs (e.g., Intel quad cores) are usually suitable for these actors. Another relevant type of processor is the Single Instruction Multiple Thread (SIMT) type (e.g., NVIDIA GPUs). These processors have less sophisticated cores that are able to apply individual functions to different data sets (e.g., symbol mapping and coding). Many physical layer actors require this kind of data parallelism, and GPUs often exhibit good performance for such actors.

The target platform that we consider consists of a multicore GPP, possibly with one or more SIMD accelerators (e.g., SSE [13] extensions in Intel cores), accompanied by one or more GPUs. All of the processors are assumed to be connected by an all-to-all communication medium. If two dependent actors are allocated on the same processor, data movement takes place through shared memory at zero cost; otherwise, communication occurs across a contention-based communication medium (e.g., the PCI bus). Figure 2 illustrates the architecture and memory hierarchy of a typical CUDA GPU. This device consists of a number of streaming multiprocessors (SMs), where each SM consists of multiple scalar processors (SPs). Following the Single Instruction Multiple Thread (SIMT) paradigm, CUDA kernels are configured as grids of blocks, where every block consists of a number of threads.
Figure 2

GPU memory hierarchy.

The CUDA workflow consists of serial code running on the host machine and parallel kernels running on the device. Initially, the input data resides in host memory. Special functions are provided to copy the data from host to device memory, where the latter can be accessed by all the CUDA blocks.
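As an illustration of this host/device workflow, the following sketch uses the PyCUDA bindings rather than the CUDA C interfaces used by GRGPU; the add_const kernel is a simple stand-in, not one of our actors.

```python
import numpy as np
import pycuda.autoinit                      # creates a context on the default device
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add_const(float *x, float c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += c;                   // one thread per sample
}
""")
add_const = mod.get_function("add_const")

n = 1 << 15                                 # one 32 K-sample chunk
host_buf = np.random.randn(n).astype(np.float32)
dev_buf = cuda.mem_alloc(host_buf.nbytes)   # device memory, visible to all blocks

cuda.memcpy_htod(dev_buf, host_buf)         # host -> device copy
threads = 256
add_const(dev_buf, np.float32(1.0), np.int32(n),
          block=(threads, 1, 1), grid=((n + threads - 1) // threads, 1))
cuda.memcpy_dtoh(host_buf, dev_buf)         # device -> host copy
```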

2.3 Related Work and Contribution

Many previous research efforts have considered facilitation of DSP system implementation on new computing platforms. In this section, related work that considers kernel implementation, application and architecture modeling, and multiprocessor scheduling is surveyed.

The implementation of SDR blocks on GPUs is discussed in [30]. A GPU compiler is described that takes a naive actor implementation written in CUDA [23] and generates an efficient kernel configuration that improves the load balance across the available GPU cores, hides memory latency, and coalesces data movement. This work can be used in our proposed framework to enhance the implementation of individual software radio actors on a GPU.

Previous research to reduce the gap between application and multicore processor modeling is reported in [18]. In this work, the authors develop a new programming language that is able to describe SDR systems and their implementation on Single Instruction Multiple Data platforms.

In [3], the advantages and drawbacks of various models for describing SDR applications are investigated. Different dataflow models that can be applied to various actors of an LTE receiver are demonstrated. In [17], a hierarchical dataflow programming approach is suggested to specify SDR graphs, and Satisfiability Modulo Theories (SMT) solving is used to formulate the scheduling problem in order to increase system throughput subject to platform memory constraints.

Various heuristics and mixed linear programming models have been suggested for scheduling task graphs on homogeneous and heterogeneous processors (e.g., see [22]). In these works, the problem formulations are developed to address different objective functions and target platforms for implementing the input application graphs.

In [32], a dynamic multiprocessor scheduler for SDR applications is described. The basic platform consists of a Universal Software Radio Peripheral (USRP) and a cluster of general purpose processors. A flexible framework for dynamic mapping of SDR components onto heterogeneous multiprocessor platforms is described in [20].

Vectorization for single-processor implementation of SDF graphs has been studied previously (e.g., see [27] and [14]). In [11], automatic “SIMDization” (conversion to a form that utilizes SIMD acceleration on the target processor) of streaming programs written in a general purpose programming style is proposed. A combination of SIMDization techniques with homogeneous multiprocessor scheduling is also discussed.

In contrast with prior work, our approach begins with applications described in a domain specific environment (e.g., GNU Radio), and allows designers to use their existing optimized libraries alongside GPU-accelerated library elements. We target platforms that consist of multiple GPP and GPU components, and systematically integrate SDF vectorization and inter-actor (task-level) parallel scheduling to optimize application throughput and latency on the targeted class of heterogeneous multiprocessor platforms.

In the current GNU Radio engine, a strictly runtime multiprocessor scheduler is used to run applications through dynamic scheduling. However, for a wide range of SDR systems, offline profiling and analysis can be employed to derive efficient scheduling solutions that are computed statically. To exploit such static scheduling opportunities, we provide a Mixed Linear Programming (MLP) formulation for the targeted multiprocessor scheduling problem.

Our new scheduling technique shows how to make use of three levels of parallelism in order to increase the system throughput. Our approach is restricted to acyclic SDF graphs, which can be used to represent a broad class of practical SDR applications and subsystems. Generalization of our techniques to graphs that contain cycles is a useful direction for future work.

The primary contribution of this paper is the detailed development of a novel workflow for scheduling SDF graphs while taking into account actor execution times, efficient vectorization, and heterogeneous multiprocessor execution. This scheduling workflow is targeted carefully towards heterogeneous platforms that employ off-the-shelf GPPs and GPUs, and applications described in a domain-specific language. Moreover, we analyze the overheads of GPU-accelerated GNU Radio actors, which facilitates fast empirical analysis of results generated from our workflow.

3 Integration Workflow

In this section, we describe the steps of our proposed workflow in detail, and provide a new mixed linear programming formulation for heterogeneous multiprocessor scheduling.

3.1 Workflow Description

The design flow of our model based approach proceeds as described in the following steps:
  1. Designers write the model of their SDR application using the appropriate application model, with no consideration for the underlying platform. As the domain specific environment has an execution engine and a library of SDR kernel components, designers can verify correct functionality of their application. Architecture models for the underlying platforms can also be built independently of the application.

  2. If actors of interest are not in the new hardware accelerated library, a designer writes accelerated versions of the actors (e.g., targeting CUDA). The design focuses on exposing the parallelism to match the underlying architecture in a parameterized way.

  3. Either through automated or manual processes, instantiated actors are assigned to the processing resources of the selected heterogeneous platform (e.g., GPP, GPU, or hardware accelerator). This step may be revisited often as part of a system level design space exploration.

  4. The mapping result is utilized by augmenting the original application description environment with required standalone libraries and code generation tools to reach a final system implementation.

The following sections cover these steps in detail, specifically as they relate to our approaches for augmenting the design flow to accommodate new target platforms and design environments, such as CUDA in the GNU Radio environment.

3.1.1 Writing Accelerated Kernels

GNU Radio gives the designer a dataflow facility to describe the application graph. Actors are individually accelerated using GPU specific tools. If an actor of interest is not present in the GPU accelerated library, the developer switches to the GPU customized programming environment, which in our case is CUDA. Other tools, such as OpenCL and Yang’s C-to-CUDA translator [30], can also be used to implement such actors. As we show in Section 3.2, actors are later profiled so that they can be scheduled across the heterogeneous processors. The designer is still saddled with difficult design decisions, but these decisions are localized to a single actor. In this paper, we target applications that can be profiled offline. Many key SDR applications in the GNU Radio benchmarks fall in this category of static systems. This is in fact common for the broader signal processing domain, where static scheduling remains a popular and useful tool in the design process [4, 28, 24].

System level design decisions are orthogonal to this step of the design process. While we do not aim to replace the programming approach used for the actors’ functionality, the following design strategy lends itself to later design space exploration by the developer.

As with other GPU programming environments, in CUDA a designer must divide their application into levels of parallelism: threads and blocks, where threads represent the smallest units of a sequential task to be run in parallel, and blocks are groups of threads. In our experience, SDR actors vary in how they use thread level parallelism, but tend to realize block level parallelism at the sample level. The ability to tightly couple execution between threads within a block creates a host of possibilities for the basic unit of work within a block, be it processing a code word, multiplying and accumulating for a tap, or performing an operation on a matrix. Because blocks are decoupled, only fully independent tasks can be parallelized across blocks. For SDR, those situations tend to arise between channels or between samples on a single channel.

The performance of this parallelization strategy is strongly influenced by the number of channels or the size of the chunk of samples that can be processed at one time. When the application requests processing on a small chunk of samples, there are few blocks to spread across the GPU, leaving it underutilized, while large chunks enable high utilization. The performance difference between small and large chunks is non-linear due to the high fixed latency penalty that both scenarios experience when transferring data to and from the GPU and launching kernels. When chunks are small, GPU time is dominated by transfer time, but when chunks are larger, the computation time of the kernel dominates, which amortizes the fixed penalty. As the application dictates these values, actors must be written in a parameterized way to accommodate inputs of different sizes.
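A back-of-the-envelope model captures this break-even behavior. The constants below are made up for illustration and do not correspond to measured values.

```python
def gpu_time(n, fixed_s=2e-4, per_sample_s=1e-9):
    """Fixed transfer/launch penalty plus compute that scales with chunk size."""
    return fixed_s + n * per_sample_s

def cpu_time(n, per_sample_s=2e-8):
    """No fixed penalty, but a higher per-sample cost."""
    return n * per_sample_s

# The GPU wins only once the chunk is large enough to amortize the fixed penalty.
for n in (1 << 10, 1 << 15, 1 << 20):
    winner = "GPU" if gpu_time(n) < cpu_time(n) else "CPU"
    print(f"chunk = {n:>8} samples -> {winner} faster")
```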

3.1.2 Partitioning, Scheduling, and Mapping

Once actors are written, system level design decisions must be made, such as deciding which actors are to invoke GPU acceleration. For some applications, the best solution may be to offload every actor that is faster on the GPU than it is on the GPP. But in some cases, this greedy strategy fails to recognize work that could occur simultaneously on the GPP while the host thread that issued the kernel call waits for the GPU kernel to finish. A general solution to the problem would consider application features such as firing rates, dependencies, and the execution time of each actor on each platform, as well as architectural features such as the number and types of processing elements, memories, and topology.

When targeting a GPU or other SIMD platform, vectorization must also be considered. More vectorization tends to lead to higher utilization of the platform (and therefore higher throughput), but often at the expense of increased latency and buffer memory requirements. Also, an accelerator typically incurs significant latency to move data to or from the host processor, so sufficient data must be burst to the accelerator to amortize such overheads. Ideally, application designers would simply be presented with a Pareto curve of latency versus vectorization trade-offs so that an appropriate design point can be selected. However, vectorization generally influences the efficiency of a given mapping. Thus, to fully unlock the potential of heterogeneous multiprocessor platforms for DSP systems, an automated way of arriving at quality solutions is desirable. In Section 3.2, we explain a scheduler that accepts the application and architecture descriptions and generates a variety of solutions targeting heterogeneous multiprocessor platforms equipped with SIMD units.

3.1.3 GRGPU: GPU Acceleration in GNU Radio

We have developed a set of GPU accelerated, GNU Radio actors in a separate, stand-alone library called GRGPU. GRGPU extends GNU Radio’s build and install framework to link against libraries in CUDA. The resulting actors may be instantiated alongside traditional GNU Radio actors, meaning that designers may swap out existing actors for GRGPU actors to bring GPU acceleration to existing SDR applications. The traditional GNU Radio actors run unaffected on the host GPP, while the GRGPU actors utilize the GPU.

When writing a new GRGPU actor, application developers start by writing a normal GNU Radio actor, including a C++ wrapper that describes the interface to the actor. The GPU kernels are written in CUDA in a separate file and tied back to the C++ wrapper via C functions such as device_work(). Additional configuration information may be sent in through the same mechanism. For example, the taps of a FIR filter typically need to be updated only once or rarely during execution, so instead of passing the tap coefficients during each firing of the actor (taps sent from work() to device_work() to the kernel call), they could be loaded into device memory only when the taps are updated in GNU Radio. The CUDA compiler, NVCC, is invoked to synthesize C++ code which contains the binaries of the code destined for the GPU, wrapped in glue code formatted as C++. By generating C++ instead of an object file directly, we are able to make use of the standard GNU build process using libtool. Even though the original application description was in a different language, the code is wrapped and built in the GNU standard way, giving it compatibility with previous and future versions of GNU tools and GNU Radio.

When a GNU Radio actor is instantiated, a new C++ object is created which stores and manages the state of the actor. However, state in the CUDA file is not automatically replicated, creating a conflict when more than one GRGPU actor of the same type is instantiated. To work around this issue, we save CUDA (both host and GPU) state inside the C++ actor, which includes GPU memory pointers to data already loaded onto the GPU. The state on the GPU itself is not saved inside the C++ object; rather, the pointers to the device memory are. Data residing in the GPU’s memory space is explicitly managed on the host, so saving GPU pointers is sufficient for keeping the state of the CUDA portion of an actor.

To minimize the number of host-to-GPU and GPU-to-host transfers, we introduce two actors, H2D and D2H, to explicitly move data to and from the device in the flow graph. This allows other GRGPU actors to contain only kernels that produce and consume data in GPU memory. If multiple GPU operations are chained together, data is processed locally, reducing redundant I/O between GPU and host, as shown in Fig. 3. In GNU Radio, the host side buffers still exist and connect the C++ objects that wrap the CUDA kernels. Instead of carrying data, these buffers now carry pointers to data in GPU memory. From the host perspective, H2D and D2H transform host data to and from GPU pointers, respectively.
Figure 3

GRGPU actors between H2D and D2H communicate data using the GPU’s memory, avoiding unnecessary host/GPU transfers.

While having both a host buffer and a GPU buffer introduces some redundancy, it has a number of benefits which make this an attractive solution. First, there is no change to the GNU Radio engine. The GNU Radio engine still manages the data being produced and consumed by each actor, so decisions on chunk size or invocation order do not need to be changed when GRGPU actors are used. Second, GPU buffers may be safely managed by the GRGPU actors. With GPU pointers being sent through host buffers, actors need only concern themselves with maintaining their own input and output buffers. This provides dynamic flexibility (actors can choose to create and free memory for data as needed) or static performance tuning (actors can maintain circular buffers from and to which they read and write fixed amounts of data). Such schemes require coordination between GRGPU actors, and potentially information regarding buffer sizing, but the designer does have the power to manage these performance critical actions without redesigning or changing GRGPU. Future versions of GRGPU could provide designers with a few options regarding these schemes, and even make use of the dataflow schedule or other analysis to make quality design decisions. Finally, no extraneous transfers between GPU and host occur. While the host and GPU buffers mirror each other, no transfers occur between them, which avoids the I/O latencies that can be the cause of application bottlenecks.
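The following plain-Python sketch, which is a conceptual analogue and not the actual GRGPU C++ code, illustrates the pointer-through-host-buffer idea; the H2D and D2H classes stand in for the corresponding GRGPU actors.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

class H2D:
    """Copies a host chunk to the GPU and emits the device pointer."""
    def work(self, host_chunk):
        dev = cuda.mem_alloc(host_chunk.nbytes)
        cuda.memcpy_htod(dev, host_chunk)
        return dev            # the host-side buffer now carries a pointer, not data

class D2H:
    """Consumes a device pointer and copies the results back to the host."""
    def __init__(self, n, dtype=np.float32):
        self.out = np.empty(n, dtype=dtype)
    def work(self, dev_ptr):
        cuda.memcpy_dtoh(self.out, dev_ptr)
        return self.out

# GRGPU actors chained between H2D and D2H would launch kernels directly on
# the device pointer, so samples never round-trip through host memory
# between consecutive GPU stages.
```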

3.2 Multi-Objective Multicore Scheduler

Automated techniques are useful as starting points for leveraging multiprocessing platforms consisting of GPPs (with SSE acceleration) and GPUs (with CUDA acceleration). An important criterion for arriving at quality solutions in this context is the ability to explore a variety of design points efficiently and accurately.

3.2.1 From Application Model to Block Processing DAGs

The throughput of individual actors can be increased by applying actor-level vectorization (also referred to as block processing) [27], which processes the maximum possible number of tokens per actor execution. For an SDF graph, this objective can be achieved by using a flat schedule of the input graph. A flat schedule can be generated by deriving a topological sort and invoking every actor v a number of times equal to B × q(v). While flat schedules have the potential to improve processor utilization and throughput, such schedules generally suffer from high memory usage. However, in this paper our objective is to increase the utilization of SIMD cores, and furthermore, the available memory on a typical GPU is not a constraint for the class of SDR applications that we are targeting. Given an acyclic SDF application graph G, our scheduling approach first generates a directed acyclic graph (DAG), which we call a block processing DAG (BPDAG) T. T is isomorphic to G, meaning that their sets of vertices and edges are in one-to-one correspondence with one another. Each vertex t in T represents a vectorized version of a specific vertex v in G with some vectorization factor k (i.e., t represents k successive invocations of v). We refer to each vertex in a BPDAG as a task.

For platforms that consist of both GPPs and GPUs, different levels of parallelism can generally be exploited in order to improve throughput. First, a fine grain level of data parallelism can be applied by utilizing the SIMD cores available in GPUs and the vector operation accelerators in GPPs (if available). This level can be exploited using vectorization. A more coarse grain form of task parallelism is applied by mapping parallel tasks of the application graph onto the available set of processors. Both forms of parallelism may generally be exploited more effectively when B > 1, where B is the blocking factor. Under such a scheduling approach, the latency for a single graph iteration may increase. However, the latency for a block of B successive graph iterations may be reduced significantly, which leads to an increase in throughput (in terms of executed graph iterations per unit time). Such a trade-off is favorable in many throughput-critical systems, and in applications where the increased latency does not exceed the given latency constraint.

In our workflow, we set the level of global vectorization before the mapping step in order to properly inform the multiprocessor scheduler of the vectorized running time of the actors on each processor type. By doing so, we efficiently utilize the SIMD cores by simultaneously firing multiple graph iterations. The basic multiprocessor scheduler objective is therefore set to minimize the overall latency LB of B graph iterations, which provides an optimized graph execution throughput of B/LB graph iterations per unit time. Here B is a parameter that can be changed flexibly in our framework to help explore the scheduling design space.

The BPDAG is sent to the core of our multiprocessor scheduling engine to perform task mapping (assignment of tasks to cores) and ordering (ordering of tasks assigned to the same core). BPDAG tasks are annotated with their running times, which generally are functions of the vectorization factors. Figure 4 shows an example of transforming the mp-sched SDR example explained in Section 4.2 into its corresponding BPDAG for a blocking factor of 10. This blocking factor determines the vectorization factor used to derive each task in the BPDAG. A sketch of this transformation step is given after the figure.
Figure 4

Example of an SDF graph for the mp-sched benchmark and its corresponding BPDAG.
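A minimal sketch of this SDF-to-BPDAG transformation is given below; the profile function is a hypothetical stand-in for the actor profiles discussed in Section 3.2.2.

```python
def build_bpdag(vertices, edges, q, B, profile):
    """Derive a BPDAG from an acyclic SDF graph: one task per actor,
    vectorized by B * q(v) and annotated with profiled run times.
    profile(v, k, proc) -> run time of k fused firings of v on proc (hypothetical)."""
    tasks = {v: {"vec_factor": B * q[v],
                 "runtime": {p: profile(v, B * q[v], p) for p in ("GPP", "GPU")}}
             for v in vertices}
    # The BPDAG is isomorphic to the SDF graph, so edges carry over directly.
    bp_edges = [(src, snk) for (src, snk, prd, cns) in edges]
    return tasks, bp_edges
```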

3.2.2 Problem Formulation

The input to our multiprocessor scheduler consists of the task graph, a description of the target platform, and profile data (for execution time estimation) for each task. The objectives of the scheduler are to perform task assignment and ordering so as to optimize a given objective function under a given set of constraints. For assignment, the scheduler is responsible for mapping tasks to processors and edges to communication media. Non-preemptive operation is assumed for both tasks and edges (communication). The ordering aspect of scheduling is required if multiple tasks (edges) are assigned to the same processor (medium). Execution times for tasks are the associated BPDAG vertex processing times (as determined by the actor profiles together with the associated vectorization factors), while execution times for edges are estimated as the data communication times. The input to the multiprocessor scheduler consists of the following items.
  (a) Architecture description: The platform is described by a set P of processors and a set β of communication buses.

  (b) Application description: The application model (input BPDAG) consists of a set T of tasks, and edges E.

  (c) Dependency descriptions: Dataflow dependencies are defined by the \(\mathit{src}\) and \(\mathit{snk}\) functions described in Section 2.1.

  (d) Task and edge profiles: The task and edge execution times are obtained by simulating the tasks (edges) on different processors (communication media). These profiles are described by two functions: RTP(t ∈ T, p ∈ P) → R defines the execution time of task t on processor p, and REB(e ∈ E, b ∈ β) → R defines the execution time of edge e on bus b. Here, R is the set of positive real numbers.

  (e) Dependency analysis: Task t1 is said to be dependent on task t2 if there is a path that starts at t1 and ends at t2. If no such path exists between t1 and t2, then they are called parallel tasks. A similar concept can be applied to edges.
The input summarized in items a–e above is sent to the multiprocessor scheduler in order to perform the operations of mapping and ordering.

3.3 Multiprocessor MLP Scheduler

The problem described in Section 3.2.2 can be solved using available heuristics and optimal schedulers. Since offline analysis is suggested for scheduling static applications, a mixed linear programming (MLP) heterogeneous multiprocessor scheduler is proposed in order to find efficient solutions. The MLP scheduler consists of a set of equalities and inequalities that describe the application and architecture graphs, the solution variables, the constraints, and the objective.

3.3.1 Basic Variables

The basic MLP variables in our formulation are as follows.
  • Mapping variables: ∀ t ∈ T and ∀ p ∈ P, \(\mathit{XT}[t,p] = 1\) if task t is assigned to processor p, and \(\mathit{XT}[t,p] = 0\) otherwise. Similarly, ∀ e ∈ E and b ∈ β, \(\mathit{XE}[e,b] = 1\) if edge e is assigned to bus b, and \(\mathit{XE}[e,b] = 0\) otherwise.

  • Ordering variables: ∀ parallel tasks t1 and t2 that are assigned to the same processor, \(\mathit{YT}[t_1, t_2] = 1\) if t1 is scheduled to run before t2, and \(\mathit{YT}[t_1, t_2] = 0\) if t1 is scheduled to run after t2. A similar formulation is applied for parallel edges.

  • Actual running time: ∀ t ∈ T, \(\mathit{RT}[t]\) is the actual (platform-dependent) execution time of task t, depending on its mapping. Similarly, ∀ e ∈ E, \(\mathit{RE}[e]\) is the actual token transfer time for edge e.

  • Start time: ∀ t ∈ T, \(\mathit{ST}[t]\) is the start time for execution of task t. ∀ e ∈ E, \(\mathit{SE}[e]\) is the start time of the data transfer across edge e. These variables are controlled by the dependencies expressed in the BPDAG and the ordering variables.

In this formulation, the basic variables (defined above) are used to derive a number of other variables. These derivations are carried out so that we can use linear equations to “detect” pairs of tasks that are assigned to the same processor. First we define the variables \(\mathit{ZTP}[t_1, t_2, p]\), where t1 ∈ T, t2 ∈ T, p ∈ P, and t1 ≠ t2. \(\mathit{ZTP}[t_1, t_2, p]\) equals one if t1 and t2 are both assigned to p, and equals zero otherwise. Clearly, this variable depends on \(\mathit{XT}[t_1,p]\) and \(\mathit{XT}[t_2,p]\). This dependency can be linearized according to the following constraints:
  • \(\mathit{ZTP}[t_1,t_2,p] \geq \mathit{XT}[t_1,p] + \mathit{XT}[t_2,p] - 1\)

  • \(\mathit{ZTP}[t_1,t_2,p] \leq \mathit{XT}[t_1,p]\)

  • \(\mathit{ZTP}[t_1,t_2,p] \leq \mathit{XT}[t_2,p]\)

The first inequality handles the case in which both tasks are assigned to the same processor, while the other two handle the remaining three cases. It can be shown that these inequalities dominate the problem size, and as a result, they contribute significantly to the time required by the applied solver.

Next, we define another set of variables \(\mathit{ZT}[t_1, t_2]\), where t1 ∈ T, t2 ∈ T, t1 ≠ t2. \(\mathit{ZT}[t_1, t_2]\) equals one if the two tasks t1 and t2 are collocated. These variables can be easily derived by the following inequality:
$$ \mathit{ZT}[t_1, t_2] \geq \sum\limits_{p\in P} \mathit{ZTP}[t_1,t_2,p]. $$

The derived variables \(\mathit{ZT}\) will be used in two cases. First, for collocated parallel tasks, these variables help to adjust the start times of tasks based on their ordering. Second, if a pair of tasks is connected by an edge, then these variables serve to make the corresponding edge transfer time equal to zero, which is appropriate since the communication occurs through processor shared memory.

3.3.2 Constraints

We use the following inequalities to formulate our targeted heterogeneous scheduling problem:
  • Assignment: Every task (edge) is assigned to exactly one processor (communication medium):
    $$ \forall t \in T, \sum\limits_{p\in P} \mathit{XT}[t,p] = 1 $$
    and
    $$ \forall e \in E, \sum\limits_{b\in \beta} \mathit{XE}[e,b] = 1 $$
  • Task running time: ∀ t ∈ T, p ∈ P
    $$ \mathit{RT}[t]\geq \mathit{XT}[t,p]\times \mathit{RTP}[t,p] $$
  • Edge running time: ∀ e ∈ E, b ∈ β
    $$\begin{array}{rll} \mathit{RE}[e]&\geq& \mathit{XE}[e,b] \times \mathit{REB}[e,b] \\ &&-\, K \times \mathit{ZT}[\mathit{src}(e), \mathit{snk}(e)] \end{array}$$
    where K is a very large number. The second term in this inequality models the “edge zeroing process” (i.e., the process of setting an edge’s token transfer time to zero) if the source and the sink tasks of the edge are assigned to the same processor.
  • Starting times for dependent tasks: ∀ e ∈ E
    $$ \mathit{SE}[e] \geq \mathit{ST}[\mathit{src}(e)] + \mathit{RT}[\mathit{src}(e)] $$
    $$ \mathit{ST}[\mathit{snk}(e)] \geq \mathit{SE}[e] + \mathit{RE}[e] $$

    These two inequalities guarantee the proper execution order of dependent tasks, taking into consideration the relevant edge execution times: the transfer across e may start only after the source task of e has finished, and the sink task may start only after the transfer completes.

  • Starting times for parallel tasks: Orderings for parallel tasks can be achieved using an adaptation of an inequality from [2]: ∀ parallel tasks t1 ∈ T, and t2 ∈ T, t1 ≠ t2:
    $$\begin{array}{rll} \mathit{ST}[t_1] &\geq& \mathit{ST}[t_2] + \mathit{RT}[t_2] -K( 1 - \mathit{YT}[t_1,t_2]) \\ &&-\, K( 1 - \mathit{ZT}[t_1, t_2]) \\ \mathit{ST}[t_2] &\geq& \mathit{ST}[t_1] + \mathit{RT}[t_1] -K\times \mathit{YT}[t_1, t_2] \\ &&-\, K( 1 - \mathit{ZT}[t_1, t_2]) \end{array}$$

Note that the last term effectively disables these inequalities if the two tasks are not collocated.

3.3.3 Objective

Finally, the objective function minimized is the total graph latency (makespan) M, which can be specified by:
$$ \forall t \in T, M \geq \mathit{ST}[t] + \mathit{RT}[t] $$

The solution of the formulated MLP problem is then sent to the workflow back-end to generate a running system. In the next section, implementation of different parts of the scheduling workflow is described for SDR systems used in the GNU Radio environment while targeting platforms consisting of off-the-shelf GPPs and GPUs.
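For concreteness, the following sketch expresses the core of the formulation for a toy two-task, two-processor, single-bus instance in Python, using the open-source PuLP package rather than the GNU MathProg/CPLEX toolchain used in our experiments. The profile numbers are made up, and the ordering (YT) constraints are omitted because the two tasks are dependent rather than parallel.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

T, P = ["t1", "t2"], ["GPP", "GPU"]
RTP = {("t1", "GPP"): 4.0, ("t1", "GPU"): 1.0,       # made-up profiled run times
       ("t2", "GPP"): 3.0, ("t2", "GPU"): 2.0}
REB = 1.5          # transfer time of the single edge (t1 -> t2) on the single bus
K = 1e4            # "big-M" constant

prob = LpProblem("bpdag_schedule", LpMinimize)
XT = {(t, p): LpVariable(f"XT_{t}_{p}", cat=LpBinary) for t in T for p in P}
ZTP = {p: LpVariable(f"ZTP_{p}", cat=LpBinary) for p in P}
ZT = LpVariable("ZT", cat=LpBinary)                  # 1 if t1 and t2 are collocated
RT = {t: LpVariable(f"RT_{t}", lowBound=0) for t in T}
ST = {t: LpVariable(f"ST_{t}", lowBound=0) for t in T}
SE = LpVariable("SE", lowBound=0)                    # edge transfer start time
RE = LpVariable("RE", lowBound=0)                    # edge transfer run time
M = LpVariable("M", lowBound=0)                      # makespan

for t in T:
    prob += lpSum(XT[t, p] for p in P) == 1          # each task on one processor
    for p in P:
        prob += RT[t] >= XT[t, p] * RTP[t, p]        # platform-dependent run time
for p in P:                                          # linearized collocation detection
    prob += ZTP[p] >= XT["t1", p] + XT["t2", p] - 1
    prob += ZTP[p] <= XT["t1", p]
    prob += ZTP[p] <= XT["t2", p]
prob += ZT >= lpSum(ZTP.values())
prob += RE >= REB - K * ZT                           # edge zeroing when collocated
prob += SE >= ST["t1"] + RT["t1"]                    # transfer starts after src ends
prob += ST["t2"] >= SE + RE                          # snk starts after the transfer
for t in T:
    prob += M >= ST[t] + RT[t]
prob += M                                            # minimize the makespan
prob.solve(PULP_CBC_CMD(msg=False))
print({t: [p for p in P if XT[t, p].value() == 1] for t in T}, M.value())
# Expected: both tasks collocated on the GPU, makespan 3.0
```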

4 Evaluation

We have implemented the proposed workflow, multiprocessor scheduler, and GNU Radio integration. We have experimented with the framework using the mp-sched benchmark [7] shown in Fig. 5. mp-sched is a synthetic benchmark with a parameterized structure that is representative of a broad class of practical signal flowgraph structures. It describes a flow graph that consists of a rectangular grid of FIR filters. The dimensions of this grid are parameterized by the number of stages (STAGES) and the number of pipelines (PIPES); the total number of FIR filters is thus equal to STAGES × PIPES. In this evaluation, the number of filter taps equals 60. This benchmark represents a non-trivial problem for the multiprocessor scheduler, because all actors in different pipes can be executed in parallel. In practical GNU Radio system design, designer input typically eliminates a significant part of the solution space by restricting the allocation of some actors to specific processors.
Figure 5

MP-sched SDR benchmark.

4.1 GRGPU Profile

While GRGPU facilitates a fast path to implementation, and consequently to design exploration of how to offload functionality to acceleration platforms, it does incur overheads. To benchmark these overheads, we consider two application types: a lightly loaded application graph and a heavily loaded application graph. Intuitively, the lightly loaded graph isolates the minimum overheads associated with using GRGPU, while the heavily loaded application indicates the overheads associated with compute intensive kernels. We benchmark these applications on an NVIDIA GTX 260 in a host machine with a dual core Xeon running at 3.0 GHz.

The structure of both the lightly and heavily loaded application graphs is the same as that shown in Fig. 3. A single source of samples feeds a chain of operations, which for these tests are either all processed by the CPU or all accelerated on the GPU. Samples are chunked into groups of 32 K to be processed by the operations, ensuring that some baseline level of vectorization is available to the actor. In the lightly loaded case, we use an operation that has typical CUDA vectorization capabilities but a small compute load: constant add. In the heavily loaded case, we replace this operation with an operation well optimized for the GPU: a 32 K point FFT. By cascading the FFTs, we are able to simulate significant compute loads with a single actor type. The CPU implementation of the FFT is the SSE accelerated implementation from the GNU Radio library, while the GRGPU implementation is based on the CUFFT library released by NVIDIA.

The results for the two benchmarks are shown in Fig. 6. In the lightly loaded case, the CPU implementation outperforms the GPU accelerated case regardless of how many samples are to be processed. Because there is negligible speedup with the constant add kernel, the GPU implementation does not overtake the performance of the CPU. Instead, there is a fixed latency penalty of 200 ms. This is incurred from a variety of sources including transferring samples to the device, launching the kernel, and collecting the results. There is also time spent in the GRGPU control logic that coordinates the host thread queues with the device queues. While these penalties appear negligible from a throughput standpoint, the latency penalty is currently high. In the heavily loaded benchmark, the acceleration from the CUFFT library almost immediately makes up for the latency overhead. The GPU approaches 40x acceleration over the SSE accelerated CPU implementation.
Figure 6

GRGPU overhead for various benchmarks.

4.2 Scheduler Empirical Results

4.2.1 Test Setup

The input files for the workflow shown in Fig. 1 consist of the following.
  • The SDF application graph described using the dataflow interchange format (DIF) [12] language.

  • Constraints on the blocking factor B, which are intended to be derived from constraints on memory requirements and overall application latency.

  • A platform description that includes the available processors types and the number of processors for each type.

  • Profile information for every actor (edge) on the available processor types (communication media).

The first stage of the workflow consists of the SDF scheduler, which reads the application SDF graph, calculates the repetition count q(v) for every actor v, reads the global blocking factor B, and generates the corresponding BPDAG. Dependency analysis is then performed to set the required values associated with task dependencies. These operations are implemented using the DIF package.

In the second stage, the multiprocessor scheduler input is generated by setting the run-time of every actor and edge. This run-time depends on the number of tokens generated per task invocation. The profiles of every actor for a given processor type are stored as tables that are indexed by the number of produced tokens (a small illustrative fragment is given below). In this way, the tables are consistent with GNU Radio synchronous block descriptions. In this evaluation, this step is repeated for different blocking factors (B values). The MLP formulation is implemented using the GNU MathProg language [19]. This implementation consists of two parts: the problem description and the data section. The problem description specifies the equalities and inequalities presented in Section 3.3 in a parameterized format; the data part, described in Section 3.2.2, changes for every platform and application graph. The MLP problem is solved using the IBM ILOG CPLEX optimizer.
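As an illustration of such profile tables, a hypothetical fragment with made-up timing values might look as follows; the scheduler simply looks up the run time for the number of tokens a task produces per invocation.

```python
# Hypothetical profile tables: run time (in microseconds) of a FIR actor,
# indexed by the number of produced tokens, one table per processor type.
PROFILE = {
    "GPP": {512: 40.0, 2048: 155.0, 8192: 610.0},
    "GPU": {512: 210.0, 2048: 230.0, 8192: 300.0},   # large fixed transfer cost
}

def task_runtime(actor_profile, proc_type, tokens):
    """RTP(t, p): profiled run time of a task on processor type p."""
    return actor_profile[proc_type][tokens]

print(task_runtime(PROFILE, "GPU", 2048))   # -> 230.0
```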

4.2.2 Design Space Exploration

To evaluate our approach empirically, we selected a solution to implement within GNU Radio. Figure 7 shows the mapping and ordering solution for a 2×5 mp-sched graph running on a typical modern platform that consists of 1 GPP (Intel Xeon CPU, 3 GHz), 1 GPU (an NVIDIA GTX 260), and a PCI bus, for a blocking factor B = 2048. According to the model, this is a 55 % performance improvement over an all-GPP implementation and a 19 % improvement over an all-GPU implementation. To validate the model result, we implemented this design within GNU Radio, using profiling for latency per token and ensuring accuracy within 6 decimal places of the existing solution. With GRGPU, our solution provided a 39 % performance improvement over our empirical results for an all-GPP solution, and a 21 % improvement over an all-GPU solution. For this level of vectorization (B = 2048), using both a GPU and a GPP in the implementation provides the best results, as the model indicates.
Figure 7

Gantt chart for 2×5 mp-sched graph on 1 GPP and 1 GPU.

Figure 8 shows a graph of latency per iteration for different vectorization levels. From the characteristic curves of the GPP and GPU implementations of the FIR actor, the GPU is used selectively when the problem is I/O latency bound, but more heavily when sufficient vectorization makes the problem compute bound for the GPU. Using this design space graph, the designer can start by choosing the maximum allowable latency of the DSP application, and then pick the design point that provides the maximum throughput that can be supported for this latency.
Figure 8

Design space for a 2×5 mp-sched graph on 1 GPP and 1 GPU for different blocking factors.

Table 1 shows the solver running time for different mp-sched graphs on various platforms. In Table 1, we have included the improvement (Imp) provided by the mp-sched graphs scheduled for integration with GRGPU over the existing homogeneous CPU implementations in GNU Radio. For the reported solver times, the solution gap ranges from 17 % for the 4×4 graph to 67 % for the 8×8 graph. By the solution gap, we mean the difference between the generally non-realizable results obtained from the real-valued solutions produced by the solver, and the practical results obtained from the corresponding integer valued solutions (derived by rounding the solver solutions). In these experiments, the MLP solver was executed on an Intel Core 2 Duo processor at 3 GHz. We have found that our MLP problem formulation can solve instances with graph sizes of up to 32 nodes in less than a day. Such a one-day turnaround time is acceptable for many coarse grain dataflow design scenarios in the embedded signal processing domain. In such dataflow graphs, actors typically perform higher level signal processing operations (e.g., FFTs or digital filters), and the corresponding implementations are intended to be fixed or modified only very rarely once they are derived (e.g., see [21, 4]). For larger graphs, different scheduling heuristics can be incorporated into our workflow to find efficient solutions (e.g., see [1] for examples of relevant heuristics).
Table 1

Solver results for different mp-sched graphs.

Graph size        Plat. desc.      Imp. (%)   Solver (h)
PIPES   STAGES    GPPs   GPUs
2       5         1      1         55         0.01
4       4         2      2         400        0.49
6       6         3      3         494        3.94
8       8         4      4         398        19.6

5 Conclusion

As designers of software defined radio (SDR) systems attempt to leverage special purpose multicore platforms in complex applications, they need to be able to quickly arrive at an initial prototype to understand the potential performance benefits. In this paper, we have presented a design flow that extends a popular SDR environment, lays the foundation for rigorous analysis based on formal models, and creates a stand-alone library of GPU accelerated actors that can be placed within existing applications. GPU integration into an SDR specific programming environment allows application designers to quickly evaluate GPU accelerated implementations and explore the design space of possible solutions at the system level. We have also shown how efficient utilization of SIMD cores can be achieved by applying extensive block processing in conjunction with efficient mapping and scheduling using an MLP formulation. Useful directions for future work include new graph transformation techniques for handling cyclic graphs, and handling of dynamic dataflow behaviors in addition to SDF graphs. Other useful directions are (1) the extension of GRGPU to multi-GPU platforms by customizing GPU actors to communicate and launch on specific GPUs, and (2) exploring the efficiency of our workflow on different platforms and programming models.

Copyright information

© Springer Science+Business Media New York 2012