This section describes the proposed tool flow, concepts and techniques for the implementation of image processing applications, described in the RVC-CAL dataflow language, on AP SoC devices.
Our developed tool flow for the implementation of image processing applications is shown in Fig. 5. The user's input to the framework is the behavioural description of an image-processing algorithm coded in the RVC-CAL dataflow language. This behavioural implementation is expected to be composed of multiple actors along with an xdf dataflow network description. Some of these actors are selected to execute on soft-cores (one actor per core), thus providing concurrent execution, while the rest run on the host CPUs. By analysing the behavioural description of the algorithm, the software/hardware partitioning of the design is determined; the metrics involved in this decision-making will be discussed later.
Once the actors are split based on their target execution platform, the original xdf file no longer represents the network topology of either of the two sets. Each set of actors must be redesigned separately, its input/output ports fixed, and its own xdf dataflow network description file generated. This can easily be done using the Orcc Development Environment.
The actors to run on the host CPUs are compiled from RVC-CAL to C using the C backend of the Orcc Development Environment. The actors to be accelerated on the proposed IPPro-based multi-core network are first analysed for decomposition and/or SIMD application, and then passed through a compiler framework; both of these important steps will be discussed later. The compilation flow is composed of three distinct steps. The first step investigates the xdf dataflow network file, assigns the actors to the processors on the network, and records the settings each actor needs to communicate with the others to establish the data streams. The second step converts each actor's RVC-CAL code to IPPro assembly code. The final step generates the control register values, mainly for the AXI Lite Registers, and the parameters required by the developed C-APIs running on the host CPUs.
While the interconnects and input/output ports 'between' the FPGA-targeted actors are handled by the compiler, delivering image data to the first-level actors and returning the results from the final-level actors requires additional development work and configuration. Multiple controllers (programmable by the host CPUs) are designed to provide the interface that transfers the image data to the accelerators, gathers the results and transfers them back to the host. This part of the design is currently custom-designed and manually handled in our implementation; a fully-programmable implementation is a subject for future work.
With the host CPUs running their part of the design and setting the control registers and the C control function parameters, the IPPro binary codes of the remaining actors loaded onto the proper cores of the accelerator, and the interface between the software and hardware sections configured accordingly, the system implementation is in place and ready to run.
Software/Hardware Partitioning, Decomposition and SIMD Application
An initial version of a performance analysis tool, or profiler, has been developed and embedded in the partitioning and decomposition tools in order to evaluate how well the decomposed actors will perform on the new architecture. Various static and dynamic profiling tools and techniques exist in the open literature, such as that of Simone et al. [5], who proposed a very beneficial design-space framework for profiling and optimising algorithms which also works with the Orcc development environment; however, that profiler is built for HLS-based designs and is not applicable to our processor-based approach. To develop a profiler for our framework, a cost model, i.e. a set of metrics, has been created as a means of quantifying the effectiveness of the decomposition and mapping of actors to the IPPro network architecture. Realising the cost model requires identifying the architectural parameters/constraints that must be satisfied to achieve high-performance and area-efficient implementations, and determining a method by which a profiler can quantify the identified metrics for performance/area measurements.
For a many-core heterogeneous architecture, the metrics/constraints which are the deciding factors in the partitioning/decomposition process can be categorised as ‘performance-based’ and ‘area-based’. The important performance-based metrics are implemented and discussed here. The area-based metrics are a subject for future work and will be briefly discussed later. The three performance factors to be considered are:
- Actor execution time: the main factor affecting performance, which can be estimated from the actor's code. To find the exact execution time of an actor, it must first be compiled and its instructions counted. Parallelisable actors with the longest delays are the most suitable candidates for acceleration.
- Overheads incurred in transferring the image data to/from the accelerator, which also affect acceleration performance. If an actor requires the entire image to be available for processing, or produces a large amount of data to be transferred to the host CPUs, performance will probably improve by executing it on the host CPUs.
- Average waiting time: the time needed to receive input tokens and send produced tokens to another actor, although this could alternatively be included in the actor's execution time.
Given a dataflow network of a design such as the one shown in Fig. 6 where actors’ execution times are reflected in their shapes, the performance can be analysed by considering its pipeline execution structure. This design has a total of 10 actors arranged in 6 columns where the number of cores in every column varies between 1 and 4. A section of the pipeline of this design is shown in Fig. 7. The three communication overheads considered are:
- Overhead to transfer data from the host CPUs to the accelerator and then distribute it among the cores (OH1);
- Overhead to transfer data between actors through FIFOs (OH2); and
- Overhead to collect the processed data and transfer it back to the host CPUs (OH3).
Using this diagram, a main image processing performance metric, frames/s (fps), can be approximated considering the following features (along with the abbreviation of each feature):
- D: the worst-case delay (in execution clock cycles) across all stages (columns);
- P: the number of pixels in a frame;
- C: the number of pixels consumed on every pass;
- F: the hardware clock rate.
$$ fps \approx \frac{F}{D \times \frac{P}{C}} $$
(1)
In this calculation, the average overhead of the longest actor is included in its execution time. This overhead can usually be ignored, since the quicker, shorter actors will have the input tokens of the actor with the longest delay ready in time. Considering Eq. 1, it can be concluded that to improve the fps, we need to:
- Increase SIMD operations by generating multiple instances of the original design, using the same instruction memory for the corresponding instances and providing appropriate data distribution and collection controllers. This will decrease \(\frac{P}{C}\).
- Decrease the execution times of cores by decomposing them; this will increase the number of columns in the design and hence the degree of parallelism, resulting in a decrease of D. In Figs. 6 and 7, decomposing the actor with the worst-case delay in the 2nd column will improve the performance.
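Eq. 1 and the two improvement strategies can be illustrated with a small numerical sketch. The clock rate, frame size and stage delays below are hypothetical, chosen only to show the trend, not taken from the actual design:

```python
def fps_estimate(F, D, P, C):
    """Approximate frames/s per Eq. 1: fps ~ F / (D * P / C)."""
    return F / (D * (P / C))

# Hypothetical numbers: 100 MHz clock, a 512x512 frame, one pixel
# consumed per pass, and a worst-case stage delay of 40 cycles.
base = fps_estimate(F=100e6, D=40, P=512 * 512, C=1)

# 4-way SIMD replication consumes 4 pixels per pass, so P/C drops 4x.
simd = fps_estimate(F=100e6, D=40, P=512 * 512, C=4)

# Decomposing the slowest actor halves the worst-case stage delay D.
decomposed = fps_estimate(F=100e6, D=20, P=512 * 512, C=1)
```

Under these assumptions, SIMD replication scales fps linearly with C, while decomposition scales it inversely with the worst-case stage delay D.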
The host CPU could be considered as one stage of the dataflow; since its clock rate is higher than that of the FPGA, an actor assigned to it can have a higher execution clock-cycle count and still run in parallel with the shorter actors executing on the FPGA. If multiple short actors (relative to the average execution time expected to satisfy the required performance) are placed sequentially in the dataflow, they can be merged to reduce the overhead of token transfer through FIFOs and also to reduce area utilisation, as fewer cores will be used by the design. If such short actors are placed at the start or end of the flow, they are the best candidates to be partitioned for execution on the host CPU; the three final short actors in Fig. 6 are merged and run on the host CPU, as indicated in Fig. 7. If placed in the middle of the dataflow, the cost of transmission to the host CPU and back to the FPGA will typically be high, and it would be better to accelerate them.
The behavioural description of an algorithm could be coded in different formats:
- No explicitly balanced actors or actions are provided by the user.
- The actors include actions which are balanced and do not depend on each other, e.g. no global variable in an actor is updated by one action and then used by the others. These actions need to be decomposed into separate actors.
- The actors are explicitly balanced and only need to be partitioned for software/hardware execution.
There are two different types of decomposition: 'row-wise' and 'column-wise'. In row-wise decomposition there is no dependence among the newly-generated actors, while in column-wise decomposition the new actors depend on each other. The first case mentioned above will most likely result in column-wise decomposition, and the second in row-wise. Row-wise decomposition is preferred over column-wise, as it incurs no overhead of token transmission between the new actors, whereas in column-wise decomposition this overhead can be a limiting factor in the decomposition process. A combination of the two can also be implemented under certain conditions.
If the actors or actions are not balanced, a number of steps should be taken to decompose them. The main step is to find the basic blocks of the code. A basic block is a sequence of instructions without branches, except possibly at the end, and without branch targets or branch labels, except possibly at the beginning. The first phase of decomposition is breaking the program into basic blocks; some examples of basic blocks in RVC-CAL are the if statement, the while statement, the foreach statement, and assignments. Then the 'balance points' of the actor should be found. The balance points divide the actor into multiple sets of basic blocks such that, if each set is placed in a new actor, the overhead of transferring tokens among the sets will not create a bottleneck and the performance requirements of the algorithm will be satisfied. Where more than one balance point is available for grouping basic blocks, the one with the lower overhead should be used.
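The basic-block identification step can be sketched with the classic leader-based scan. This is a sketch only; the textual instruction forms and the `label`/`if`/`br` prefixes below are hypothetical stand-ins for real RVC-CAL statements:

```python
def split_basic_blocks(instrs):
    """Partition a linear instruction list into basic blocks.

    A new block starts at the first instruction, at every branch
    target (label), and at every instruction following a branch.
    """
    leaders = {0}
    for i, ins in enumerate(instrs):
        if ins.startswith("label"):                    # branch target
            leaders.add(i)
        if ins.startswith(("br", "if")) and i + 1 < len(instrs):
            leaders.add(i + 1)                         # fall-through successor
    order = sorted(leaders)
    return [instrs[s:e] for s, e in zip(order, order[1:] + [len(instrs)])]

prog = ["x := a + b", "if x > 0 goto L1", "y := 0",
        "label L1", "y := y + x", "out := y"]
blocks = split_basic_blocks(prog)   # three basic blocks
```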
Figure 8 shows an example actor, ActorMain.cal, which does not meet the required performance and should be decomposed. The basic blocks of the actor are highlighted in this code. There are two balance points which satisfy the performance requirements; since either of them divides the code into two sets of basic blocks where the second set depends on the first, this is a column-wise decomposition. The balance point chosen should minimise the token transmission through FIFOs; balance point 1 requires one extra token (LocVar1) compared to balance point 2, so balance point 2 is the better choice.
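The balance-point choice thus reduces to picking the cut that moves the fewest tokens through the new inter-actor FIFO. A minimal sketch, with illustrative token counts rather than figures measured from the actual ActorMain.cal:

```python
def best_balance_point(points):
    """Pick the balance point whose cut transfers the fewest tokens
    through the FIFO between the two new actors."""
    return min(points, key=lambda p: p["tokens_across_cut"])

# Hypothetical counts: balance point 1 must also forward the local
# variable LocVar1, so its cut carries one more token.
candidates = [
    {"name": "balance point 1", "tokens_across_cut": 3},
    {"name": "balance point 2", "tokens_across_cut": 2},
]
chosen = best_balance_point(candidates)   # balance point 2
```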
A disadvantage of column-wise decomposition is that unprocessed tokens required by an actor must pass through the preceding actors (for example, the Out4 := In3; assignment in Actor1.cal of Fig. 8), and the processed output tokens produced by first-layer actors must be passed through the following actors (for example, the Out1 := In1; assignment in Actor2.cal of Fig. 8). This adds to the token-transmission overhead of the design. Column-wise decomposition, however, does not require any changes to the ports of the surrounding actors. The communication overhead for the example of Fig. 8 is shown in Fig. 9.
If an actor includes actions which are balanced and independent of each other (with a linear scheduling), or, equivalently, the basic block sets inside 'one' action are independent of each other around the balance point, row-wise decomposition can be applied. In the example shown in Fig. 10, ActorMain.cal has two independent sets of basic blocks around the balance point, so row-wise decomposition can be applied. This type of decomposition does not increase the token transfer overhead compared with the original actor; it only changes the ports through which the tokens are communicated with the adjacent actors in the dataflow graph, so the connecting ports of the neighbouring actors must change to fit the new structure. Figure 11 shows the impact of the decomposition on the port declarations of this example.
Metrics
As mentioned earlier, the metrics involved in partitioning/decomposition are classified as performance-based or area-based. In our implementation we have considered the main system-level performance-based features; however, there are more metrics involved, and the important ones are reviewed in this section. Some of these metrics are currently checked manually in our design, and their automatic application is left to future work. For a many-core heterogeneous architecture, the metrics/constraints involved in the partitioning/decomposition process can be categorised as core-level, network-level and system-level. The important metrics of each level are discussed in the following and summarised in Table 3. To simplify the design process and the multi-core network, every decomposed actor is limited to containing one action and being mapped to one soft-core.
Table 3 Important metrics used in decomposition phase
The important core-level metrics are as follows.
- Actor's number of instructions: a decomposed actor should have a functionality that can be described in 1000 instructions or fewer (limited by the capacity of a single BRAM).
- Actor's average execution time: a measure of the average time needed to compute output tokens after reading input tokens. The reciprocal of the actor execution time is its throughput, a measure of the actual flow of tokens into a core in bits per second.
- Core code efficiency: a measure of code efficiency in terms of the ratio of ALU instructions to non-ALU instructions. Non-ALU instructions are mainly token reads and writes from/to external FIFOs, and NOP instructions.
- Peak register usage: a measure of the maximum number of local-memory registers used by an actor in a single iteration, including input, intermediate and output variables. The current architecture limits this to 32 registers.
- Core bandwidth: the theoretical maximum data rate achievable by a core, which is directly proportional to the ratio of the number of input tokens required by the core to its number of instructions.
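The core-level checks above can be expressed as simple predicates over an actor's compiled instruction counts. A sketch under assumed field names (the record layout and the sample numbers are hypothetical, not part of the actual profiler):

```python
BRAM_INSTR_LIMIT = 1000   # single-BRAM instruction memory capacity
REGISTER_LIMIT = 32       # registers available per IPPro core

def core_level_report(actor):
    """Evaluate the core-level metrics for one decomposed actor."""
    alu = actor["alu_instructions"]
    non_alu = actor["non_alu_instructions"]   # FIFO reads/writes, NOPs
    total = alu + non_alu
    return {
        "fits_instruction_memory": total <= BRAM_INSTR_LIMIT,
        "fits_register_file": actor["peak_registers"] <= REGISTER_LIMIT,
        "code_efficiency": alu / non_alu,
        # proportional to input tokens per instruction
        "relative_bandwidth": actor["input_tokens"] / total,
    }

report = core_level_report({"alu_instructions": 600,
                            "non_alu_instructions": 200,
                            "peak_registers": 24,
                            "input_tokens": 64})
```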
The important network-level metrics are as follows.
- Core utilisation: the multi-core processor array is made up of 4×C interconnected IPPro cores, where C is the number of columns. Each column is locally connected to the next in the PL before passing its results back up to the host (ARM) via the AXI bus. A mapping may not utilise all 4 cores in each column of the datapath.
- Token production/consumption rate: the rate at which tokens are produced or consumed by an actor over a period of time. This factor defines the dynamics of the memory requirements on the interconnect and the workload division between actors, and depends on how the high-level algorithm has been decomposed.
- Level of convergence: a measure of the maximum number of core outputs connected to a single consumer input through the interconnect. With the current interconnect, a consuming core can only receive data from a maximum of four producing cores.
- Level of divergence: similar to the level of convergence, but a measure of the maximum number of consuming cores connected to a single core output through the interconnect.
- Average degree of concurrency: a measure of the average number of actors running ALU operations concurrently. Since the soft-cores execute their code sequentially, like conventional CPUs, the real performance improvement of this design comes from the parallel execution of multiple sequential runs.
The important system-level metric is as follows.
- Frames per second (fps): a high-level analysis will report this for a particular algorithm, including estimated delays associated with the controllers and the host-CPU management software. If a system cannot meet the required fps, it will be deemed a failure. As discussed earlier, Eq. 1 gives an estimate of its value for use in the partitioning/decomposition processes.
Compiler Infrastructure
Our developed compiler infrastructure stage of the dataflow framework, shown in Fig. 5, is composed of three major steps. The first step investigates the xdf dataflow network file generated in the decomposition/SIMD application stage, assigns the actors to the processors on the network, and records the settings each actor needs to communicate with the others to establish the data streams. An actor must also send its tokens in a predefined order to the target actors, which expect them in that order; this ordering is resolved in this first step of the compilation process. The second step converts each actor's RVC-CAL code to IPPro assembly code. Target-specific optimisations are also carried out at this level; for instance, since the IPPro is able to process MUL and ADD operations in a single clock cycle, the compiler replaces consecutive MUL and ADD operations with a single MULADD operation.
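The MULADD replacement can be sketched as a peephole pass over a symbolic instruction list. This is a sketch only; the real IPPro assembly syntax and operand order may differ, and for brevity the pass only checks the ADD's first source operand:

```python
def fuse_muladd(asm):
    """Peephole pass: fuse a MUL whose result feeds the next ADD
    into a single MULADD instruction.

    Instructions are (opcode, (dest, src1, src2)) tuples."""
    out, i = [], 0
    while i < len(asm):
        op, args = asm[i]
        if (op == "MUL" and i + 1 < len(asm)
                and asm[i + 1][0] == "ADD"
                and asm[i + 1][1][1] == args[0]):   # ADD reads MUL's dest
            add_args = asm[i + 1][1]
            # MULADD dest, mul_src1, mul_src2, add_src
            out.append(("MULADD",
                        (add_args[0], args[1], args[2], add_args[2])))
            i += 2
        else:
            out.append((op, args))
            i += 1
    return out

asm = [("MUL", ("r3", "r1", "r2")),     # r3 = r1 * r2
       ("ADD", ("r4", "r3", "r5"))]     # r4 = r3 + r5
fused = fuse_muladd(asm)                # one MULADD, one cycle
```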
As will be explained later, a Zynq device has been used as the target in our project. The compiler is responsible for generating the settings of the AXI Lite Registers, based on the algorithm, to help the controllers distribute the tokens among the cores and gather the produced results. In addition, some C control functions have been developed which, depending on the algorithm, manage the implementation of the design; the parameters required by these functions are also generated by this compiler.