HipaccVX: Wedding of OpenVX and DSL-based Code Generation

Writing programs for heterogeneous platforms optimized for high performance is hard since this requires the code to be tuned at a low level with architecture-specific optimizations that are often based on fundamentally differing programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard that is based on a graph-execution model. Yet, OpenVX' algorithm space is constrained to a small set of vision functions. This hinders accelerating computations that are not included in the standard. In this paper, we analyze OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing Domain-Specific Language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that are not possible to detect with OpenVX graph implementations using the standard computer vision functions. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% for our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.


Introduction
The emergence of cheap, low-power cameras and embedded platforms has boosted the use of smart systems with Computer Vision (CV) capabilities in a broad spectrum of markets, ranging from consumer electronics, such as mobile devices, to real-time automotive applications and industrial automation, e.g., semiconductors, pharmaceuticals, and packaging. The global machine vision market size was already valued at $16.0 billion in 2018 and is expected to reach $24.8 billion by 2023 [2]. A CV application might be implemented on a great variety of hardware architectures ranging from Graphics Processing Units (GPUs) to Field Programmable Gate Arrays (FPGAs), depending on the domain and the associated constraints (e.g., performance, power, energy, and cost). Yet, for sophisticated real-life applications, the best trade-off is often achieved by heterogeneous systems incorporating different computing components that are specialized for particular tasks.
Optimizing CV programs to achieve high performance on such heterogeneous systems usually goes along with sacrificing readability, portability, and modularity. The programs need to be tuned at a low level with architecture-specific optimizations that are typically based on drastically different programming paradigms and languages (e.g., parallel programming of multicore processors using C++ combined with OpenMP; vector data types, libraries, or intrinsics to utilize the SIMD units of a CPU; CUDA or OpenCL for programming GPU accelerators; hardware description languages such as Verilog or VHDL for targeting FPGAs). Partitioning a program across different computing units, and accordingly, synchronizing the execution is difficult. In order to achieve these ambitious goals, high development effort and architecture expert knowledge are required.
In 2014, the Khronos Group released OpenVX as a C-based API to facilitate cross-platform portability not only of the code but also of the performance for CV applications [29]. This is momentous since OpenVX is the first (royalty-free) standard for a graph-based specification of CV algorithms. Yet, OpenVX' algorithm space is constrained to a relatively small set of vision functions. Users are allowed to instantiate additional code in the form of custom nodes, but these cannot be analyzed at the system level by the graph-based optimizations applied by an OpenVX back end. Additionally, this shifts the burden of optimization to users, who supposedly should not have to consider performance optimizations at all. Standard programming languages such as OpenCL do not offer performance portability across different computing platforms [24,4]. Therefore, user code, even when optimized for one specific device, might not provide the expected high performance when compiled for another target device. These deficiencies are listed in Table 1.
A solution to the problems mentioned above is offered by the community working on Domain-Specific Languages (DSLs) for image processing. Recent works show that excellent results can be achieved when high-level image processing abstractions are specialized to a target device via modern metaprogramming, compiler, or code generation approaches [18,8,10]. These DSLs are able to generate code from a set of algorithmic abstractions that lead to high-performance execution for diverse types of computing platforms. However, existing DSLs lack formal verification; hence, they do not ensure the safe execution of a user application, whereas OpenVX is an industrial standard.
In this paper, we couple the advantages of DSL-based code generation with OpenVX (summarized in Table 1). We present a set of abstractions that are used as basic building blocks for expressing OpenVX' standard CV functions. These building blocks are suitable for generating optimized, device-specific code from the same functional description, and are systematically utilized for graph-based optimizations. In this way, we achieve performance portability not only for OpenVX' CV functions but also for user-defined kernels that are expressed with these computational abstractions. The contributions of this paper are summarized as follows:
- We systematically categorize and specify OpenVX' CV functions by high-level abstractions that adhere to distinct memory access patterns (see Section 4).
- We propose a framework called HipaccVX, which is an OpenVX implementation that achieves high performance for a wide variety of target platforms, namely, GPUs, CPUs, and FPGAs (see Section 5).
- HipaccVX supports the definition of custom nodes (i.e., user-defined kernels) based on the proposed abstractions (see Section 5.1).
- To the best of our knowledge, our approach is the first one that allows for graph-based optimizations that incorporate not only standard OpenVX CV nodes but also user-defined custom nodes (see Section 5.2), i.e., optimizations across standard and custom nodes.

Related Work
The OpenVX specification is not constrained to a certain memory model, as OpenCL and OpenMP are, and therefore enables better performance portability than traditional libraries such as OpenCV [19]. It has been implemented by a few major vendors, including Nvidia, Intel, AMD, and Synopsys [30]. The authors of [5,33,9,34,27] focus on graph scheduling and design space exploration for heterogeneous systems consisting of GPUs, CPUs, and custom instruction-set architectures. Unlike the prior work, [26] suggests static OpenVX compilation for low-power embedded systems instead of runtime-library implementations. Our work is similar to this since we statically analyze a given OpenVX application and combine the benefits of domain-specific code generation approaches [18,8,10,21,15,3]. Halide [18], Hipacc [8], and PolyMage [10] are image processing DSLs that provide language constructs and scheduling primitives to generate code that is optimized for the target device, e.g., CPUs and GPUs. Halide [18] decouples the algorithm description from scheduling primitives, e.g., vectorization and tiling, while Hipacc [8] and PolyMage [10] implicitly apply these optimizations on a graph-based description similar to OpenVX. CAPH [22], RIPL [25], and Rigel [6] are image processing DSLs that generate optimized code for FPGAs. Hipacc-FPGA [21] supports the HLS tools of both Xilinx and Intel, while Halide-HLS [15], PolyMage-HLS [3], and RIPL only target Xilinx devices. CAPH relies upon the actor/dataflow model of computation to generate VHDL or SystemC code. Our approach could also be used to implement OpenVX with these image processing DSLs.
There is no publicly available OpenVX implementation for Xilinx FPGAs to the best of our knowledge. Intel OpenVINO [7] provides a few example applications that are specific to Arria-10 FPGAs. Taheri et al. [28] provide some initial results for FPGAs, where the main attention is on the scheduling of statistical kernels (e.g., histogram). The image processing DSLs in [21,3] use similar techniques to implement user applications as a streaming pipeline. Section 5.2.1 shows how to employ these techniques for the OpenVX API. Omidian et al. [12] present a heuristic algorithm for the design space exploration of OpenVX graphs for FPGAs. This algorithm could be simplified by using HipaccVX' abstractions (see Section 4) instead of OpenVX' CV functions. Then it could be used in conjunction with HipaccVX to explore the design space of hardware/software platforms. Moreover, Omidian et al. [11] suggest an overlay architecture for FPGA implementations of OpenVX. The proposed overlay implementation requires optimized implementations of OpenVX' CV functions, which could be generated by HipaccVX. Furthermore, an overlay architecture based on HipaccVX' abstractions, which constitute a smaller set of functions compared to OpenVX' CV functions, could reduce the resource usage in [11].
Intel's OpenVX implementation [1] is the first work extending the OpenVX standard with an interoperability API for OpenCL. This is supported in OpenVX v1.3 [32]. Yet, performance portability still cannot be assured for custom nodes: an OpenCL code tuned for a specific CPU might perform very poorly on FPGA and GPU architectures [24,4]. In contrast to our approach, the performance of this solution relies on the user code.

OpenVX and Image Processing DSLs
In the following Sections 3.1 and 3.2, we briefly explain the programming models of OpenVX and image processing DSLs, respectively. Then, we discuss the complementary features of these approaches in Section 3.3, which are the motivation of this work.

OpenVX programming model
OpenVX is an open, royalty-free C-based standard for the cross-platform acceleration of computer vision applications. The specification does not mandate any optimizations or requirements on device execution; instead, it concentrates on software abstractions that are freed from low-level, platform-specific declarations. The OpenVX API is totally opaque; that is, the memory hierarchy and device synchronization are hidden from the user. Typically, platform experts of the individual hardware vendors provide optimized implementations of the OpenVX API [30]. Listing 1 shows an example OpenVX code for a simple edge detection algorithm, for which the application graph is shown in Figure 2. An application is described as a Directed Acyclic Graph (DAG), where nodes represent CV functions (see Lines 14 to 18) or data objects, e.g., images and scalars (see Lines 4 to 12), while edges represent the dependencies between nodes. All OpenVX objects (i.e., graph, node, image) exist within a context (Line 1). A context keeps track of the allocated memory resources and promotes implicit freeing mechanisms at release calls (Line 24). A graph (Line 2) solely operates on the data objects attached to the same context.
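Since Listing 1 is referenced throughout this section but not reproduced here, the following sketch illustrates how such an edge-detection graph is typically assembled with the standard OpenVX C API; the image dimensions, the exact choice of nodes, and the shift scalar are illustrative assumptions rather than the verbatim listing.

  // Sketch of an OpenVX edge-detection graph (illustrative, not the verbatim Listing 1).
  vx_context context = vxCreateContext();                        // context (cf. Line 1)
  vx_graph   graph   = vxCreateGraph(context);                   // graph (cf. Line 2)

  vx_uint32 width = 1024, height = 1024;                         // assumed image size
  vx_image in  = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);   // non-virtual input
  vx_image out = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);   // non-virtual output
  vx_int32  shift_val = 0;
  vx_scalar shift = vxCreateScalar(context, VX_TYPE_INT32, &shift_val);

  // Virtual images hold intermediate results and are not host-accessible (cf. Lines 9 to 12).
  vx_image virt[3];
  for (int i = 0; i < 3; ++i)
    virt[i] = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);

  // CV function nodes (cf. Lines 14 to 18): blur, horizontal/vertical derivative, depth conversion.
  vxGaussian3x3Node(graph, in, virt[0]);
  vxSobel3x3Node(graph, virt[0], virt[1], virt[2]);              // only virt[1] is consumed below
  vxConvertDepthNode(graph, virt[1], out, VX_CONVERT_POLICY_SATURATE, shift);

  vxVerifyGraph(graph);                                          // verification (cf. Line 20)
  vxProcessGraph(graph);                                         // execution (cf. Line 22)
  vxReleaseContext(&context);                                    // implicit freeing (cf. Line 24)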
Data objects that are used only for the intermediate steps of a calculation, and thus can remain inaccessible to the rest of the application, should be specified as virtual by the user. Virtual data objects (i.e., virtual images defined in Lines 9 to 12) cannot be accessed via read/write operations. This paves the way for system-level optimizations applied in a platform-specific back end, e.g., concerning host-device data transfers or memory allocations [19].
The execution is not eager; an OpenVX graph must be verified (Line 20) before it is executed (Line 22). The verification ensures the safe execution of the graph and resolves the implementation types of virtual data objects. The OpenVX standard mandates that a verification procedure must, at the minimum, (i) validate the node parameters (i.e., presence, directions, data types, range checks) and (ii) assure the graph connectivity (detection of cycles) [31]. Optimizations of an OpenVX back end should be performed during the verification phase. The verification is considered to be an initialization procedure and might restructure the application graph before the execution. A verified graph can be executed repeatedly for different input parameters (e.g., a new frame in video processing).

Deficiencies of OpenVX
As mentioned above, the OpenVX standard relieves an application programmer from low-level, implementation-specific descriptions, and thus enables portability across a variety of computing platforms. In OpenVX, the smallest component to express a computation is a graph node (e.g., vxGaussian3x3Node) from the set of base CV functions. However, these CV functions are restricted to a small set since OpenVX has a tight focus on cross-platform acceleration [32]. Custom nodes can be added to extend this functionality, but they leave the following issues unresolved: (i) users are responsible for the performance of a custom node, even though they supposedly should not have to consider performance optimizations; (ii) portability of performance cannot be enabled for the cross-platform acceleration of user code; (iii) the graph optimization routines cannot analyze custom nodes.
For instance, consider Figure 4 that depicts an OpenVX application graph with three CV function nodes (red) and a user-defined kernel node (blue). A GPU back end would offer optimized implementations of the vxNodes (e.g., Gauss), but the user code (custom node) is a black box for the graph optimizations.
Programming models such as OpenCL can be used to implement custom nodes. This enables functional portability across a great variety of computing platforms. However, the user must have expertise in the target architecture in order to optimize an implementation for high performance. Furthermore, OpenCL cannot assure the portability of performance since the code needs to be tuned according to the target device, e.g., by the usage of device-specific synchronization primitives, the exploitation of texture memory if available, the usage of vector operations, or different numbers of hardware threads [24,4]. In fact, an OpenCL code optimized for an Instruction Set Architecture (ISA) has to be ultimately rewritten for an FPGA implementation in order to deliver high performance [13].

Image Processing DSLs
Recently proposed DSL compilers for image processing, such as Halide [18], Hipacc [8], and PolyMage [10], enable the portability of high performance across varying computing platforms. All of them take as input a high-level, functional description of the algorithm and generate platform-specific code tuned for the target device. In this work, we use Hipacc to present our approach.
Hipacc provides language constructs that are embedded into C++ for the concise description of computations. Applications are defined in a Single Program, Multiple Data (SPMD) context, similar to kernels in CUDA and OpenCL. For instance, Listing 2 shows the description of a discrete Gaussian blur filter application. First, a Mask is defined in Line 7 from a constant array. Then, input and output Images are defined as C++ objects in Lines 12 and 13, respectively. Clamping is selected as the image boundary handling mode for the input image in Line 16. The whole input and output images are defined as Regions of Interest (ROIs) by the Accessor and IterationSpace objects that are specified in Lines 17 and 20, respectively. Finally, the Gaussian kernel is instantiated in Line 23 and executed in Line 24.
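Listing 2 itself is not reproduced in this excerpt; the following sketch, reconstructed from the structure described above, shows how such a Gaussian blur typically looks in Hipacc. The class name, coefficient values, and image size are illustrative assumptions, and minor syntax may differ between Hipacc versions.

  // Sketch of a Hipacc Gaussian blur (illustrative, not the verbatim Listing 2).
  #include <vector>
  #include "hipacc.hpp"

  using namespace hipacc;

  class GaussianBlur : public Kernel<uchar> {
    private:
      Accessor<uchar> &input;
      Mask<float> &mask;
    public:
      GaussianBlur(IterationSpace<uchar> &iter, Accessor<uchar> &input, Mask<float> &mask)
          : Kernel(iter), input(input), mask(mask) { add_accessor(&input); }
      void kernel() {
        // local operator: weighted sum over the 3x3 neighborhood defined by the mask
        output() = (uchar)(convolve(mask, Reduce::SUM, [&] () -> float {
                     return mask() * input(mask);
                   }) + 0.5f);
      }
  };

  int main() {
    const int width = 1024, height = 1024;                 // assumed image size
    const float coef[3][3] = { { 0.0625f, 0.125f, 0.0625f },
                               { 0.125f,  0.25f,  0.125f  },
                               { 0.0625f, 0.125f, 0.0625f } };
    std::vector<uchar> host_in(width * height, 0);         // placeholder input data

    Mask<float> mask(coef);                                // mask from a constant array (cf. Line 7)
    Image<uchar> in(width, height, host_in.data());        // input/output images (cf. Lines 12, 13)
    Image<uchar> out(width, height);
    BoundaryCondition<uchar> bound(in, mask, Boundary::CLAMP);  // clamping (cf. Line 16)
    Accessor<uchar> acc(bound);                            // ROI over the whole input (cf. Line 17)
    IterationSpace<uchar> iter(out);                       // ROI over the whole output (cf. Line 20)

    GaussianBlur filter(iter, acc, mask);                  // kernel instantiation (cf. Line 23)
    filter.execute();                                      // kernel execution (cf. Line 24)
    return 0;
  }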

Combining OpenVX with Image Processing DSLs
Our solution to the challenges posed in Section 3.1.1 is to introduce an orthogonal set of so-called computational abstractions that enables high-performance implementations for a variety of computing platforms (such as CPUs, GPUs, and FPGAs), similar to the DSLs discussed in Section 3.2. These abstractions should be used to implement OpenVX' CV functions and, at the same time, be offered to the user for the definition of custom nodes.
Assume that the geometric shapes in Figure 4 represent the abstractions above. By implementing both the OpenVX CV functions and the custom node with the same basic building blocks (the different geometric shapes in the figure), a consistent graph is constructed for the implementation. Consequently, the problem of instantiating the user code as a black box is eliminated. Likewise, assume that all the CV functions of the OpenVX code in Listing 1 are implemented by using the computational abstractions called point and local (explained in Section 4). Then, its application graph (Figure 2) transforms into the implementation graph shown in Figure 3. This implementation graph can be used for target-specific optimizations and code generation similar to the DSL compiler approaches for image processing.
In this paper, we implement the OpenVX standard by the computational abstractions explained in Section 4. We accomplish this task by developing a back end for OpenVX using Hipacc (an existing image processing DSL) instead of standard programming languages. In this way, we get the best of both worlds (the OpenVX standard and DSL-based code generation). Our approach relies on OpenVX' industry-standard graph specification and enables DSL-based code generation. The user is offered well-known CV functions as well as DSL elements (i.e., programming constructs and abstractions) for the description of custom nodes. As a result, programmers can write functional descriptions for custom nodes without being concerned about performance; as a consequence, this allows writing performance-portable OpenVX programs for a larger algorithm space.

Computational Abstractions
We have analyzed OpenVX' CV functions and categorized them into the computational abstractions summarized in Table 2. The categorization is mainly based on three groups of operators (presented in Figure 5): (i) point operators that compute an output from one input pixel, (ii) local operators that depend on the neighboring pixels within a certain region, and (iii) global operators, where an output might depend on the whole input image. We have identified the following patterns for the global operators: (a) reduction: traverses an input image to compute one output (e.g., max, mean), (b) histogram: categorizes (maps) input pixels to bins according to a binning (reduce) function, (c) scaling: downsizes or expands input images by interpolation, (d) scan: each output pixel depends on the previous output pixel. Warp, transpose, and matrix multiplication are denoted as global operator blocks.
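To make the first category concrete, the following sketch shows a point operator (a binary threshold) expressed with the same Hipacc-style syntax as the Gaussian blur sketched in Section 3.2; the class name and the threshold value are illustrative assumptions, not an OpenVX CV function.

  // Sketch of a point operator (binary threshold) in Hipacc-style syntax;
  // class name and threshold are illustrative.
  class Threshold : public Kernel<uchar> {
    private:
      Accessor<uchar> &input;
      uchar thresh;
    public:
      Threshold(IterationSpace<uchar> &iter, Accessor<uchar> &input, uchar thresh)
          : Kernel(iter), input(input), thresh(thresh) { add_accessor(&input); }
      void kernel() {
        // each output pixel depends only on the input pixel at the same coordinate
        output() = input() > thresh ? 255 : 0;
      }
  };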
Through the introduction of these node-internal computational abstractions, our approach enables additional optimizations that manipulate the computation itself, beyond graph-level techniques such as scheduling [27] and hardware-software partitioning [28]. An abstraction-based implementation allows expressing aggregated computations as part of the reconstructed graph. In this way, an implementation graph as well as an application graph can be expressed using the same graph structure. Furthermore, using the proposed set of abstractions reduces code duplication compared to typical approaches, where the libraries are implemented using hand-written CV functions. For instance, 36 of OpenVX' CV functions can be implemented solely with the description of point and local operators, as shown in Table 2; that is, a few highly optimized building blocks for a single target platform (e.g., a GPU) can be reused.

The HipaccVX Framework
In this paper, we developed a framework, called HipaccVX, which is a DSL-based implementation of OpenVX. We extended the OpenVX specification with Hipacc code interoperability (see Section 5.1) such that programmers are allowed to register Hipacc kernels as custom nodes of OpenVX programs. The HipaccVX framework consists of an OpenVX graph implementation and optimization routines that verify and optimize input OpenVX applications (see Section 5.2). Ultimately, it generates device-specific code for the target platform using Hipacc's code generation. The tool flow is presented in Figure 1.

DSL Back End and User-Defined Kernels
OpenVX mandates the verification of parameters and of the relationship between input and output parameters, as presented in Listing 4. There, first, a user kernel and all of its parameters must be defined (Lines 6 to 26). Then, a custom node is created by vxCreateGenericNode (Line 30) after the user kernel has been finalized by a vxFinalizeKernel call (Line 27). The kernel parameter types are defined and the node parameters are set by vxAddParameterToKernel (Lines 20 to 26) and vxSetParameterByIndex (Lines 31 to 33), respectively.
We extended OpenVX with the vxHipaccKernel function (Line 6) to instantiate a Hipacc kernel as an OpenVX kernel. The Hipacc kernels are written in a separate file and added as generic nodes according to the OpenVX standard [32]. Programmers do not have to describe the dependencies between Hipacc kernels as in Listing 2; instead, they write a regular OpenVX program to describe an application graph. This preserves the custom node definition procedure of OpenVX. Ultimately, the HipaccVX framework verifies and optimizes a given OpenVX application, generates the corresponding Hipacc code, and employs Hipacc for device-specific code generation.
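As an illustration of this registration flow, the sketch below combines the standard OpenVX calls named above with the vxHipaccKernel extension; since the exact signature of vxHipaccKernel is not reproduced in this excerpt, its arguments (a Hipacc kernel file and a parameter count) as well as the image names are assumptions.

  // Hypothetical sketch of registering a Hipacc kernel as a custom OpenVX node.
  // The vxHipaccKernel arguments are assumed; all other calls are standard OpenVX API.
  vx_kernel hipacc_kernel = vxHipaccKernel(context, "threshold.hpp", 2);   // assumed signature

  // Declare the parameter signature (cf. vxAddParameterToKernel, Lines 20 to 26 of Listing 4).
  vxAddParameterToKernel(hipacc_kernel, 0, VX_INPUT,  VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
  vxAddParameterToKernel(hipacc_kernel, 1, VX_OUTPUT, VX_TYPE_IMAGE, VX_PARAMETER_STATE_REQUIRED);
  vxFinalizeKernel(hipacc_kernel);                                          // cf. Line 27

  // Instantiate the custom node in the application graph and bind its parameters.
  vx_node node = vxCreateGenericNode(graph, hipacc_kernel);                 // cf. Line 30
  vxSetParameterByIndex(node, 0, (vx_reference)virt_in);                    // cf. Lines 31 to 33
  vxSetParameterByIndex(node, 1, (vx_reference)virt_out);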
OpenVX' CV functions are implemented as a library by using our extension for Hipacc code instantiation. For instance, the HipaccVX implementation of the vxGaussian3x3Node API is shown in Listing 4. Users can simply use these CV functions as in Listing 1. A minority of OpenVX functions are implemented as OpenCV kernels since they cannot be fully described in Hipacc. These are listed in Table 2 with a Software label instead of a Hipacc abstraction type. As future work, we can extend Hipacc to support these functions.

Optimizations Based on Code Generation
We inherited many device-specific optimization techniques by implementing a Hipacc back end for OpenVX. Hipacc internally applies several optimizations during code generation from its DSL abstractions. These include memory padding, constant propagation, utilization of textures, loop unrolling, kernel fusion, thread coarsening, implicit use of unified CPU/GPU memory, and the integration with CUDA Graph [8,20,16,17]. At the same time, Hipacc targets Intel and Xilinx FPGAs using their High-Level Synthesis (HLS) tools. There, an input application is implemented through application circuits derived from the DSL abstractions and optimized by hardware techniques such as pipelining and loop coarsening [21,13,14].

OpenVX Graph and System-Level Optimizations
As mentioned before, an OpenVX application is represented by a DAG G_app = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges denoting data dependencies between nodes. The set of vertices V can further be divided into two disjoint sets D and N (V = D ∪ N, D ∩ N = ∅) denoting data objects and CV functions, respectively.
Both data (e.g., Image, Scalar, Array) and node (i.e., CV function) objects are implemented as C++ classes that inherit from the OpenVX Object class. The vertices v ∈ V of our OpenVX graph implementation consist of OpenVX Object pointers. The verification phase first checks that an application graph G_app (derived from the user code, see, e.g., Listing 1) does not contain any cycles. Then, it verifies that the description is a bipartite graph, i.e., that every edge connects a data object with a CV function node. Finally, the verification phase applies the following optimizations:
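A minimal sketch of these two structural checks is given below; it assumes a simple adjacency-list representation with a boolean is_data flag per vertex rather than the actual HipaccVX classes, which are not reproduced here.

  // Minimal sketch of the structural checks performed during verification,
  // assuming a simple adjacency-list graph (not the actual HipaccVX classes).
  #include <functional>
  #include <vector>

  struct Vertex { bool is_data; std::vector<int> succ; };  // data object or CV function node

  // Cycle detection via depth-first search with three-color marking.
  static bool has_cycle(const std::vector<Vertex> &g) {
    enum { WHITE, GRAY, BLACK };
    std::vector<int> color(g.size(), WHITE);
    std::function<bool(int)> visit = [&](int v) {
      color[v] = GRAY;
      for (int s : g[v].succ) {
        if (color[s] == GRAY) return true;                  // back edge found -> cycle
        if (color[s] == WHITE && visit(s)) return true;
      }
      color[v] = BLACK;
      return false;
    };
    for (int v = 0; v < (int)g.size(); ++v)
      if (color[v] == WHITE && visit(v)) return true;
    return false;
  }

  // Bipartite check: every edge must connect a data object with a CV function node.
  static bool is_bipartite(const std::vector<Vertex> &g) {
    for (const auto &v : g)
      for (int s : v.succ)
        if (v.is_data == g[s].is_data) return false;
    return true;
  }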

Reduction of Data Transfers
Data nodes of an application graph that are not virtual must be accessible to the host, while the intermediate (virtual) points of a computation should be stored in the device memory. We distinguish these two data node types by the set of non-virtual data nodes D_nv and the set of virtual data nodes D_v (with D = D_nv ∪ D_v). HipaccVX keeps this information in its graph implementation and determines the subgraphs between non-virtual data nodes, which can be kept in the device memory. In this way, data transfers between host and device are avoided.
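The sketch below illustrates this bookkeeping under a simplified graph representation similar to the one above, extended by an is_virtual flag (an assumption of this sketch): only non-virtual data nodes require host accessibility, so subgraphs bounded by them can keep their intermediate images in device memory.

  // Sketch: flag which data nodes require host<->device transfers.
  // Uses a simplified vertex structure with an additional is_virtual flag (assumed).
  #include <vector>

  struct DataVertex { bool is_data; bool is_virtual; std::vector<int> succ; };

  static std::vector<bool> needs_host_transfer(const std::vector<DataVertex> &g) {
    std::vector<bool> transfer(g.size(), false);
    for (size_t v = 0; v < g.size(); ++v)
      // virtual data nodes stay in device memory; only non-virtual ones are copied
      transfer[v] = g[v].is_data && !g[v].is_virtual;
    return transfer;
  }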

Elimination of Dead Computations
An application graph may consist of nodes that do not affect the results. Inefficient user code or other compiler transformations might introduce such dead code. A less apparent reason is the use of OpenVX' compound CV functions for smaller tasks. Consider Sobel3x3 as an example, which computes two images, one for the horizontal and one for the vertical derivative of a given image. As the OpenVX API does not offer these algorithms separately, programmers have to call Sobel3x3 even when they are only interested in one of the two resulting images. Our implementation is based on abstractions and allows a better analysis of the computation compared to OpenVX' CV functions, e.g., the Sobel API is implemented by two parallel local operators as shown in Figure 3. HipaccVX optimizes a given application graph using the procedure described in Algorithm 1. Conventional compilers cannot detect this redundancy in the host/device execution paradigm (e.g., OpenCL, CUDA), where OpenVX kernels are offloaded to an accelerator device and the device kernels are launched by the host according to the application dependencies (see Section 6.2).
Algorithm 1 assumes that the non-virtual data nodes whose input and output degrees are zero must be the inputs (D_in) and the results (D_out) of an application, respectively. Other non-virtual data nodes could also serve as inputs or outputs; supporting them requires only an adaptor and no change in the application graph [23]. In the worst case, the graph has |V| − 2 output data nodes; that is, the complexity of Algorithm 1 grows with the number of output data nodes.
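Algorithm 1 itself is not reproduced in this excerpt; the sketch below shows one way such dead-computation elimination can be realized under the assumptions just stated: a reverse traversal from every result node marks the live vertices, and everything left unmarked can be removed. It is an illustration of the idea, not the literal algorithm.

  // Sketch of dead-computation elimination: keep only vertices from which some
  // result node is reachable. Illustration only, not the literal Algorithm 1.
  #include <vector>

  struct GVertex { bool is_data; bool is_virtual; std::vector<int> succ; std::vector<int> pred; };

  static std::vector<bool> mark_live(const std::vector<GVertex> &g) {
    std::vector<bool> live(g.size(), false);
    std::vector<int> worklist;
    // Seed with the application results D_out: non-virtual data nodes without successors.
    for (int v = 0; v < (int)g.size(); ++v)
      if (g[v].is_data && !g[v].is_virtual && g[v].succ.empty())
        worklist.push_back(v);
    // Reverse traversal over predecessor edges marks everything the results depend on.
    while (!worklist.empty()) {
      int v = worklist.back(); worklist.pop_back();
      if (live[v]) continue;
      live[v] = true;
      for (int p : g[v].pred) worklist.push_back(p);
    }
    return live;  // vertices with live[v] == false do not affect any result and can be removed
  }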

Evaluation and Results
We present results for a Xilinx Zynq ZC706 FPGA using Xilinx Vivado HLS 2019.1 and an Nvidia GeForce GTX 680 with CUDA driver 10.0. We evaluate the following applications: As image smoothers, we consider a Gaussian blur (Gauss) and a Laplacian filter with a 5 × 5 and a 3 × 3 local node, respectively. The filter chain (FChain) is an image pre-processing algorithm consisting of three convolution (local) nodes. SobelX determines the horizontal derivative of an input image using the OpenVX vxSobel function. The edge detector in Figure 2 (EdgFig2) finds horizontal edges in an input image, while Sobel computes both horizontal and vertical edges using three CV nodes. The Unsharp filter sharpens the edges of an input image using one Gauss node and three point operator nodes. Both Harris and Tomasi detect corners of a given image using 13 (4 local + 9 point) and 14 (4 local + 10 point) CV nodes, respectively. Laplacian uses OpenVX' custom convolution API, and EdgFig2 contains redundant kernels. These applications are representative for showing the optimization techniques discussed in this paper. The performance of a simple CV application (e.g., Gauss) solely depends on the quality of code generation, while graph-based optimizations can further improve the performance of more complex applications (e.g., Tomasi).

Fig. 6: Throughput for different versions of the same corner detection application (consisting of 9 kernels) on the Nvidia GTX 680 (higher is better). The blue bars denote an increasing number of CV functions implemented as user-defined nodes using C++. In OpenVX, these user-defined functions have to be executed on the host CPU, which leads to a performance degradation, whereas HipaccVX accelerates all user-defined nodes on the GPU.

Acceleration of User-Defined Nodes
User-defined nodes can be accelerated on a target platform (e.g., a GPU accelerator) when they are expressed with HipaccVX' abstractions (see Section 5.1). A plain C++ implementation of these custom nodes, in contrast, results in executing them on the host CPU. This is illustrated in Figure 6 for a corner detection algorithm that consists of nine kernels. The CPU code for these custom nodes is also generated using Hipacc. As can be seen in Figure 6, HipaccVX provides the same performance regardless of the number of user-defined nodes, whereas using the OpenVX API decreases the throughput severely since each user-defined node has to be executed on the host CPU.

System-Level Optimizations based on OpenVX Graph
Reduction of Data Transfers

HipaccVX eliminates the data transfers between the execution of subsequent functions on a target accelerator device, as explained in Section 5.2.1. This optimization is disabled in the naive implementations used for comparison. The improvements for the two applications are shown in Figure 7. HipaccVX' throughput optimizations reach a speedup of 13.5.

Evaluation of the Performance
In Figure 10, we compare HipaccVX with VisionWorks (v1.6), an optimized commercial implementation of OpenVX provided by Nvidia. HipaccVX, like typical library implementations, exploits the graph-based OpenVX API to apply system-level optimizations [19], such as the reduction of data transfers (see Section 5.2). Additionally, HipaccVX generates code that is specific to the target GPU architecture and applies optimizations such as constant propagation, thread coarsening, and Multiple Program, Multiple Data (MPMD) execution [8]. As shown in Figure 10, HipaccVX can generate implementations that provide higher throughput than VisionWorks. Here, the speedups for applications that are composed of multiple kernels (Harris, Tomasi, Sobel, Unsharp) are higher than for those consisting of only one OpenVX CV function (Gauss and Laplacian). This performance boost is, to a large extent, due to the locality optimization achieved by fusing consecutive kernels at the compiler level [16]. This requires code rewriting and a resource analysis of the target GPU architecture.

There was no publicly available FPGA implementation of OpenVX at the time this paper was written. Therefore, in Table 3, we compare HipaccVX with Halide-HLS [15], which is a state-of-the-art DSL targeting Xilinx FPGAs. As can be seen, HipaccVX uses fewer resources and achieves a higher throughput for the benchmark applications. HipaccVX transforms a given OpenVX application into a streaming pipeline by replacing virtual images with FIFO semantics. Thereby, it uses an internal representation in Static Single Assignment (SSA) form. Furthermore, it replicates the innermost kernel to achieve higher parallelism for a given factor v. For practical purposes, we present results only for Xilinx technology. Prior work [13,21] shows that Hipacc can achieve a performance similar to handwritten examples provided by Intel for image processing. This also indicates that the memory abstractions given in Table 2 are suitable for generating optimized code for HLS tools.

Figure 11 compares the throughputs that were achieved from the same OpenVX application code for different accelerators. Here, we generated OpenCL, CUDA, and Vivado HLS (C++) code to implement a given application on an Intel i7-4790 CPU, an Nvidia GTX 680 GPU, and a Xilinx Zynq FPGA, respectively. GPUs and FPGAs can exploit data-level parallelism by processing a significantly higher number of operations in parallel compared to CPUs. This makes them very suitable for computer vision applications. Modern GPUs operate at a higher clock frequency than existing FPGAs; therefore, they can provide higher throughput for abundantly parallel applications. This is the case for Gauss and Unsharp. FPGAs, in contrast, can exploit temporal locality by pipelining and eliminate unnecessary data transfers to global memory between consecutive kernels. Therefore, all the FPGA implementations in Figure 11 achieve a similar throughput.

Conclusion
In this paper, we presented a set of computational abstractions that can be used for expressing OpenVX' CV functions as well as user-defined kernels. This enables the execution of user-defined nodes on a target accelerator, just like the CV functions, as well as additional optimizations that improve the performance. We presented HipaccVX, an implementation of OpenVX that uses the proposed abstractions to generate code for GPUs, CPUs, and FPGAs.