1 Introduction

Since public adoption of mobile platforms is pervasive and their unused performance potential is so high, it is important to consider compatibility with these platforms in designing new parallel programming models and tools. Most modern mobile devices run Android or iOS Operating Systems (OS), the former being much more widespread. The development of native Android applications is done mainly in the Java and Kotlin programming languages, compiled into Java bytecode and seamlessly interoperating and running on top of the same runtime. Using this type of managed high-level programming language simplifies the development of interactive applications at the cost of adding overhead that hinders execution performance. The potential of providing parallel and accelerated execution capabilities to Java-based applications has been explored for several years, evidenced by all the work around Java Grande [1], or using Java for large-scale parallel applications. However, we believe that even though the programmability problem of creating these types of Java applications has not been completely solved, the demands of the mobile sector make this issue more relevant than ever. Typically, managed programming languages, such as Java, tend to run slower than native ones, like C/C++, due to the overhead of their runtime systems and higher levels of abstraction from the hardware and OS. However, some of these shortcomings have improved over time due to improvements in virtual machine optimizations and Just-in-Time (JIT) compilation techniques [2].

Nevertheless, Java execution on accelerators still requires specialized Java Virtual Machines (JVMs) or libraries or code translation tools that integrate Java bytecode execution and native libraries with low-level access to accelerators. As a consequence, the programmability effort is still very high. Alternatives as language bindings like JCUDA [3] or JOCL [4] are geared toward experts and do not significantly reduce the programming effort. There are multiple approaches to address the issue of accelerating Java applications trying to reduce programming complexity. Projects like Sumatra [5] and TornadoVM [6] propose implementation features or extensions to JVMs and standard libraries that require a reduced effort of developers. Although working at the JVM level has the benefit of being able to decide at runtime, through a JIT compilation process, the processor to target for each task or a dynamic task migration is not always possible to rely on the use of custom JMVs, such as in Android devices.

Tools such as Paralldroid [7], Aparapi [8], Rootbeer [9], or ParallelME [10], on the other hand, can work on standard JVMs, and they can translate Java code or compiled bytecode into native or accelerated implementations. Each alternative defines its parallel programming model on top of Java, adding certain restrictions on what Java code is supported for parallel or accelerated execution. This makes it difficult to port parallel code written for one of these environments to another, and most of them have to solve a similar set of challenges, mainly relating to memory management, on top of making their particular parallel programming model work efficiently. Despite these efforts, none of the existing approaches has been widely adopted by the developer or scientific community, so there is no standard system that implements hardware acceleration of Java code yet.

In the Fancier framework [11], we proposed a common Java API that simplifies the automatic production of efficient native code for acceleration, which can be used as a platform to build independent automatic acceleration tools. This API, built on top of the OpenCL 1.1 standard libraries, includes fixed-size vector data types, a math library, and multiple containers for primitive data types and images, and it is designed to emulate the value semantics used in native languages, allowing a simpler mapping of Java code to C/C++ or OpenCL for accelerators. Fancier provides containers with transparent zero-copy memory support on unified memory systems while providing direct access to the memory from the Java, native, and OpenCL contexts. This is particularly relevant on mobile SoCs where memory copy overhead can easily become a performance bottleneck. The approach does not require modification to the Java compiler or JVM; it only depends on OpenCL.

In this paper, we present the FancyJCL framework,Footnote 1 and we propose a high abstraction layer that hides implementation details when developing parallel Java programs for mobile devices. Genericity, flexibility, and efficiency are basic issues in the design strategy. Parallelism is provided transparently through a sequential interface so that sequential users may access the parallel system. Our proposal is close to Aparapi [8]. However, they operate at the bytecode layer which may generate dependences on the VMs and could penalize the efficiency. FancyJCL is based on the Fancier framework and provides a set of “sequential” classes, methods, and data structures that encapsulate access to the parallel context, providing easy development and fast prototyping. No previous knowledge is needed about JNI and OpenCL since all the glue code for JNI and OpenCL is wrapped into FancyJCL. A very simple, apparently sequential encoding produces parallel and efficient OpenCL code. This ease of programming increases development productivity, improves the delivery of new parallel applications due to the rapid development time, and contributes to the efficient exploitation of new emerging architectures. FancyJCL also benefits from some of the Fancier advantages since OpenCL has been chosen as the hardware acceleration backend, support for general-purpose computations and widespread adoption among all types of architectures, from low-power SoC to state-of-the-art high-performance computing accelerated distributed systems, is provided. The efficient memory management avoids unnecessary copies among the different memory spaces and increases the performance. The decoupled design of the FancyJCL classes is introduced without loss of efficiency. The benchmarks of FancyJCL-generated code show good speedups and ease of use on different mobile devices tested over a wide range of representative image processing kernels.

The rest of the paper is structured as follows: Section 2 introduces the process by which Android applications are compiled and executed. Section 3 presents a summary of the Fancier framework and its advantages. Section 4 describes in detail the building of FancyJCL on top of Fancier. Section 5 includes the performance analysis developed to validate our approach, and in Sect. 6, we give our conclusions and future related lines of work.

2 Application execution on android

Android is a Linux-based operating system mainly designed for mobile devices, such as smartphones and tablets. Several development models are supported to manage the resources allocated to each application. Each model has different features that have to be used in different parts of the application to get the best compromise between performance and development cost. These development models and their implementation trade-offs are outlined in this section.

Managed execution Android native applications are developed mainly using the Java and Kotlin managed programming languages, providing simple development environments with reasonable performance. Both are compiled into Java Bytecode, allowing transparent interoperability between those languages. However, their higher level of abstraction comes with a performance cost over native programming languages. To mitigate this, since Android 5.0, Java Bytecode is transformed and optimized in various stages until it is compiled ahead of time (AOT) into binaries that run natively with help from Android Runtime (ART) [12], as shown in Fig. 1. Recent Android releases use a hybrid just-in-time (JIT) and AOT profile-guided optimization process [13]. Its performance, in many cases, is still far from that of the Native code.

Fig. 1
figure 1

Compilation and execution of an Android application

Native execution Java applications can interface with native code by using the Java Native Interface (JNI) [14], also supported by Android [15]. The native implementation of the method require some JNI API calls as a bridge between Java and C to the actual work to be done by it, adding some development complexity and execution overhead. As an advantage, it allows to link and use Android-supported native system libraries (OpenGL, Vulcan...) or any other native independent library, and existing native code can be integrated instead of re-implementing it in Java or Kotlin. Compute or memory-intensive codes can be accelerated.

Accelerated execution The acceleration of regular Android applications is possible through the use of the officially supported Renderscript language [16]. Conversely, OpenCL is a multi-platform standard for general-purpose accelerated execution that has existed for longer than RenderScript, and it has enjoyed continued support by most of the main SoC vendors at the core of modern smartphones, such as Arm or Qualcomm. Although this alternative has not received official support from Android, its widespread adoption, higher performance potential, and finer-grained control make it an interesting alternative. Additionally, the officially supported framework for Android acceleration from version 12 is Vulkan instead of RenderScript, for which work is being done to allow OpenCL kernels to run [17]. OpenCL is our best option as this work is focused on general-purpose cross-platform acceleration.

The main disadvantage of OpenCL or Vulkan compared to Renderscript is that they demand a higher development effort to be integrated into a managed application written in Java or Kotlin and a steeper learning curve. This is why approaches like those presented in this paper are still relevant and necessary.

3 The Fancier framework

In this section, we describe the Fancier framework [11], a multi-platform, modular, and extensible programming interface to accelerate Java applications without requiring any special features from the JVM they are running in. FancyJCL is supported on top of the Fancier execution model.

3.1 Overview

Fancier uses OpenCL as the high-performance multi-platform backend for accelerated execution. It targets Linux and Android platforms and defines a Java and Native API modelled after the OpenCL C programming language for accelerators. Fancier provides several utilities that simplify the integration of C/C++ and OpenCL execution to a Java application efficiently. Two main features from Fancier have been decisive in building FancyJCL:

  • Fancier defines a Java and C/C++ subset of features and functions where the mismatch between the reference semantics of Java and the value semantics of C/C++ and OpenCL is addressed. Fancier forces value semantics by returning a new result object when operating. The pressure on the Java memory management is reduced by pre-allocating objects for all intermediate results. The Fancier unification of the Java, Native, and OpenCL languages greatly reduces development costs and has no performance penalty. Fancier facilitates the translation of a standardized Java subset to C/C++ and OpenCL for accelerated execution into a Java application and their runtime integration by creating a unified interface and library.

  • The information flow between Java, C/C++ and OpenCL is difficult and requires manual management. There are multiple ways of performing this, but most incur in performance penalties that may go unnoticed. Fancier implements optimal strategies for container data types that work transparently and significantly reduces the development cost. Unified memory systems allow sharing memory buffers between a host processor and accelerators, which can provide great performance gains. However, the optimal management of this type of memory becomes difficult when buffers must be accessed from managed, native, and accelerated contexts. The Fancier memory management strategy features efficient and transparent zero-copy read and write access from all contexts.

Accelerated Java applications using Fancier for OpenCL execution are able to take full advantage of all the features of the OpenCL 1.1 standard, with the advantage that passing data between the Java, C/C++ and OpenCL layers is much simpler, and the control flow between Java and C/C++ is more seamless. As a result, Fancier enables the design and implementation of parallel accelerated programming models on top of Java, and automated transformation tools can easily generate high-performance Fancier Native or OpenCL C code.

3.2 Fancier: APIs and memory management

The Fancier framework provides three layers: the Fancier Java API, the Fancier Native API, and the Fancier OpenCL API. The three layers implement all the functionality to allow an algorithm to be expressed similarly in Java, C++ or OpenCL so its translation from and to any of them is trivial. Arrays, vectors of sizes 2, 3, 4 and 8 and basic types Byte, Short, Int, Long, Float, and Double are supported in the three layers. All Fancier containers are allocated through OpenCL API calls. The functions for managing buffers are much simpler than manually writing JNI and OpenCL API calls. The Fancier Native API allows to obtain and compile OpenCL C kernels at runtime that add math functions supported by Fancier Java and Fancier Native, which are not included in the OpenCL C standard library. In the case of DirectBuffer Java objects, Fancier uses the native address to provide zero-copy operations.

Data transfers between contexts have great implications for performance. An OpenCL-accelerated Java application deals with three independent memory spaces: the Java-managed heap, the OS-managed native heap, and the OpenCL device memory. Fancier handles the memory buffers efficiently, taking advantage of unified memory and avoiding unnecessary copies. The Fancier functions also manage memory synchronizations. Moreover, Fancier includes methods to get the ByteBuffers and copies of their content when it is assumed that the copies will perform more efficiently than the direct use of the ByteBuffer. In summary, Fancier enables transparent use of the unified memory architecture, avoids unnecessary memory movements, and allows reading and writing data from the three layers. Java, Native, and OpenCL.

4 The FancyJCL framework

This section presents the FancyJCL framework, a high-level tool oriented to Java developers willing to accelerate their applications on mobile devices. Core ideas as ease of use, genericity, flexibility, and efficiency are basic goals of the design strategy in FancyJCL.

4.1 Overview

FancyJCL provides a high-level sequential abstraction layer so that users without any expertise in native or parallel programming can use it in a simple way. FancyJCL collects a package of sequential classes, methods, and data structures that encapsulate access to native and parallel contexts, allowing easy development and fast prototyping. No prior knowledge is needed about parallelism, JNI or OpenCL since all the glue code for JNI and OpenCL is wrapped into FancyJCL. A very simple, apparently sequential encoding produces parallel efficient OpenCL code. Since FancyJCL is based on Fancier, it is also multi-platform, extensible, and does not require any special features from the JVM they are running in, so the generated parallel code can be executed into a mobile device but also on server or desktop computers.

Genericity appears naturally since OpenCL is the hardware acceleration backend, with support for general-purpose computations, fine-grained hardware control, and widespread adoption among all types of architectures. The object-oriented design makes FancyJCL a flexible tool that can be easily extended to new scenarios. For example, optimization techniques such as vectorization or tiling could be introduced in a transparent manner. The Fancier parallel structures are decoupled from the FancyJCL classes so no loss of efficiency is introduced by the approach. The zero-copy memory support provided by Fancier avoids unnecessary copies among the different memory spaces, increasing the performance. Improved application quality, increased use of parallel architectures by non-expert users, rapid inclusion of emerging technology benefits into their systems, and dynamic and transparent optimization enhancements are some of the benefits that can be reached.

4.2 Design

Fig. 2
figure 2

The FancyJCL architecture

Figure 2 shows a general overview of the FancyJCL architecture. FancyJCL builds a layer on top of Fancier that encapsulates the Fancier initialization, the glue code necessary for the JNI and OpenCL generation, the mapping to Fancier data structures and objects, the generation of the functional Kernel including inputs and outputs, and also the full OpenCL program and the synchronization among the different contexts. Some of these actions are supported by Fancier, and the FancyJCL interface wraps up and adapts the user calls but, in some others, more elaborated extra code is added internally in FancyJCL to generate new code or before a call to the Fancier framework is performed.

Figure 3 goes deeper in the architecture and shows a set of representative FancyJCL classes, objects and methods. The FancyJCL environment is initialized by calling the initialize() method of the FancyJCLManager static class. This initializes the FancyJCL and Fancier contexts at the same time and creates the JNI, the OpenCL context and the OpenCL queues. The list of namespace and attributes that will be passed through the different contexts is created.

Fig. 3
figure 3

FancyJCL use case diagram

The FancyJCL Stage class is designed to create execution units that manage information about an algorithm, its parameters, and the configuration run, i.e., the range of the index variables and how many GPU threads to use, among others. Objects declared in the Stage class are intended to run a parallel process on input data objects to produce output data objects. The setInputs and setOutputs methods of a Stage object will manage the corresponding java objects. FancyJCL converts them into Fancier objects; if the declared object belongs to the DirectBuffer class (or a derived from it), only pointer reassignments to the Fancier objects will be needed, and there will not be data copies among the different contexts. This has an important impact on overall performance. Otherwise, a copy is performed into a Fancier object, and from that moment on, no additional copies among the different contexts are needed. The parallel procedure is assumed to be a simplified kernel, defining the operation to be applied to the input and output data, passed through the setKernelSource method. The predefined variables \(d_0,\ldots , d_k,\ldots , d_{n-1}\) manage the index on dimension \(k-th\), which eases and encapsulates the call to the get_global_id(k) OpenCL kernel function. Before an execution is launched the length and number of threads to use per dimension must be fixed (setRunConfiguration()). This method calls to the OpenCL functions to create the OpenCL kernel. The Stage class provides methods to run the kernel, including synchronous blocking and asynchronous executions of several kernels, and the corresponding methods to ensure synchronization among the different contexts. An attribute list created in FancyJCLManager keeps a list of references to the Stage which addresses the attributes. This allows the output to be the input of the same kernel or a different kernel.

Fig. 4
figure 4

Computing in FancyJCL a weighted average of elements of two buffers and storing it in a third buffer

4.3 FancyJCL example

An example illustrating the use of FancyJCL appears in Figs. 3 and 4, where a simple code is shown that computes the average of two buffers. The flow control follows the steps previously described. No native code is included by the user into their application nor glue code. Every declaration, definition, and method call adopt the natural coding style in Java. The user only needs to know two FancyJCL classes, FancyJCLManager and Stage and some of their methods. The only difference is the notion of a simplified kernel. The user must understand how to express the computation of an input data item to produce the output. This is straightforward if the user is familiar with the CUDA or OpenCL streaming programming model, but, in any case, it involves a quite simple semantic and does not require too much effort from the user side. Some attention must also be paid by the user to the synchronization of the results. This piece of program (about 20 lines) encapsulates a code of about 56 lines.

Another important element that usually goes unnoticed is the effort involved in the cross-compilation for this case, where java and native code are mixed. Since FancyJCL is delivered in a precompiled package, the extra compilation stage and CMake management are removed from the development cycle. Compiling is faster, and the knowledge required to do it is less.

5 Evaluation

Three Android devices have been used to evaluate the performance improvements of the FancyJCL API over Java reference implementations (Table 1). Their results are representative of a range of recent SoCs present in modern Smartphones. Two of the devices are off-the-shelf products, and the other is an HDK (mobile Hardware Development Kit). FancyJCL takes advantage of two main key aspects: the existence of a GPU that is programmable using OpenCL and the unified memory model that allows for implementing zero-copy strategies when executing code in CPU and GPU contexts.

Table 1 Hardware platforms

Several image filter kernels have been developed to evaluate the performance of the FancyJCL library. The kernels have been programmed naively, meaning that no parallel strategy has been applied to the FancyJCL implementation, such as loop unrolling, vectorization, or sharing memory in a workgroup (local memory). The only strategy followed is to establish a large global workgroup size. The image processing kernels have been implemented using a 32-bit RGBA pixel format. The collection of kernels implemented is the following:

  • Bilateral An edge-preserving smoothing filter is a stencil code; each neighbor is weighted according to its color and distance to the center pixel.

  • Median A median filter is a stencil code that evaluates the neighbors of each pixel within a given radius and applies the median intensity of these input pixels to the output. Our implementation only uses the red pixel channel, producing a gray scale output.

  • Posterize A filter that applies pre-selected colors to given ranges of input pixel intensity is a pixel-wise kernel that is called multiple times, one per range. We use five ranges in our testing.

  • Levels A pixel-wise kernel applies a saturation and contrast levels change to an image.

  • Fisheye A distortion kernel applies a fisheye lens effect to an image. The coordinates of each pixel are transformed using several math functions, and the resulting output pixel is calculated through a bilinear interpolation of these transformed coordinates.

  • Contrast A pixel-wise kernel applies a parameterized contrast enhancement of an image.

  • Convolution 5×5; 3×3 Convolution kernels using a 5×5 or 3×3 mask implemented without loops.

  • GrayScale: A pixel-wise kernel converts a color image to gray scale.

  • GaussianBlur: A smoothing filter based on the Gaussian function is implemented as two successive passes where one applies the filter considering the horizontal neighbors and the other considers only the vertical neighbors of each pixel. This reduces computation and provides the same result as a single-kernel variant because it is separable.

Each of these kernels has been implemented in Java and FancyJCL, the Java reference implementation uses Java arrays (byte []) to fetch, process, and write pixel data and the FancyJCL is a parallel OpenCL implementation using DirectBuffer Java Buffers and taking advantage of the FancyJCL library.

Due to the highly dynamic performance characteristics of these SoCs and because of the power and thermal constraints they have, the testing algorithms must be repeated enough times so that the standard deviation is low enough for the average to be considered a valid measurement. Also, in the multiple-context application such as the use case of executing in the GPU, the benchmarking process must be aware of when to synchronize with the CPU. If DirectBuffer Objects are used in an architecture that implements unified memory, synchronization of the input and output memory buffers is not necessary. To achieve fairness, the measurement process in GPU implementations is the following:

  • Start measuring time

  • Enqueue execution of the algorithm on GPU N times

  • Wait until the OpenCL queue finishes

  • End time_measurement

  • Return time_measurement / N

Fig. 5
figure 5

Speedups of the Java averaged execution time by the FancyJCL equivalent implementation of 5 groups of kernels with 8 resolutions. One subplot per SoC

Fig. 6
figure 6

Normalized average of time measurements

The class FancyJCL.Benchmark allows for executing a benchmark a number of times and includes a synchronization prologue that will only be executed once but will be included in the measured time. The benchmarking application was compiled in release mode, producing native binaries in arm64-v8a mode (64-bit). The implementation of the image-processing kernels was evaluated over the following set of image inputs of different sizes: VGA (\(640\times 480\)); XGA (\(1024\times 768\)); HD1 (\(1280\times 720\)); HD2 (\(1366\times 768\)); HD+ (\(1600\times 900\)); FHD (\(1920\times 1080\)); QHD (\(2560\times 1440\)); UHD (\(3840\times 2160\)). The number of executions was variable for each experiment. A fixed execution time of at least 30 s was set as experiment_time. The number of executions N is therefore the amount needed to take the experiment_time. In all cases, N was greater than 10 and the measured standard deviation was lower than 0.01.

Table 2 shows the average execution times and speedups obtained from the computational experience developed. Figure 5 shows the speedups for a selected group of kernels on each SoC. For almost every experiment, the average execution time of the FancyJCL version was shorter than that of its Java reference implementation. The average speedup is of  23. This means that by using this library, not only is a significant increase in speed gained, but also the complexity of the code is greatly reduced by hiding all the OpenCL library calls and the native C++ code. The worst case is the Mean filter, since it fetches a large number of pixels to be sorted, and a local memory strategy should be used to achieve a significant speedup. The best case is the Bilateral filter, where most of the computational cost lies on floating point mathematical operations such as multiplication and power because their Adreno implementations are much faster.

The speedups achieved in the SoC SD845 are about twice those of the SD865 or SD888. This is explained since the CPU and the GPU did not evolve at the same rate between SD845 and SD865, while they did between SD865 and SD888. The difference in performance between the CPU and GPU is greater for the SD845 than for the SD865 or SD888. This can be better appreciated using our data in Fig. 6, where the normalized execution times are compared for the three SoC and for CPU and GPU. The GPU evolution behaves almost linearly, while the CPU evolution behaves exponentially.

6 Conclusion

We have presented FancyJCL, a framework that provides easy development and fast prototyping to generate parallel OpenCL codes in mobile devices. A set of Java classes is provided to the developer to fill code gaps in the apparently sequential methods of these classes. The set of classes made available to the user is quite reduced, so the learning curve is very low and the parallelism is offered in a transparent manner. Genericity and efficiency are guaranteed through the internal use of the Fancier Library. Fancier constitutes a promising backend for parallel programing models built on top. It is not constrained by any particular JVM and efficiently solves the problem of data movement among the various memory spaces involved in parallel computations in a mobile device. The ease of use and efficiency of FancyJCL are validated by benchmarking a wide range of representative image processing kernels on three different SoC architectures. The flexibility of the framework is inherent to the design of FancyJCL; for example, natural extensions to FancyJCL could provide support for Image2D, vectorization, or auto-tuning just by extending the RunConfiguration method. More elaborated extensions would allow us to generate OpenGL or Vulcan code, which is supported by a larger number of vendors, or generate code in a different language as JavaScript to cover a greater community of mobile developers.

Table 2 Mean of execution times in milliseconds