1 Introduction

Machine vision offers potential to revolutionise how we perform activities in security, manufacturing production, autonomous vehicles and robotics applications by utilising advances in machine and deep learning algorithms. The implementation of such systems, however, presents considerable performance challenges using current technology. A key decision is to determine the amount of embedded processing that needs to be carried out at source, as this can reduce the communication bandwidth used to transmit the necessary data to the centralised controller. In applications with limited bandwidth, this ability could be critical in achieving the real-time latency needs. For complex machine vision scenarios, this will involve co-execution of multiple dynamically varying workloads in the edge devices. For example, in video security, the system may need to perform object detection, people identification and facial recognition at various times. Moreover, these algorithms may run at different frame rates and may require different accuracy and operational times. Thus, the challenge is to best utilise the resources of the edge device to execute the specific workloads at any specific time.

One approach is to ‘over-resource’ the system design and turn functionality on and off as needed. This is not practical, however, as it does not meet the stringent computational power and memory constraints of emerging mobile vision systems. It is critical to ensure a maximum Quality of Service (QoS) when processing varying workloads, in order to maintain the algorithm’s maximum operating accuracy and frame processing rate. Offline accuracy tuning techniques can statically trade off accuracy against performance, but they do not allow the frame rate to be dynamically adapted [1, 2]. This means that either the algorithm’s accuracy is conservatively designed to allow the maximum frame processing rate (thus providing suboptimal accuracy even at lower throughput requirements) or the maximum achievable frame rate is reduced for higher accuracy. In addition to the workload, the dynamic system environment, such as the memory or energy available to execute a task, can also vary.

This work targets optimisation of QoS for dynamic machine vision workloads using transprecision techniques. We use the term transprecision to include all methods, i.e. arithmetic precision and algorithmic variations, that can be employed dynamically to change the operating precision of the outcome. For NN-based machine vision applications described here, the precision of outcome is defined as the accuracy of correctly classified inputs.

The paper proposes a design approach that gives the best trade-off at runtime between varying QoS parameters under the operating constraints. It is an extension of our work using transprecision computing to trade off the accuracy of computations or algorithms at runtime against resource usage [3]. To evaluate performance for machine vision applications on resource-constrained devices, we use a relevant neural network (NN) architecture search that statically generates a range of implementations that target a resource-precision trade-off [4].

The original contributions include:

  • A system flow and associated runtime environment that uses the generated design space of [4] to incorporate transprecision to suit dynamically varying requirements.

  • The detection of the optimal parameters for the required QoS under specific memory and energy constraints by profiling 3000 NN configurations on a GPU of an edge device.

  • Creation of various optimisation objectives, e.g. minimisation of execution time, power/energy consumption or maximisation of accuracy, given the accuracy constraint.

  • Application of Integer Linear Programming (ILP) at runtime to find the optimal combination of configurations for a specific NN, meeting varying constraints, under an allowable energy/memory consumption budget. It monitors all of the specified constraints and tracks the applications scheduled for execution, thus allowing automatic detection of, and transparent switching to, the optimal configuration to maximise the overall QoS.

The additional contributions covered in this paper include:

  • Extension of the initial study by exploiting available CPUs as an alternative co-scheduling resource. It provides detailed profiling of the same parameters on the CPU, while highlighting the key similarities and differences between CPU and GPU computations, as well as analysis of the suitability of the transprecision technique for CPU-based processing.

  • Optimisation of the developed runtime system to offload some of the frames during the transitioning of NN configurations to the CPU, allowing the overall aim of enhancing the QoS to be achieved.

  • Elaboration on the dynamic behaviour of the system and the suitability of the proposed pre-emptive scheduling for NN inference workloads.

The paper is organised as follows. Section 2 provides a rewritten and expanded version of the background of the original contribution. Section 3 formulates the problem and outlines details of the framework. Section 4 describes the experimental setup and analyses results including the latest CPU-based work. The conclusions are given in Sect. 5.

2 Background

Neural networks bring improved functionality and capability to a range of multimedia applications such as autonomous driving and robotics. These applications require increasingly higher computation and memory resources, which may prohibit their use on resource-constrained devices [5, 6]. Such a challenge draws attention to efforts that trade off accuracy for speed and memory usage in various neural network architectures [7]. Previous studies have demonstrated that sufficient model accuracy can be achieved by adopting lower arithmetic precision for certain tensors [8]. In some cases, the arithmetic precision of parts of the application has been reduced to 8 bits, 4 bits or even 2 bits [9,10,11]. Other work has approximated certain parts of the network and evaluated them on various devices [12, 13]. This suggests that reduced precision can provide large energy gains at the cost of minor losses in model accuracy. For instance, up to a 50\(\times\) energy saving was achieved in a k-means clustering algorithm by allowing a classification accuracy loss of 5% [12]. Similarly, a neural network approximation approach was able to accelerate a kinematics application by up to 26\(\times\), while incurring an accuracy loss of less than 5% [13].

This has motivated the design of low precision processing architectures for NNs, some of which are being commercially exploited such as in Google’s recent TPUs. However, in most existing work, scaling parameters such as precision are chosen statically under specific resource constraints, neglecting the fact that most constraints, and thus the optimal parameters, may change dynamically during execution; this is discussed in more detail below.

2.1 Neural Architecture Search under Resource Constraints

Neural Architecture Search (NAS) [14] is a methodology that seeks the optimal configuration of a neural network in terms of depth, width and resolution for a given training dataset. It uses a hyperparameter search method gleaned from the training dataset in order to identify the optimal sets that produce the best accuracy for the given dataset. In cases where the compute resource budget is limited, such as in mobile devices, many NAS variants that exploit the trade-off between accuracy and latency have been proposed; these maximise resource efficiency for a given compute resource budget [15,16,17,18,19]. Most NAS approaches use a uniformly large architecture search space to identify the less computationally complex models. Unfortunately, this can be time consuming and can lead towards the creation of larger models.

To overcome such issues, the work in [4] proposes a narrow NAS method that shrinks the search space in different regions and aggregates it to construct a complete search. It utilises a configuration file to encompass the rules used for generating the narrow search spaces using candidate samples; using the rules, these are constructed in order to satisfy the constraints for a new model such as the model size, the number of floating-point instructions, etc. In all, 3000 reduced-accuracy models were distributed across a range of sizes for Densenet [20], MobilenetV2 [21], Googlenet [22], PNASnet [23] and Resnext [24]. The resulting smooth Pareto-Optimal (PO) front was created for a different accuracy-latency trade-off, achieving a 7% higher object detection accuracy using the same memory constraints when compared to the original models. It then used a NAS variant to develop the NN models customised for a static compute resource.

Compared to the previous platform-aware NAS variants, the work presented here is able to determine the optimised NNs according to runtime varying constraints, i.e., the NAS variant for runtime varying compute resource.

2.2 Runtime Adaptation of NNs

The above research has showcased the feasible accuracy-speed trade-offs; in such static selection approaches, however, the algorithm is designed conservatively, either to provide the maximum required frame rate at lower accuracy or to target the maximum accuracy at a lower frame rate. To showcase this aspect, Table 1 depicts the achievable frame rates at 100% and 90% of the maximum accuracy on the same hardware platform for the NN models considered in this paper; this will be discussed in more detail later. The data in the table highlights the limitation of a statically designed approach. It shows that if the static objective is set as frame throughput, the accuracy drops by 10% in all cases, including when a lower throughput is required. Similarly, a static objective to maximise accuracy can lower the maximum achievable throughput by \(5.4\times\). This motivates our work on dynamic exploration of suitable scaling parameters of NNs.

Table 1 Accuracy and frame rate trade-off for static approaches

There has been a considerable body of work on exploring runtime adaptation of NNs according to given compute resource constraints [1, 25,26,27,28,29]. The single NestDNN NN model [1] switches between multiple capacities of the NN during runtime according to accuracy and inference latency requirements. During training, unimportant features from the original model are pruned to generate the smallest possible or “seed” model. Similarly, Yu et al. [29] propose the Slimmable Neural Network, in which the filter parameters are shared from a smaller capacity model to increase the capacity of the NN. Another study [25] proposes the use of a runtime decision mechanism to switch between multiple NNs dynamically, according to video content and computational latency, in order to improve the real-time object detection accuracy.

The authors in [30] propose a method that switches between multiple NNs during runtime according to dynamic processing requirements. For example, when the inference latency of a NN increases and violates a latency requirement due to a newly assigned throughput requirement, a runtime decision maker downgrades the current NN during runtime to meet the latency constraint. Since the required depth of a NN depends on the problem complexity, an “early exiting” technique [26, 27] was proposed to allow the NN to classify an object as early as possible, by having multiple exit classifier points in a single NN. However, these approaches struggle to ensure predictable performance as the throughput can be input dependent. Furthermore, the single objective adopted during the training process, i.e. latency vs accuracy, does not allow optimisation of any other objectives at runtime, such as memory or energy consumption. For example, the study in [31] shows that weight or computation reduction may not linearly translate into energy savings, and NNs should be optimised at the training stage to guarantee considerable energy savings. Finally, the runtime pruning of a single network can create irregularity and sparsity in the network, leading to a low computing efficiency [32].

Compared to the previous methods, our approach leverages resource-accuracy trade-off more effectively to seek optimal NNs suitable for dynamically varying resource constraints by utilising multiple, smaller, independent NNs.

2.3 Dynamic Precision Scaling

Several dynamic precision scaling approaches were proposed for signal processing applications to save energy consumption [33,34,35]. Nguyen et al. [35] proposed a dynamic precision scaling method for Wide-band Code Division Multiple Access receiver applications to minimise the energy consumption under a Signal to Noise Ratio (SNR) constraint of the input signal. The idea behind it was to adapt the computational precision to the level of SNR on the input signal. This method seeks the least sufficient word length for computation according to the allowable quantisation noise power, given a dynamically varying SNR constraint of the input signal. In a case study of a SNR varying from 0 dB to 10 dB, this approach can save up to 25% of energy consumption. Lee and Gerstlauer [34] proposed a dynamic precision scaling approach to seek power savings for Orthogonal Frequency Division Multiplexing (OFDM) applications, that dynamically adapts word lengths to a time-varying environment such as different wireless channel conditions. This method receives the input arguments including SNR of the input signal and a target bit error rate of the output signal, and finds the optimal set of word lengths for all variables in an OFDM system by leveraging statistical analysis of quantisation noise coupled with additive white Gaussian noise models. The experimental results showed that the optimal word lengths were found within 1 to 2 seconds using an Intel i7 core, implying this approach would be feasible in practice for dynamic precision scaling. Cladera et al. [33] proposed an energy-efficient dynamic precision scaling approach for OFDM receivers on FPGAs that utilises prior direct energy consumption measurement information according to different bit-width deployment. Unlike [34], given a target bit error rate and a SNR of the input signal, this approach selects the least sufficient bit width satisfying the target bit error rate by utilising prior energy consumption measurement information.

Dynamic precision approaches were also proposed for scientific computation as well as other signal processing applications to improve energy efficiency [36,37,38,39]. Lee et al. [38] proposed a dynamic precision approach for linear solvers to improve energy efficiency. The usage of software emulated high precision arithmetic was minimised in solving linear systems with double precision accuracy for forward error to save energy consumption. In [37], arbitrary dynamic precision usage was investigated to improve energy-efficiency, compared to [38].

In [40], dynamic precision scaling was explored as a knob alongside voltage overscaling in arithmetic units within an extended design space that included energy, delay and QoS. In [39], a dynamic precision scaling technique within a pipelined core was proposed for facilitating voltage over-scaling while limiting the QoS loss. In [36], a design methodology for dynamically adapting the algorithmic effort and supply voltage in low power logic and memory architectures was presented. Such hardware focused design methodologies showcased the benefits of dynamic precision scaling to dynamically adjust the power and quality but these were not applied within NNs and combined with runtime policies as in this paper.

Overall, unlike the previous dynamic precision scaling approaches that adapt arithmetic precision to a level of noise of the input and a target accuracy, our approach can seek optimised network architectures during runtime according to dynamically varying compute resource constraints.

2.4 Autotuning with Search Algorithm Selection

Ansel et al. [41] proposed a program autotuning method, named OpenTuner, that selects the best-performing search algorithm out of multiple search algorithm candidates based on an objective function that considers the exploitation/exploration trade-off. OpenTuner employs a search module to search for the best configuration of multiple parameters and a measurement module to measure a configuration chosen by the search module. OpenTuner was designed to seek the best configuration of compiler parameters (e.g., GCC, G++, etc.) for a given program to optimise the computational speed. Rubio-González et al. [42] proposed an automated precision tuning tool, named Precimonious, to improve computational speed. Precimonious employs a delta-debugging-based search algorithm to test whether replacing any variable in a program with a lower precision violates either the accuracy constraint or the performance constraint.

The previous autotuning methods found the optimised arithmetic precision or compiler flags offline, according to the application characteristics. In contrast, our approach can find optimised network architectures during runtime according to dynamically varying compute resource constraints.

3 Proposed Framework

The main aim of the framework is to maintain the required frame rate by trading off accuracy for tasks in a way that the average operating accuracy is maximised (other optional constraints and objectives are discussed later in Sect. 3.3). Let us consider there are N co-executing but independent frame processing tasks with a defined number of input frames per second. Each task has an available set of NN configurations that trade-off accuracy against execution time per frame. The high level optimisation problem to maximise average operating accuracy for all tasks can then be defined as:

$$\begin{aligned} \begin{aligned}&\underset{X\in \Omega ^N}{\text {maximise}}&\mathrm {average\_accuracy}(X) \\&\text {subject to}&\sum _{i=1}^N F(x_i) \cdot T(x_i) \le 1 \\ \end{aligned} \end{aligned}$$

where T represents the execution time per frame, F represents the frames per second and \(X = (x_1, x_2,\ldots ,x_N)\) represents the chosen NN configuration for each unique task, each chosen from the set of all available configurations, \(\Omega\).
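As a small illustration of the problem above, the following Python sketch enumerates every combination of configurations for a toy instance and keeps the feasible combination with the highest average accuracy. The function name `best_assignment`, the tuple layout and all numeric values are hypothetical. Exhaustive search is used here only for clarity; it is a stand-in and not the ILP-based method described later.

```python
from itertools import product

def best_assignment(configs, fps):
    """Exhaustively pick one (time_per_frame_s, accuracy) config per task.

    configs: list (one entry per task) of lists of (time_s, accuracy) tuples.
    fps:     required frames per second for each task.
    Returns (average_accuracy, chosen_indices) or None if nothing is feasible.
    """
    best = None
    for choice in product(*(range(len(c)) for c in configs)):
        # Constraint: sum_i F(x_i) * T(x_i) <= 1 (work per second fits in one second).
        load = sum(fps[i] * configs[i][j][0] for i, j in enumerate(choice))
        if load > 1.0:
            continue
        avg_acc = sum(configs[i][j][1] for i, j in enumerate(choice)) / len(configs)
        if best is None or avg_acc > best[0]:
            best = (avg_acc, choice)
    return best

# Toy data (hypothetical): two tasks, each with a slow/accurate
# and a fast/less accurate configuration.
toy_configs = [
    [(0.040, 1.00), (0.010, 0.90)],
    [(0.030, 1.00), (0.015, 0.92)],
]
result = best_assignment(toy_configs, fps=[15, 20])
```

In this toy instance both tasks cannot run their most accurate configuration at the requested rates, so the search keeps full accuracy for the first task and downgrades the second.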

Unlike previous runtime adaptation approaches, our work utilises multiple smaller, independent networks to achieve the best resource-accuracy trade-off for given runtime-varied compute resource and accuracy constraints. The approach, illustrated in Fig. 1, makes use of a large design space generated via narrow searches on various segments of the NN models (layers, channels, weights, etc.) [4]. Targeted narrow searches to cover a wide design space provide broadly varying computing characteristics in terms of power, memory, energy consumption and latency. This work profiles and selects the optimal configurations to suit runtime varying objectives, optimised via the design of an ILP-based problem definition and solution. The runtime selection environment allows transparent switching of network configurations.

The methodology exploits the design space to create PO points targeting different optimisation objectives, as variation in the NN segments can create varying data flows, as well as spatial and temporal computational densities; with varying computing resources, models may behave differently. For example, a low-memory model may achieve the lowest execution time on a memory-constrained device, while a lower spatial computation complexity may provide the best efficiency for energy-constrained environments.

Figure 1 Framework overview

The first step thoroughly profiles the models for all of the different parameters that may need to be optimised or constrained during runtime. This includes profiling on both the GPU and the CPU. The data is analysed and used in the design of the runtime, with the aim of co-scheduling execution on the CPU and GPU. The profiling is used to create PO curves with different parameters, enabling a targeted design space that can be optimised further at runtime as per the varying objectives and constraints. We define the optimisation function using an ILP-based formulation. ILP is used in situations where some or all of the decision variables can be represented as integers or discrete values, rather than continuous ones, such as in our case of trying to find a solution from a set of distinct model configurations. It provides an easy way to define constraints and has been found to consistently provide high quality solutions for problems targeting low-latency task scheduling [43,44,45]. The overhead of ILP is reduced further by optimisations such as reducing the search space by applying PO selection on inputs.

The problem formulation via ILP allows runtime tunable constraints to be set as well as permits variation in the optimisation objectives. We also design a runtime environment that enables transparent switching of various networks. It monitors runtime varying constraints, chooses appropriate PO selection, manages inputs to the ILP solver for the optimal solution and enables transparent switching of the networks while enabling the maximum QoS.

3.1 Profiling and Analysis of Neural Networks

Using the generated models, execution time, power and memory utilisation are profiled on both the CPU and GPU. Although the CPU can be expected to be slower than the GPU for NN inferences, it can provide additional resources when trying to meet the required QoS. The profiling helps to identify when the CPU can be used to share the task’s execution.

For the GPU, profiling is done using both the standard PyTorch and Nvidia’s TensorRT library models, the latter of which is a closed source library that optimises the execution for the GPU on the underlying platform. The TensorRT library parses the input model into subgraphs. It then identifies areas of optimisation, such as evaluating nodes with static values to replace them with constants. Subgraphs that directly map to TensorRT operations are replaced with TensorRT layers that are optimised for the underlying architecture. The remaining subgraphs are interlinked with TensorRT layers and the resultant network, with the same functionality as the input network, is stored as a representation called an engine, which is an encoded file that needs to be read and decoded in an engine build process at runtime. For truncated networks, optimising via TensorRT is key as truncation can result in lower resource utilisation.

Although the underlying hardware provides support for half precision execution, the gains in execution time may not be significant if there is not enough parallelism, particularly for truncated lightweight networks. However, the lower resource utilisation may lead to lower power utilisation. Similarly, half precision requires lower storage space and thus lower overhead for loading and decoding engines at runtime.

Finally, along with profiling for resource requirements, we have also analysed the benchmarks for standard deviation in execution time during multiple runs. This helps to verify the suitability of pre-emptive scheduling of resources.

3.2 Pareto-optimal Selection

Pareto-optimal (PO) selection identifies points which provide an optimal trade-off between two or more parameters, one of which is always accuracy and the others may include execution time, power and memory. Whilst it reduces the runtime overhead, there is a trade-off in terms of the quality of the resulting solution as it discards the design points which do not provide an optimal point based on the input parameters. To overcome this, different PO selections are created and used at runtime as per the optimisation objective.

For a standard QoS metric, two parameters, accuracy and execution time, are sufficient but for hardware constrained devices, power and memory constraints may also apply. Thus, varying dimensions representing one or more of these constraint parameters (in different combinations) are introduced during PO selection. In one set of combinations, we apply both time and the selected parameter (power or memory) during PO selection against accuracy while in the other, we only consider the selected parameter and accuracy. The PO selections result in a small search space and improved optimisation of the selected parameters against accuracy.
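The selection described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each profiled configuration is a tuple whose first element is accuracy (to be maximised) followed by one or more costs (to be minimised); the function name `pareto_front` and the profile values are hypothetical.

```python
def pareto_front(points):
    """Return the points that are not dominated by any other point.

    Each point is (accuracy, cost_1, cost_2, ...): accuracy is maximised,
    every cost (e.g. execution time, power, memory) is minimised.
    """
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        no_worse = a[0] >= b[0] and all(x <= y for x, y in zip(a[1:], b[1:]))
        better = a[0] > b[0] or any(x < y for x, y in zip(a[1:], b[1:]))
        return no_worse and better

    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical profile entries: (relative accuracy, time per frame in ms, memory in MB).
profiles = [(1.00, 40, 300), (0.95, 25, 200), (0.94, 30, 250), (0.90, 10, 220)]

# Selection using both time and memory against accuracy.
front_time_mem = pareto_front(profiles)
# Selection using only memory against accuracy.
front_mem_only = pareto_front([(a, m) for a, _, m in profiles])
```

With these values the memory-only selection keeps fewer points than the time-and-memory selection, which shows why keeping several PO selections and choosing among them per objective can matter.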

3.3 Optimisation Function

Description: Let us consider a set of N dynamically varying tasks \(Q = \{q_{1}, q_{2},... q_{N}\}\) to be executed on a device. Dynamic operation means that the throughput requirements per task can vary with time, including going inactive. We define the throughput as the required frames per second (FPS) which, for the given task set, Q, can be given by \(F = \{f_{1}, f_{2},... f_{N}\}\) where \(0 \le f_i \le 30\). Furthermore, each task, \(q_{i}\), has multiple configurations where each configuration’s implementation can provide varying throughput against accuracy, power/energy consumption and memory utilisation. For simplicity and without losing any advantages, we combine the configurations targeting algorithmic truncation and arithmetic precision into a single set which for task \(q_{i}\) can be represented by \(C_{i} = \{ci_{1}, ci_{2},... ci_{M}\}\). Then, for each configuration set, \(C_{i}\), we have an associated set of execution times, \(T_{i}\), accuracies, \(A_{i}\), power consumption, \(P_{i}\), and memory usage, \(M_{i}\).

In an ideal scenario, all tasks can run at the maximum accuracy while achieving the required FPS. However, for higher throughput requirements under low computing power, the aim is to select an appropriate configuration for each task that maximises the average operating accuracy, \(A_{avg}\), while maintaining the maximum average FPS and adhering to other optional constraints such as the minimum threshold accuracy per task, total power and memory consumption. Moreover, the runtime objectives can also be changed to minimise parameters such as power/energy/memory consumption, while ensuring a minimum threshold for others.

The ILP problem formulation aims to find a set, \(C_{optimal} =\{co_{1}, co_{2},...co_{N} \} \mid co_{i} \in C_{i}\), which represents one configuration each for all of the tasks to be co-executed. If the corresponding execution time per frame, accuracy, power, energy and memory consumption of the found set of configurations are represented by sets \(T_{optimal}\), \(A_{optimal}\), \(P_{optimal}\), \(E_{optimal}\) and \(M_{optimal}\), respectively, and the elements in each are represented by the lower-case form of the same letter with subscript o, the constraints for the problem are defined by:

  • the total execution time to process all frames for all tasks at the used FPS value should not exceed 1 second.

    $$\begin{aligned} \sum _{i=1}^{N} t_{o-i} \times FPS_{i} \le 1 \end{aligned}$$
  • individual tasks’ execution time should be higher than corresponding set thresholds to ensure tasks’ priorities and resource distribution criteria.

    $$\begin{aligned} x > T_{th} \forall x \in T_{optimal} \end{aligned}$$
  • tasks’ individual accuracies should be higher than corresponding thresholds set for respective task. Threshold can also be set for average accuracy for all tasks when optimising objectives other than the accuracy.

    $$\begin{aligned} a_{o-i} > A_{th-i} \quad \forall i \mid 1 \le i \le N \end{aligned}$$
  • peak power consumption for all tasks should be less than the threshold.

    $$\begin{aligned} x < P_{th} \forall x \in P_{optimal} \end{aligned}$$
  • total energy consumption in Joules should be less than the threshold.

    $$\begin{aligned} \sum _{i=1}^{N} e_{o-i} =\sum _{i=1}^{N} p_{o-i} \times t_{o-i} \times FPS_{i} < E_{th} \end{aligned}$$
  • total memory consumption of all tasks should be less than the threshold. All of peak power, energy and memory consumption thresholds can also be set individually for each task.

    $$\begin{aligned} \sum _{i=1}^{N} m_{o-i} < M_{th} \end{aligned}$$

where the subscript th defines the threshold set by the user for various parameters. Please note that all constraints other than the first, that requires all frames execution within available time, are optional and/or platform dependent.
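To make the constraints concrete, the sketch below checks one candidate selection (one configuration per task) against them. It is an illustration under stated assumptions: the function name `is_feasible`, the dictionary keys and the units are hypothetical, thresholds left as `None` are treated as unset, and the per-task minimum execution time constraint is omitted for brevity.

```python
def is_feasible(selection, fps, a_th=None, p_th=None, e_th=None, m_th=None):
    """Check one chosen configuration per task against the constraints.

    selection: list of dicts with keys 't' (s/frame), 'a' (accuracy),
               'p' (W) and 'm' (MB), one dict per task.
    fps:       required frames per second per task.
    a_th:      optional list of per-task accuracy thresholds.
    p_th, e_th, m_th: optional peak power, total energy and total memory limits.
    """
    # Mandatory: all frames of all tasks must be processed within one second.
    if sum(c["t"] * f for c, f in zip(selection, fps)) > 1.0:
        return False
    # Optional: each task's accuracy must exceed its own threshold.
    if a_th is not None and any(c["a"] <= th for c, th in zip(selection, a_th)):
        return False
    # Optional: peak power of every task must stay below the limit.
    if p_th is not None and any(c["p"] >= p_th for c in selection):
        return False
    # Optional: energy = power x time per frame x frames per second, summed over tasks.
    energy = sum(c["p"] * c["t"] * f for c, f in zip(selection, fps))
    if e_th is not None and energy >= e_th:
        return False
    # Optional: all models are resident together, so memory is summed.
    if m_th is not None and sum(c["m"] for c in selection) >= m_th:
        return False
    return True

# Hypothetical selection for two tasks.
example = [
    {"t": 0.02, "a": 0.95, "p": 4.0, "m": 200},
    {"t": 0.03, "a": 0.90, "p": 5.0, "m": 300},
]
ok_unconstrained = is_feasible(example, fps=[10, 20])
ok_low_memory = is_feasible(example, fps=[10, 20], m_th=400)
```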

The joint memory consumption of all NN models is used as they are simultaneously kept in memory. This avoids reloading the models on each context switch. The thresholds for different parameters are user-defined, can be applied at any given time and are runtime variable. For example, the individual task’s requirements for the execution time and accuracy can be defined for only a selection of tasks or can be completely omitted.

Although the maximisation of accuracy is the key motivation of this work, the optimisation objective can optionally be dynamically selected from any of:

  • maximising average accuracy, i.e., sum of the operating accuracies for all tasks;

  • minimising total memory consumption;

  • minimising combined energy consumption for all tasks.

Accuracy and energy optimisation provide a direct performance improvement. Minimisation of memory can enhance performance for other co-executing tasks, not directly targeted by the optimisation function. Each objective can be targeted while keeping constraints on others. Constraints can also be set on the same parameters that are being optimised to restrict the solver from over-optimising.
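One common way to encode such a selection problem as an ILP, given here as an illustrative sketch rather than the exact encoding used in the implementation, is to introduce a binary variable \(y_{i,j}\) that equals 1 when configuration \(ci_{j}\) is selected for task \(q_{i}\). The symbols \(y_{i,j}\), \(a_{i,j}\) and \(t_{i,j}\) (the accuracy and execution time per frame of that configuration) are introduced only for this illustration. For the accuracy objective with the mandatory timing constraint:

$$\begin{aligned} \underset{y}{\text {maximise}} \quad&\frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{M} a_{i,j}\, y_{i,j} \\ \text {subject to} \quad&\sum _{j=1}^{M} y_{i,j} = 1, \quad 1 \le i \le N \\&\sum _{i=1}^{N}\sum _{j=1}^{M} t_{i,j} \times FPS_{i} \times y_{i,j} \le 1 \\&y_{i,j} \in \{0, 1\} \end{aligned}$$

The optional constraints remain linear in \(y_{i,j}\) because the profiled values are constants. For example, with \(m_{i,j}\) and \(p_{i,j}\) denoting the profiled memory and power of each configuration, the memory limit becomes \(\sum _{i}\sum _{j} m_{i,j}\, y_{i,j} < M_{th}\) and the energy limit becomes \(\sum _{i}\sum _{j} p_{i,j} \times t_{i,j} \times FPS_{i} \times y_{i,j} < E_{th}\). Changing the objective to memory or energy minimisation only swaps the objective row for the corresponding sum.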

3.4 Runtime System

The runtime monitors the changing constraints, generates new configurations and implements transparent switching of configurations. It is lightweight and responsive, and ensures optimal temporal utilisation of computing resources.

In a stable system state, the optimal networks for all tasks are already loaded into the main memory and the frames for all tasks are being processed with the maximum QoS. The runtime monitors all of the system and task constraints. When any of these change, the runtime enters a transition state to switch to a new optimal configuration, while allowing the tasks to continue execution in the current configuration.

During the transition state, the runtime calls the ILP solver to find the next optimal combination. In the rare cases where a solution cannot be found, the runtime uses a user-provided priority to iteratively reduce the allowed FPS per task by 1. In this scenario, if there is a minimum time allocation constraint set for any task, it checks if that particular task is already within its limit and avoids further FPS reduction, by switching to a simpler but less optimal heuristic. Eventually, the ILP solver generates the optimal combination of configurations to be used for each task. The runtime then checks which tasks require a change in configuration. The configuration change adds an overhead due to the time spent on copying the model into main memory and the TensorRT deserialisation process, which decodes stored engines to make them available for use by TensorRT.
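The fallback described above can be sketched as follows. This is one possible reading, stated as an assumption: the lowest-priority task is reduced first, one frame at a time, until the solver reports a solution or every task has reached its floor. The names `relax_fps`, `is_solvable` and `min_fps` are hypothetical, and `is_solvable` stands in for a call to the ILP solver.

```python
def relax_fps(fps, priority, is_solvable, min_fps=None):
    """Reduce FPS by 1, lowest-priority task first, until a solution exists.

    fps:         requested frames per second per task.
    priority:    task indices ordered from least to most important (assumed).
    is_solvable: callable standing in for the ILP solver; returns True when
                 a feasible combination exists for the given FPS list.
    min_fps:     optional per-task floor below which FPS is not reduced.
    Returns the relaxed FPS list, or None if no reduction is left.
    """
    fps = list(fps)
    floor = min_fps or [0] * len(fps)
    while not is_solvable(fps):
        for task in priority:
            if fps[task] > floor[task]:
                fps[task] -= 1
                break
        else:
            # Every task is at its floor: hand over to a simpler heuristic.
            return None
    return fps

# Stand-in solver (hypothetical): the fastest configurations take 20 ms and
# 30 ms per frame, and all frames must fit into 1000 ms.
def fits_in_one_second(f):
    return 20 * f[0] + 30 * f[1] <= 1000

relaxed = relax_fps([30, 30], priority=[1, 0], is_solvable=fits_in_one_second)
```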

To reduce this overhead, the runtime implements two optimisations targeting execution on CPU and GPU. Firstly, for frames being processed on the GPU, the runtime uses a buffer to load a new configuration in the background. Whenever a transition is started, the runtime allocates the time as needed for the tasks that do not need a configuration change, even though they may still be processing a different number of frames than the previous iteration. The remaining time (excluding the solution finder time) is equally divided across tasks which require a configuration change. For these tasks, a FPS is selected which can be accommodated in that time slot, even if it is lower than required.
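The time allocation during a transition can be illustrated with the short sketch below. It assumes a one-second scheduling cycle and per-frame times taken from the currently loaded configurations; the function name `transition_fps` and the dictionary keys are hypothetical.

```python
def transition_fps(tasks, solver_time=0.0, budget=1.0):
    """Split one scheduling cycle between unchanged and changing tasks.

    tasks: list of dicts with 't' (s/frame in the current configuration),
           'fps' (requested frames per second) and 'changing' (bool).
    solver_time: time already spent finding the new solution, in seconds.
    Returns the FPS each task can be served in this cycle.
    """
    changing = [k for k in tasks if k["changing"]]
    # Tasks that keep their configuration receive exactly the time they need.
    used = sum(k["t"] * k["fps"] for k in tasks if not k["changing"])
    remaining = max(budget - solver_time - used, 0.0)
    # The rest is divided equally across tasks awaiting a new configuration.
    slot = remaining / len(changing) if changing else 0.0
    served = []
    for k in tasks:
        if k["changing"]:
            # Largest whole number of frames fitting the slot, capped at the request.
            served.append(min(k["fps"], int(slot / k["t"])))
        else:
            served.append(k["fps"])
    return served

# Hypothetical cycle: task 0 keeps its configuration, tasks 1 and 2 are changing.
cycle = [
    {"t": 0.01, "fps": 30, "changing": False},
    {"t": 0.04, "fps": 20, "changing": True},
    {"t": 0.02, "fps": 10, "changing": True},
]
served = transition_fps(cycle, solver_time=0.1)
```

Here the second task is served below its requested rate for the cycle, which is the situation in which its remaining frames become candidates for offloading.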

The new configurations are loaded in parallel. In the first implementation, the runtime switched to a configuration as soon as it was available for a task. The remaining time was re-evaluated and allocated to the tasks for which a configuration load was pending. With this approach, an extra step was required for updating the scheduling of frames for some tasks. It also resulted in uncertainty in the achieved FPS for that cycle, as we found that the standard deviation for engine build times was higher than for the model processing, due to dynamic memory contentions.

To tackle these challenges, we fixed the latency of the transition period at one second. This covers the worst case, as the combined configuration load times were measured to be lower than one second for up to 3 new task configurations loaded in a single cycle. Thus, in the current implementation, the frames of tasks which require a configuration update and cannot be fitted onto the GPU during the transition state are offloaded to the CPU. The process continues until all of the tasks have their configurations updated, at which point the runtime exits the transition state.

Please note that the design of the runtime optimisation via transprecision can also be applied to applications other than NNs and machine vision. However, the analysis of accuracy-execution time trade-offs and computing characteristics such as predictability of execution time needs to be performed beforehand to achieve an exact understanding of net gains.

4 Results

4.1 Experimental Setup

In our experiments, we used an NVIDIA Jetson Nano platform, comprising a quad-core ARM Cortex-A57 CPU @ 1.43 GHz and a 128-core Maxwell GPU. It has 4 GB of 64-bit LPDDR4 main memory and a class 10 SanDisk micro-SD card with up to 170 MB/s read and 90 MB/s write speeds. The runtime was created in Python 3.8; the NN architectures were defined using PyTorch v1.7 and then optimised using NVIDIA TensorRT v3.0 for GPU.

For evaluation, we run in parallel up to 5 independent tasks (Tasks 1 - 5), where each task represents one of the five different NNs described in Sect. 2. We run each task while varying parameters, such as the FPS, accuracy requirements, etc. The accuracy values used for analysis are relative values compared to the maximum achievable for each network. All models are built for input images of size 32 \(\times\) 32 pixels.

To compare the performance and to allow usage where the ILP solver cannot find a solution, the runtime also implements other heuristics. These include: Fair FPS, which tries to allocate time per task so as to maintain the same FPS for all tasks; Fair Time, which allocates the same time slot to all tasks; and Greedy, which allocates the requested time to the first task, the remaining time to the second task, and so on.
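The three baseline heuristics can be sketched in a few lines. This is a simplified illustration, not the runtime's code: the function names are hypothetical, each task is assumed to have a single per-frame execution time (its maximum-accuracy model), and times are given in integer milliseconds.

```python
from typing import List


def fair_time(frame_ms: List[int], required_fps: List[int], budget_ms: int) -> List[int]:
    """Give every task the same time slot; FPS is whatever fits in it."""
    slot = budget_ms / len(frame_ms)
    return [min(req, int(slot // t)) for t, req in zip(frame_ms, required_fps)]


def fair_fps(frame_ms: List[int], required_fps: List[int], budget_ms: int) -> List[int]:
    """Give every task the same FPS, the highest that fits the budget."""
    common = int(budget_ms // sum(frame_ms))
    return [min(req, common) for req in required_fps]


def greedy(frame_ms: List[int], required_fps: List[int], budget_ms: int) -> List[int]:
    """Serve tasks in order; each takes what it requests from what is left."""
    remaining = budget_ms
    achieved = []
    for t, req in zip(frame_ms, required_fps):
        frames = min(req, int(remaining // t))
        achieved.append(frames)
        remaining -= frames * t
    return achieved


frame_ms = [20, 40, 50]
required = [30, 30, 30]
for heuristic in (fair_time, fair_fps, greedy):
    print(heuristic.__name__, heuristic(frame_ms, required, 950))
```

With a 950 ms budget, Greedy fully serves the first task and starves the last one, which mirrors the behaviour discussed for this heuristic in the results.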

From the starting point of 3000 models, the Pareto-optimal selections choose 128 models for the GPU for transprecision-based execution, requiring a disk usage of 395 MB. The approach keeps only the model being used at runtime in DDR memory. However, given the small memory usage of models specifically designed for resource-constrained devices, there is an option to keep everything in DDR memory to reduce latency further. Models that run on the CPU have a storage format different from that of the GPU and require separate storage of similar size.

4.2 Speedup Analysis

Firstly, a benchmark analysis for the various environments is generated by analysing the change in execution time, power and memory when varying the parameters. These include use of the TensorRT library, as well as half precision number representation when executing on GPU and finally, the best configurations when executing the models on CPU.

4.2.1 TensorRT

Table 2 gives details of the average and maximum improvement over the baseline of the single precision computation for various levels of accuracy via use of TensorRT on GPU. It shows that TensorRT can speed up execution by up to \(18.8\times\). Generally, this is due to better optimisation of the resource utilisation and comes at up to \(7\times\) higher power usage as the resource usage increases. The memory consumption varied inconsistently, i.e., it was higher in some cases and lower in others.

4.2.2 Half Precision

The comparison against single precision when using half precision arithmetic on the GPU is provided in Table 3. Although the execution time for half precision is similar to single precision on average, it can provide better performance in some cases. Moreover, it provides a more significant improvement in power usage (50% or less on average for some of the networks) owing to lower resource utilisation as well as lower memory traffic, which also results in more than 50% lower engine build times on average. The provided measurements highlight important information for designers looking to optimise lightweight neural networks on GPUs for the Jetson Nano platform.

Table 2 Speedup over the baseline of the single precision computation using TensorRT on GPU.
Table 3 Relative values (\(\times\)) for half precision over single precision on GPU.

4.2.3 Execution on CPU

We also provide an analysis of executing the same models on CPU and compare a selection of parameters with GPU. Firstly, we look at the execution time variation with accuracy for GPU and CPU in Fig. 2. We use the best configurations for CPU (from single or half precision execution). It is clear from the different scaling on the x-axis that the GPU is much faster. The average speedup of GPU over CPU is also summarised in Table 4. The CPU is unable to process even a single frame per second for some of the models at maximum accuracy; MobileNetV2 is the slowest, as processing a single frame at more than 96% accuracy takes more than 1 second. However, for lower accuracy and for other models, the execution time for CPU drops, making it possible to process up to a few frames in less than 1 second.

Table 4 Average (\(\times\)) for GPU over CPU.

Indeed, the speedup achieved by trading off accuracy is much higher on CPU than GPU. In Fig. 3, the relative cost of achieving higher accuracies, in terms of execution time relative to the lowest accuracy, is shown for GPU and CPU. Not only is the CPU slower than the GPU in absolute execution time, but its relative cost of higher accuracy is also much higher. This means that the gains made by transprecision approaches can be much higher on the CPU, if the GPU is not available and the same resource-accuracy trade-off needs to be optimised for applications running on the CPU only.

Figure 2 Achievable accuracies at various execution times.

Figure 3 Execution time relative to minimum accuracies for various models for CPU and GPU.

The variation of memory utilisation of models against accuracy on CPU shows similar behaviour as on GPU.

The energy consumption follows a different pattern. Although the energy consumption is much higher on the CPU due to the higher execution time and power consumption (Table 4), the scaling of energy with execution time is almost linear. On GPUs, although the energy generally increases with accuracy, individual data points can be displaced. This is because the varying computation and data-flow architectures for differing accuracies present varying opportunities for optimising the underlying GPU resource usage, with the average GPU utilisation for all architectures and models lying around 57%. The CPU showed 100% occupancy, leading to a more stable dynamic power consumption of around 2 W, as compared to the varying power numbers for GPU in Table 2. The memory consumption on CPU can be either lower or higher than on GPU, as shown in Table 4.

Generally, the GPU outperforms the CPU by large margins and the CPU cannot provide a viable alternative. However, the analysis is presented for usage in scenarios where a GPU device is not available or when CPUs can be used to complement GPUs; for example, in our proposed framework, where the CPU is used to provide support when the GPU is switching models. Furthermore, the CPU numbers provided are for single-core execution, and a speed-up of up to \(4\times\) can be achieved by utilising all 4 CPU cores. Please also note that the ARM Cortex-A57 cores are low-power, energy-efficient cores. Higher performance can be achieved if the platform contains high-performance cores such as the ARM Cortex-A78 or ARM Cortex-X1.

4.3 Dynamic Variation in Execution Time

The proposed runtime approach is based on preemptive scheduling using offline profiling. Hence, before analysing the performance of the proposed heuristics and runtime, we examine the dynamic variation in execution time of the models used. This allows us to verify that the throughput rate is repeatable and, if needed, to put a margin on execution time such that the system performs as intended even in the worst-case scenario. To do that, we execute each model 100 times and determine the instantaneous and average standard deviation in execution time. Figure 4 presents the standard deviation as a percentage of total execution time for various levels of accuracy. Although the standard deviation can be high for some levels of accuracy, the average for all tasks is lower than 8% for both the GPU and CPU.

Considering this analysis, we maintain a margin of 10% to account for dynamic variations. Furthermore, for all tasks and accuracy levels, the averages for GPU and CPU lie at 5.5% and 5.4%, respectively. Moreover, the standard deviation is higher for lower execution times, particularly for the CPU. This suggests that the variation may be due to runtime OS scheduling. Thus, performance and energy gains can be achieved if the margin is set each time based on the individual execution times of tasks, i.e., higher for smaller tasks and vice versa.
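As a small worked illustration of this profiling step, the sketch below computes the standard deviation as a percentage of the mean execution time and applies the 10% margin to a profiled time. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the estimator is not specified here.

```python
import statistics
from typing import Sequence


def relative_std_percent(samples_ms: Sequence[float]) -> float:
    """Standard deviation as a percentage of the mean execution time.

    Assumption: population standard deviation over the repeated runs.
    """
    return 100.0 * statistics.pstdev(samples_ms) / statistics.mean(samples_ms)


def budgeted_time_ms(profiled_ms: float, margin_percent: float = 10.0) -> float:
    """Inflate a profiled execution time by the safety margin."""
    return profiled_ms * (1.0 + margin_percent / 100.0)


runs_ms = [9.5, 10.0, 10.5, 10.0]  # made-up timings for illustration only
print(round(relative_std_percent(runs_ms), 2))
print(round(budgeted_time_ms(20.0), 2))
```

A per-task margin, as suggested above, would amount to passing a different `margin_percent` for each task depending on its execution time.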

Figure 4 Standard deviation in execution time for various tasks.

Figure 5 Total and per task execution time for various heuristics.

Figure 6 Accuracy and achieved frame throughput for various heuristics.

4.4 Changing FPS per Task

We provide an analysis of the various heuristics used at runtime, including the designed transprecision-based solution that looks to maximise accuracy while maintaining a 100% frame rate, the Fair Time distribution, the Fair FPS per task and the Greedy heuristic.

For all approaches other than the transprecision-based one, the time to find a solution is negligible and does not affect the achieved FPS. For the transprecision-based approach, the ILP solver can take between 80 and 180 ms, with an average engine build time of 20 to 80 ms for different NNs and a maximum of 400 ms. On average, engine updates are required for 1 to 2 tasks per iteration. As mentioned earlier, the engine update is done in parallel and does not stall the processing. The only effect is that, for the task requiring an engine update, a lower number of frames may be processed. Note that the time required to find a solution is also taken from the budget of the tasks for which engines need to be updated. The FPS degradation due to the overhead induced by the update process is considered when reporting the achieved FPS. However, the significance of both the solution and engine update times depends on how often the solution needs to be recalculated and the engines updated. For our experiments with a solution recalculation interval of 5 s, i.e., incoming parameters changing every 5 s, the overhead was less than 0.06%.

First of all, we vary the required FPS per task and analyse the utilisation of time by individual tasks and by all tasks combined in a one-second cycle when using different heuristics. Figure 5 shows that the non-optimised heuristics saturate the achievable FPS per task early within the available time budget and stay constant after that. Although the individual task times are difficult to see, the figure indicates a general trend. For example, with Greedy, some tasks are never executed beyond a certain FPS, as the first task uses the whole time budget. For Fair Time, most individual task times are similar. For Fair FPS, although individual tasks can have different times, they are constant after saturation. Only for transprecision can individual task times vary at higher FPS, depending on the accuracy used for each task. All heuristics are able to finish processing in real time before the deadline (1 second), showing that the preemptive scheduling is effective in this case, more so for transprecision execution, which is able to fill the 1 second slot to a higher percentage due to its flexibility. We kept a 50 ms guard band every second to account for CPU-GPU synchronisation and communication.

Next, we show the achieved QoS with varying required FPS per task in Fig. 6. The QoS is represented by the operating average accuracy (the relative accuracy against the maximum possible for each task) and the average of operating FPS relative to the required FPS for all tasks. As shown, non-optimised heuristics never use any configuration with lower accuracy and always maintain a maximum accuracy of 100%, but the achieved FPS drops sharply with increasing FPS per task; it goes as low as 8.67% for 30 FPS per task. Fair Time performs the best with an average of 48.78% and a minimum of 20% achieved FPS. The transprecision-based solution is able to maintain 99.92% FPS up to 30 FPS - the solution provides 100% FPS but the drop is due to the ILP solver and engine load overheads as explained above; here, we have not considered the offloading onto CPU which we analyse later. The high FPS comes at a slight drop of accuracy with an average of 97.83% and a minimum of 83.79% at 30 FPS. Even at 99% accuracy (at 15 FPS per task), the approach provides \(1.6\times\) higher FPS as compared to the next best.
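The two QoS measures used here can be written down as a short sketch. The function names are hypothetical and the numbers in the example are made up for illustration; only the definitions (accuracy relative to the maximum possible per task, and operating FPS relative to required FPS, both averaged over tasks) follow the text.

```python
from typing import Sequence


def average_relative_accuracy(
    accuracy: Sequence[float], max_accuracy: Sequence[float]
) -> float:
    """Mean over tasks of operating accuracy / maximum accuracy, in percent."""
    ratios = [a / m for a, m in zip(accuracy, max_accuracy)]
    return 100.0 * sum(ratios) / len(ratios)


def average_achieved_fps(
    achieved_fps: Sequence[float], required_fps: Sequence[float]
) -> float:
    """Mean over tasks of operating FPS / required FPS, in percent."""
    ratios = [a / r for a, r in zip(achieved_fps, required_fps)]
    return 100.0 * sum(ratios) / len(ratios)


# A Greedy-like outcome: full accuracy, but later tasks are starved.
print(average_relative_accuracy([0.90, 0.95], [0.90, 0.95]))
print(average_achieved_fps([30, 15, 0], [30, 30, 30]))
```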

4.5 Changing Accuracy

To explore the effectiveness of optimisation of memory and energy usage, we vary the minimum threshold accuracy for each task and analyse the findings in Fig. 9. To optimise the memory and energy usage, we make use of PO selections that are not solely based on time. This is due to the fact that the memory and energy usage does not always follow the same trade-off with accuracy as does the execution time, even though accuracy generally increases with all the parameters, i.e., time, energy and memory. To illustrate this, we show the variation of memory and energy consumption of the models with accuracy for PO selection based solely on time in Fig. 7. As we see, the same model can simultaneously have a higher execution time and a lower energy/memory usage when compared to a different model. This is due to varying underlying computation and data flows and the efficiency of resource utilisation. Thus, we proposed the use of different PO selections to feed the ILP solver when optimising execution time, energy consumption and memory usage.

Figure 8 shows the memory and energy usage of models after PO selections based on memory and energy, respectively. Along with adding some configurations that the time-based PO selection would have omitted, this results in a 50% reduction in the number of models that are fed to the ILP solver, while still achieving the optimal results.

Figure 7 Memory and energy scaling (relative to maximum values for each model type) with accuracy for various tasks.

Figure 8 Pareto-optimal selections for memory and energy against accuracy.

In order to further clarify different PO selections against accuracy, we define the criteria as follows. For time only, the selection finds the optimal points that provide best accuracy against time; for memory/energy, it uses points that provide the minimum memory/energy for any particular accuracy irrespective of time; finally, for time + memory/energy, it gives the optimal points based on time and then discards those which do not give the optimal accuracy based on memory/energy values as compared to adjacent values. We focus only on the targeted parameter, i.e., the total memory when using memory-based PO selections, and energy when using energy based PO in our results in Fig. 9. Using the appropriate PO selection, the optimisation objective is then altered to target minimisation of either memory or energy instead of accuracy, while keeping the rest of the constraints the same.
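One possible reading of these criteria is sketched below. It is an interpretation rather than the exact selection procedure: the function names are hypothetical, the configurations are made-up values, and the time + memory/energy case is approximated by filtering the time-based selection a second time on the other resource.

```python
from typing import Dict, List

Config = Dict[str, float]  # keys: "name", "time", "memory", "accuracy"


def pareto(configs: List[Config], cost_key: str) -> List[Config]:
    """Keep configurations not dominated in (cost, accuracy).

    A configuration is kept only if it is more accurate than every
    configuration with a lower or equal cost.
    """
    front, best_accuracy = [], float("-inf")
    for c in sorted(configs, key=lambda c: (c[cost_key], -c["accuracy"])):
        if c["accuracy"] > best_accuracy:
            front.append(c)
            best_accuracy = c["accuracy"]
    return front


def pareto_time_then(configs: List[Config], second_key: str) -> List[Config]:
    """Time-based selection, then discard points sub-optimal in `second_key`."""
    return pareto(pareto(configs, "time"), second_key)


configs = [
    {"name": "A", "time": 1, "memory": 5, "accuracy": 0.80},
    {"name": "B", "time": 2, "memory": 3, "accuracy": 0.85},
    {"name": "C", "time": 3, "memory": 4, "accuracy": 0.90},
    {"name": "D", "time": 4, "memory": 2, "accuracy": 0.88},
    {"name": "E", "time": 5, "memory": 6, "accuracy": 0.95},
]
print([c["name"] for c in pareto(configs, "time")])
print([c["name"] for c in pareto(configs, "memory")])
print([c["name"] for c in pareto_time_then(configs, "memory")])
```

In this toy data set, configuration D is omitted by the time-based selection but retained by the memory-based one, which is the kind of difference that motivates separate PO selections per objective.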

The results in Fig. 9 show that appropriate PO selections improve the efficiency of the ILP solver. For example, the memory-only PO selection for memory optimisation can provide on average 4.1% lower memory utilisation at 3% lower accuracy, while the energy-only selection can provide 50.2% lower energy solutions at 3.7% lower accuracy. The time PO selection still gives the highest average accuracy. The time + memory/energy selection provides a trade-off between the two extreme PO selections and achieves average memory/energy and accuracy values in between both.

Figure 9 Memory and energy optimisation against varying threshold accuracy while using different Pareto-optimal (PO) selections.

Figure 10 Accuracy and achieved frame throughput for various heuristics for randomly varying runtime constraints.

Figure 11 Accuracy and execution time when executed with objective to maximise accuracy for randomly varying runtime constraints.

Figure 12 Memory and energy objective optimisation executed with the appropriate Pareto-optimal selections for randomly varying runtime constraints.

4.6 Random Parameters

Finally, we evaluate the heuristics under a more dynamic and constrained environment, by varying the individual threshold accuracies and FPS per task at runtime. We vary the total energy and memory available for all tasks, as well as the peak power. The parameters are varied randomly per iteration (every 5 seconds) while their ranges are selected to stress the system. For all non-optimised heuristics, all constraints are ignored to run the maximum accuracy models and only FPS is varied.

As mentioned earlier, we use the CPU to offload some of the frames during the transition state. For CPU processing, we use up to 2 CPU cores, each allocated 1 task. This leaves the rest of the cores for OS jobs and for communicating with the GPU. We require that the extra frames be processed with a maximum latency of 1 second. With a higher acceptable latency, the CPU can process more frames, up to achieving an overall frame rate of 100%. As with GPUs, there is an overhead of loading the model into memory every time for processing. This is lower than for the GPU as it does not need decoding, but it can still be up to a couple of hundred milliseconds. This can be avoided by keeping a fixed-accuracy model in memory; however, this is not considered in this work as we target adaptive accuracy. The model can also be loaded for the CPU along with the GPU for later use when the GPU model needs to be replaced. Please also note that, along with execution time, memory and energy consumption are not considered as constraints for the CPU execution in the optimisation solver and are only considered an extra cost of the transition period.

As with the previous set of experiments, we firstly compare the achieved QoS in terms of accuracy and FPS for various heuristics in Fig. 10. All heuristics other than the proposed solution run at maximum accuracy for all tasks; however, the FPS drops significantly, with Fair Time achieving a maximum of 61% while Greedy drops to 41% on average. In contrast, transprecision is able to achieve 99.9% FPS at a slightly reduced accuracy of 97.5%.

As for the contribution of the CPU, the average achieved FPS without using the CPU during the transition state was 99.0%. The achievable FPS is less than 100% due to the overheads discussed earlier. Please note that the overhead will be lower if transitions happen less frequently than the current interval of 5 seconds. More importantly, the minimum achieved average FPS for all tasks in any iteration is 97.3% with CPU usage, as compared to 89.6% without it. Furthermore, with the use of the CPU, there is an increase in memory and energy usage. Considering the memory and power utilisation only for the transition duration when the CPU is used for processing, the memory increased slightly, by 0.6%, while the energy increased more significantly, by 11%. In further experiments targeting energy and memory optimisation per PO selection, we do not use CPU offloading, as it is only a transition cost and applies to all PO selections.

Next, we analyse the effect of varying PO selections on different optimisation objectives on the same data set. Firstly, Fig. 11 analyses the effect of PO selection while optimising the same objective, namely accuracy. The figure shows contrasting results to Fig. 9 in that the optimised for time approach, using time based PO selections, does not give the highest average accuracy. This is because in the current set of experiments, the energy and memory are more constrained than time and thus the PO selections based on energy/memory can offer better configurations to achieve highest accuracy. In this scenario, energy and memory optimal selections can provide 98.2% and 97.8% accuracy respectively as compared to 97.5% for the time-based selection. This suggests that in addition to optimising for a certain parameter, PO selections could be chosen, based on which resource is the most constrained.

Finally, using the same set of inputs, instead of optimising for accuracy, we optimise for energy and memory, using the appropriate PO selection for each, and compare with the time-based selection only. Figure 12 shows that the time-based PO selection performs worse in terms of energy and memory consumption. In contrast, the energy- and memory-based integrated optimisations can provide 49% lower energy and 4% lower memory, respectively, than the time-based selection.

4.7 General Comments

The work has proposed an approach along with a detailed trade-off analysis of various parameters with an aim to provide a direction for maintaining the maximum quality of service (QoS) within resource constraints for machine vision. However, the design space is large and it is not possible to provide exhaustive analysis. A number of observations can be made about some of the unexplored options.

The non-optimal heuristics have been kept relatively fixed to keep a steady baseline. To improve performance, they can use a selection of fixed accuracy models. This can also provide low memory execution (although we allocated as much memory to them as needed in experiments). Furthermore, energy for these heuristics may be controlled by executing a smaller number of frames.

The last set of experiments was run while putting constraints on all parameters of interest. More analysis can be undertaken with lenient constraints on some of the parameters. For example, the available memory can be increased to analyse energy optimisation and vice versa. Furthermore, frame rate can also be considered an objective and be optimised while fixing other constraints such as accuracy.

5 Conclusion

A key challenge for machine vision is to incorporate additional functionality with limited resources. In particular, it is becoming increasingly important to be able to dynamically adapt the operational accuracy of NN models in edge devices. A core issue is the ability to determine the optimal operating point with changing workloads, operating environment and optimisation objectives. The paper has described a transprecision approach based on transparent switching of NN configurations, thereby achieving gains for various optimisation objectives while maintaining maximum QoS. Results of employing 3000 NN configurations on an edge device employing a GPU were shown and discussed in more detail, thus adding to the contribution from the original publication [3]. The work was then extended to exploring these configurations on the CPU.