1 Introduction

Among the continual technological advancements in Artificial Intelligence (AI), particularly in Deep Learning (DL) and its artificial neural networks, two primary challenges must be addressed for their continued development. First, increasingly complex networks require computational capabilities only achievable by supercomputers to attain manageable training times, while during inference there is a growing trend toward using accelerators to boost performance. Second, and closely tied to the above, is the substantial energy consumption associated with these operations, particularly during the training phase.

Integrating this technology into mobile devices and the Internet of Things (IoT) is therefore challenging. These devices prioritize low power consumption at the expense of performance, so it is common to find scenarios where performance has been significantly reduced just to accommodate inference. This adjustment is often necessary because conducting training on such devices is impractical given their limited specifications.

Deep Learning has gained prominence in certain domains over classical machine learning techniques by demonstrating enhanced learning capabilities and lower generalization errors, particularly when dealing with vast amounts of data. This is evident in areas such as Computer Vision [1], as well as in other domains like Speech Recognition [2] and Natural Language Processing [3].

In the context of this research, facial recognition was our selected use case. The approach involves starting from pre-trained neural networks and leveraging knowledge transfer techniques to adapt them to this specific task. This technique has been shown to maintain a high level of accuracy and performance while using a relatively small training dataset [4], as described in Sect. 3. As a final step, the model is optimized using both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). These optimizations aim to reduce the computational load, leading to lower energy consumption, while ensuring minimal impact on accuracy, even in worst-case scenarios.

We studied DL models through knowledge transfer from pre-trained models, aiming for these models to identify a specific person’s face amid other faces. This objective is pursued while maintaining reasonable performance and energy consumption on an IoT device.

In our use case, we used the TensorFlow library for model development, TensorRT for optimization, and the NVIDIA Jetson AGX Orin development kit as the specialized IoT device. The results show manageable training times on the Jetson device, as well as greater energy efficiency than that achieved on the desktop computer (DC). Thanks to the optimization process, the inference performance gap between the two devices has been significantly reduced. While PTQ achieves greater acceleration, QAT achieves higher accuracy and energy efficiency, especially on the IoT device. Finally, the Deep Learning Accelerator (DLA), hereafter referred to simply as the accelerator, exhibits significantly degraded performance due to compatibility issues, although its energy efficiency is considerably higher than that of any of the other processing units. We outline the main contributions of this work below.

  • Demonstrating the feasibility, in terms of performance and energy efficiency, of training Deep Learning models on an IoT device.

  • Evaluating the behavior of quantization processes on binary models trained through transfer learning with a small dataset. Rather than degrading the quality of the predictions, quantization has shown the capability to enhance generalization to new data, achieving slight increases in accuracy.

  • Analyzing the most recent version to date of NVIDIA’s DLA inference accelerator, confirming compatibility issues with pre-trained models that affect its performance, as in its earlier version. However, an improvement in energy efficiency over GPU execution is observed for models with fewer incompatibilities.

This document is structured as follows: related work is presented in Sect. 2. Section 3 describes the methodology used in our research including technologies, devices, and neural networks used in development. Section 4 provides a detailed analysis of the evaluation results, while Sect. 5 summarizes the key findings and conclusions of this study.

2 Related work

The increase in computational power of modern devices and the vast amount of stored information available today have facilitated the rise and expansion of AI beyond its initial domains. The proliferation of AI has also led to a proportional increase in its use cases, encompassing progressively more complex tasks such as computer vision and natural language processing. This is exactly where Deep Learning gains prominence over classical AI algorithms [5].

The possibilities offered by Deep Learning continue to be a subject of debate and study [6]. However, along with its advantages, this technology also presents challenges. The high cost of developing a neural network from scratch, the requirement for increasingly sophisticated models to address progressively complex tasks, and the need for substantial amounts of data to enable accurate generalization by the models are among these challenges. Extending this technology to IoT devices also presents the challenge of running complex networks on energy-efficient devices with low computational power.

In this work, we will take advantage of the knowledge transfer [6] capability of neural networks as a method to address one of the challenges mentioned earlier. This technique applies the knowledge gained from solving one problem to a different yet related problem. In addition, methods such as artificial data augmentation [7] will be used to mitigate these challenges.

Several authors have studied implementations of Deep Learning systems on IoT devices, for example, a system for authentication in intelligent medical systems [8] or access control in a smart home [9], both employing a Raspberry Pi 3. It has been demonstrated that despite achieving accurate results, the major impediment is the limited computational capacity of these devices to run such systems. In [10], the feasibility of employing an autonomous driving system on the Jetson AGX Xavier device is studied, showcasing the achievable promising performance. Part of this success can be attributed to the DLA’s capability to alleviate the GPU’s computational load.

In this paper, we use the quantization technique [11] as the primary optimization method, as it has shown significant acceleration in popular computer vision models, together with a study of the two Deep Learning Accelerators available on the Jetson Orin device, which are specifically designed for the inference process.

In [10], a different optimization method is proposed compared to the one used in our research. It is based on task scheduling algorithms to prevent the CPU or GPU from idling while waiting for a task to complete. This optimization aims to take advantage of the features of a specific device. However, in IoT devices with limited computational capacity, the potential acceleration may be lower.

Some studies, such as [12], have evaluated the impact of methods such as PTQ and QAT on optimizing transfer learning-trained models. In our work, with a binary two-class model and a smaller dataset, we evaluate not only how these methods impact performance, but also their influence on the models’ generalization ability. Additionally, we assess how they perform when combined with an AI accelerator. Also, in [13], a comprehensive analysis was conducted on how quantization at different precisions (bit widths of operations) affects accuracy and energy consumption in three convolutional neural networks. In our research, we also emphasize the impact of quantization on performance, using both GPUs and an accelerator.

In [14], a comprehensive analysis is conducted on the impact of Post-Training Quantization on both a convolutional neural network and a Transformer-based network [15]. The study involves energy consumption measurements, in addition to performance evaluations, on multiple devices using various frameworks. In contrast, our proposal focuses on analyzing multiple models that have served as the basis for knowledge transfer. Our objective is to measure whether the loss of precision caused by quantization is reduced as a result of the simplification resulting from the reduction in the number of classifiable categories. Furthermore, the use of DLAs is mentioned in the cited work, but comprehensive performance and consumption measurements are lacking, which differs from our study in which we individually evaluated these aspects.

Our work diverges from previous studies that assess the optimization of transfer learning-trained models, including [16], which explores multiple models, and [17], whose optimization focuses exclusively on reducing computational load. We examine both energy efficiency and performance across two devices, in contrast to prior research that concentrates only on inference tasks. Additionally, our study assesses the feasibility of conducting training on an IoT device, considering both time and energy consumption.

3 Methodology

As introduced in Sect. 1, our use case is a facial recognition system capable of recognizing the face of a specific person versus any other face.

To train and evaluate the system, we have 1,100 images in total, evenly divided between two categories. These images are all 128 × 128 pixels in size and each feature an individual face. 550 of these images, showcasing a range of faces, are sourced from the Flickr-Faces-HQ dataset [18]. The remaining 550 images, which depict a specific individual, were collected personally and represent one of the authors of this paper. Due to the relatively small size of the dataset, data augmentation techniques are applied during training to increase the dataset’s size and diversity.
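As an illustration, such augmentation can be expressed with Keras preprocessing layers; the specific transformations and ranges below are assumptions chosen for the sketch, not necessarily the exact configuration used in this work.

    import tensorflow as tf

    # Hypothetical augmentation pipeline: the transformations and their ranges are
    # illustrative assumptions, not the exact ones applied in our experiments.
    data_augmentation = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomRotation(0.1),
        tf.keras.layers.RandomZoom(0.1),
        tf.keras.layers.RandomContrast(0.1),
    ])

    # Applied only at training time, e.g. on the input pipeline:
    # augmented_images = data_augmentation(images, training=True)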

Regarding the devices used for the computational experiment, Table 1 shows the components and specifications of both the NVIDIA Jetson and the desktop computer (DC) used as reference. It is important to consider that the Jetson’s consumption is capped at a maximum of 50W. It is worth highlighting the Deep Learning Accelerator (DLA), designed to enhance both performance and efficiency in deep learning model inference.

Table 1 Device components and specifications

The specifications of the Jetson AGX Orin do not provide detailed information about the power consumption of its individual components. Instead, the Jetson AGX Orin allows power consumption to be adjusted through various power modes of 15, 30, and 50 watts, up to a maximum performance mode of 60 watts. Additionally, the device is equipped with not one but two DLAs.

3.1 Neural networks

For the development of the system, we will utilize the TensorFlow platform and the Keras library, employing transfer learning with five distinct neural networks as base models: MobileNet-v2 [19], ResNet50 [20], ResNet152 [20], Xception [21], and VGG16 [22]. Table 2 provides an overview of the characteristics of these networks based on their available TensorFlow implementations. “Depth” refers to the number of layers, while “Parameters” represents the total number of trainable weights and biases.

Table 2 Neural networks comparison

The MobileNet network is known for its computational efficiency, making it suitable for devices with hardware limitations. ResNet50 and ResNet152 are based on the concept of residual blocks, enabling deeper training and improved image classification accuracy. Xception, on the other hand, introduces separable convolutions, which reduce the number of parameters and operations compared to other architectures. Lastly, the VGG16 network is known for its simplicity and depth, using small-sized convolutions and multiple layers, resulting in a high learning capacity.

For training these networks, binary accuracy and loss value metrics have been chosen, given that our model only categorizes into two classes.
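The following sketch, shown for MobileNet-v2 only, illustrates how the transfer learning setup and the binary metrics described above could be put together in Keras; the classification head, frozen base, and optimizer are assumptions made for illustration rather than the exact training configuration.

    import tensorflow as tf

    # Pre-trained base model (ImageNet weights) with the classification head removed;
    # 128 x 128 RGB inputs match the dataset described earlier in this section.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(128, 128, 3), include_top=False, weights="imagenet")
    base.trainable = False  # transfer learning: only the new head is trained

    inputs = tf.keras.Input(shape=(128, 128, 3))
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # target person vs. any other face
    model = tf.keras.Model(inputs, outputs)

    # Binary accuracy and loss, matching the two-class formulation of the problem.
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.BinaryAccuracy()])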

3.2 Optimization

The main optimization method chosen is quantization, a technique that reduces the precision of a model’s parameter values by approximating them to a discrete set. In this context, ’precision’ refers to the number of bits used for data representation or calculations, while ’accuracy’ refers to the quality of the model in terms of how often it provides the correct solution. In other words, precision concerns the granularity of data and operations determined by the bit count, whereas accuracy measures how closely the model’s outputs align with the desired results or objective truth.

In this work, both PTQ and QAT have been used. PTQ, as illustrated in Fig. 1, involves calculating the scaling factors used to reduce precision after training. A representative dataset, known as the calibration dataset, is utilized to capture the distribution of activations for each activation tensor. Subsequently, these distribution data are used to compute the scale value for each tensor. The distribution of each weight is also utilized to compute the weight scale.

Fig. 1 PTQ workflow for models with transfer learning
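As an example of this workflow, the sketch below applies INT8 PTQ through TF-TRT, assuming the trained model has been exported as a SavedModel and that calibration_images holds the representative calibration set; the exact conversion pipeline used in this work may differ.

    import numpy as np
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # INT8 PTQ: TensorRT runs the calibration data through the network to capture the
    # activation distributions and derive the per-tensor scaling factors.
    params = trt.TrtConversionParams(
        precision_mode=trt.TrtPrecisionMode.INT8, use_calibration=True)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir="saved_model_fp32", conversion_params=params)

    def calibration_input_fn():
        # calibration_images: representative samples, e.g. a subset of the training set
        for batch in np.array_split(calibration_images, 10):
            yield (batch.astype(np.float32),)

    converter.convert(calibration_input_fn=calibration_input_fn)
    converter.save("saved_model_ptq_int8")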

The workflow for QAT, shown in Fig. 2, differs from that of PTQ because this method involves inserting Quantization (Q) and Dequantization (DQ) nodes into the original model. These nodes simulate the quantization error and add it to the training loss so that the scaling values are calculated with minimal accuracy loss. The final quantization phase then applies the previously calculated values. When using transfer learning, we were only able to apply quantization to the base model, whereas the standard process adds the nodes to a pre-trained model and applies a few epochs of fine-tuning.

Fig. 2 QAT workflow for models with transfer learning
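One possible way of inserting the Q/DQ nodes is through the TensorFlow Model Optimization Toolkit, sketched below under the assumption that quantization is applied to the base model only, reusing the base and head from the earlier sketch; the actual toolchain used to insert the nodes may differ, and some architectures require additional per-layer annotation.

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Insert Q/DQ (fake-quantization) nodes into the pre-trained base model only, so the
    # simulated quantization error contributes to the training loss during fine-tuning.
    quantized_base = tfmot.quantization.keras.quantize_model(base)

    inputs = tf.keras.Input(shape=(128, 128, 3))
    x = quantized_base(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    qat_model = tf.keras.Model(inputs, outputs)

    qat_model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=[tf.keras.metrics.BinaryAccuracy()])
    # A short fine-tuning pass learns the scaling values with minimal accuracy loss
    # (train_ds and val_ds are hypothetical dataset objects):
    # qat_model.fit(train_ds, validation_data=val_ds, epochs=5)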

This approach allows a balance between optimizing the computational load and maintaining accuracy in the classification task.

For each of the models, four optimized versions have been developed: three via PTQ, modifying the bit width of operations to 32-bit floating point (FP32), 16-bit floating point (FP16), and 8-bit integer (INT8), and one via QAT, performed only at INT8. Additionally, an extra model with 8-bit integer precision has been created for execution on the DLA using only PTQ, since the DLA is incompatible with QAT.

In our use case, we have used TensorRT for optimization. This library performs a series of additional optimization operations to improve the performance of deep learning models on GPU hardware. In addition to quantizing the previously generated models, these operations include layer fusion, removal of unused layers, optimization of normalization layers, elimination of redundant computations, precision tuning, fusion of convolutions and activations, parallel execution, and generation of specialized inference code. These operations can yield varying results depending on the characteristics and architecture of the GPU on which they are executed.

3.3 Measurement of energy consumption

In order to estimate the energy consumption of training and inference tasks in the DC, the EML (Energy Measurement Library) API [23] has been utilized. Developed by the High-Performance Computing Group (GCAP) at the University of La Laguna, this tool simplifies energy consumption measurement by providing a unified interface for different types of hardware. It encapsulates the RAPL interface for Intel systems and uses NVML for NVIDIA GPUs, among other capabilities.

In the case of the Jetson device, readings from several INA3221 power rails are available, allowing power consumption to be measured. The VDD_GPU_SOC rail covers the GPU and the SoC (System on a Chip), the latter being the main chip of the Jetson excluding major sub-blocks such as the CPU, the GPU, and the accelerators. The VDD_CPU_CV rail covers the CPU and the various accelerators available on the device, including the DLA. Direct and independent measurement of the energy consumption of the CPU and the accelerators is not possible due to the device’s design.
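As an illustration of how these rails can be sampled, the sketch below reads the hwmon interface exposed by the INA3221 driver on the Jetson; the sysfs paths, channel layout, and rail labels vary between boards and JetPack versions, so they should be treated as assumptions.

    import glob

    def read_rail_power_mw(rail_label="VDD_GPU_SOC"):
        # Scan the hwmon channels for one whose label matches the requested rail,
        # then compute power from the reported voltage (mV) and current (mA).
        for label_path in glob.glob("/sys/class/hwmon/hwmon*/in*_label"):
            with open(label_path) as f:
                if f.read().strip() != rail_label:
                    continue
            hwmon_dir, label_file = label_path.rsplit("/", 1)
            channel = label_file.split("_")[0][2:]  # "in1_label" -> "1"
            with open(f"{hwmon_dir}/in{channel}_input") as f:
                millivolts = float(f.read())
            with open(f"{hwmon_dir}/curr{channel}_input") as f:
                milliamps = float(f.read())
            return millivolts * milliamps / 1000.0  # mV * mA -> mW
        raise RuntimeError(f"rail {rail_label!r} not found under /sys/class/hwmon")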

4 Experimental results

In this section, the computational results obtained during the training and inference phases are described. Values are categorized based on their impact on either performance or energy efficiency. This distribution provides a clear overview of how different aspects of the models’ execution are affected by various optimizations and hardware configurations. Power consumption measurements were taken using the 50-watt power mode of the Jetson device.

In this section, the standard deviation has been omitted, as it exhibits negligible values in most cases, with the exception of throughput. Since throughput (images per second) is derived from latency (measured in milliseconds), even minimal fluctuations in the latter can cause significant variations in the former. These low levels of variability indicate that the performance of the models is consistent and stable. The findings suggest that these models maintain consistent response times and energy efficiency even under changing operational conditions, proving their reliability in diverse scenarios.

In the training phase, the measurements correspond only to the execution of the task, ignoring the time to load data into memory. In inference, however, the performance and energy consumption measurements cover sending the data to the processor, executing the inference, and retrieving the result. In the quantization phase, because model sizes differ and PTQ at INT8 requires a calibration dataset, the optimization time covers loading the initial model, the optimization itself, and writing the optimized model; for QAT training, again excluding data loading, we measure the ratio of increase in total time, since the number of training iterations is variable.

4.1 Training phase

In this section, we show the computational results obtained for the training phase. The values shown correspond to the average over multiple executions, since we use the early stopping criterion [24] in our use case. This means that each training run involves a different number of iterations, resulting in different time and energy consumption values.
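As a minimal illustration of this criterion in Keras (the monitored metric and patience value are assumptions made for the sketch):

    import tensorflow as tf

    # Training stops once the validation loss stops improving, so the number of epochs
    # actually run, and hence time and energy, differs between executions.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)

    # history = model.fit(train_ds, validation_data=val_ds,
    #                     epochs=100, callbacks=[early_stop])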

Table 3 shows the execution times of the training phase on each device and processing unit. The times obtained on the Jetson device CPU are at most 50% higher than those obtained on the DC CPU, except for the Xception network where the Jetson is faster. The shorter training time is a result of the model converging to the best solution in fewer iterations. However, it is important to note that this difference lacks statistical significance, as further tests are likely to equalize the number of epochs and potentially extend the total training time. The timings observed on the GPUs demonstrate that the Jetson requires approximately four times longer than the DC and even more for the VGG16 network.

Despite considerably longer training times on the GPU of the Jetson, with ratios consistent with the differences in their specifications, the training times still peak at around eight minutes. It is reasonable to conclude that the Jetson device is well-equipped to handle the training of the models on its own.

Table 3 Training Phase Runtime (s)

The energy consumption measurements presented in Table 4 show a contrasting result compared to the training times. The Jetson device CPU is capable of performing the training phase while consuming between 60% and 80% less energy than its DC counterpart. On the other hand, the Jetson-GPU achieves a reduction in energy consumption ranging from 75% to 88%. These values highlight that the lower Jetson performance is offset by a remarkable energy efficiency when compared to the processing units available in the DC.

Table 4 Training Phase Energy Metrics (J)

4.2 Inference phase

For the inference phase, the impact of optimization on accuracy, latency, throughput, energy consumption, and performance per watt was measured for each model and device.

Accuracy measures the model’s ability to classify and predict data correctly. Latency refers to the sum of the time needed to copy the data to the GPU, the inference time, and the time to copy the result back from the GPU. Processing capability (throughput), on the other hand, refers to the number of images the network can process within a specific time interval. Finally, the energy consumption of a single inference has been measured, as well as the number of inferences each device can execute per watt consumed.

All the measurements shown in this subsection have been obtained using the GPUs of both devices and the inference accelerator. This is because TensorRT optimizes models for a specific GPU or accelerator. Each inference is performed on a batch of 32 images rather than individual images to achieve maximum performance.

As in training, the results correspond to the average values obtained over multiple executions. For latency, throughput, and consumption measurements, each execution comprises a total of 1000 inferences, and the result is divided by the total number of executions. To ensure that each execution is performed under equal conditions, we first wait a reasonable amount of time so that one execution does not affect the next, and second, we perform a warm-up phase to bring the GPU out of its low-power state and ensure it is fully operational before taking measurements.
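The measurement loop can be summarized by the following sketch, assuming infer is the TensorRT-optimized inference function (including the data transfers described above) and batch is a batch of 32 preprocessed images; the warm-up length shown is an illustrative assumption.

    import time

    def benchmark(infer, batch, warmup=50, runs=1000):
        # Warm-up: bring the GPU out of its low-power state before measuring.
        for _ in range(warmup):
            infer(batch)
        start = time.perf_counter()
        for _ in range(runs):
            infer(batch)
        elapsed = time.perf_counter() - start
        latency_ms = elapsed / runs * 1000.0      # average per-batch latency
        throughput = runs * len(batch) / elapsed  # images processed per second
        return latency_ms, throughput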

The results in Fig. 3 show that the impact of optimization on the accuracy of the DC model is minimal, with a slight increase in the QAT-optimized models. On the Jetson, it can be observed that some models optimized with PTQ, in addition to those optimized with QAT, even improve the accuracy of the original model. From these results we can infer two things. First, model accuracy might be slightly affected by the target GPU of TensorRT. Second, when using transfer learning and drastically reducing the model’s class count, issues such as noise or overfitting may arise; these can be mitigated or minimized through the application of quantization, ultimately leading to marginal improvements in accuracy compared to the unoptimized model.

Fig. 3 Accuracy after optimization

In Fig. 4, the latency values are shown for each network, model, and device. It can be observed that the results maintain the trend observed in the training times, with MobileNet once again exhibiting the lowest latency, while the Jetson device yields latency values approximately twice those obtained on the DC.

When comparing the TensorRT-optimized models at different precision levels (bit widths of operations), a substantial reduction in latency is evident, even at the original 32-bit floating-point precision. INT8 operations achieve a reduction of around 90 percent compared to the original model. The models optimized with QAT at INT8 achieve results comparable to those optimized with PTQ at FP16 precision, with two exceptions. The model derived from the Xception network and optimized with QAT-INT8 only reaches the level of the PTQ-FP32 optimized model, performing worse than the rest of the models. The opposite scenario is observed in the VGG16 network, where the QAT-INT8 model shows a significant improvement, reaching the performance level of the PTQ-INT8 model. These findings highlight the variability in how different models respond to QAT optimization. The accelerator, in turn, significantly degrades the performance of the TensorRT-optimized models. The primary reason for this poor performance is the incompatibility of a substantial portion of the layers in the pre-trained networks: when the model is executed on the accelerator, data must constantly be sent back to the GPU, with a noticeable negative impact on performance.

Fig. 4 Inference latency

Figure 5 presents the results of the throughput capacity, which exhibits an inverse trend compared to the previously observed behavior. That is, a decrease in the precision of operations leads to a higher number of predictions per second. Undoubtedly, the MobileNet network benefits the most from the optimization process, followed by the ResNet50 network. The other models show a similar improvement factor among them. Similarly to latency, QAT at INT8 generally achieves worse values than using PTQ, except for the VGG16 network.

Fig. 5 Inference throughput

Figure 6 illustrates the acceleration values obtained from the latency data. These values represent the speed-up achieved by each optimized model, showcasing the improvement in performance compared to the original model’s execution time. In the context of latency addressed in this case, the speed-up is defined as the ratio between the latency of the original model and that of the optimized model. This relationship reflects the reduction in time and, consequently, the improvement in speed. The data reaffirm the previous conclusions and allow us to assert that MobileNet and VGG16 are the networks that best adapt to QAT. On the contrary, the Xception network achieves an excessively low performance improvement when using INT8.
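That is, for each optimized model the reported value is simply

    speed-up = latency_original / latency_optimized,

where both latencies are the averages reported in Fig. 4.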

Fig. 6 Inference latency speed-up

In Fig. 7, the energy consumption values are compared for each network, model, and device. Following a similar pattern observed in the latency analysis, both in the DC and in the Jetson, the optimization process achieves a substantial reduction in energy consumption, especially for heavier networks such as ResNet152 and VGG16. The model with the lowest energy consumption in the DC achieves values similar to those of the model with the highest energy consumption in the Jetson, highlighting the notable efficiency of the device. As expected due to their higher latency, the models optimized with QAT and those for the DLA consume more energy per inference, although they still significantly outperform the original model in terms of energy efficiency.

Fig. 7 Inference energy consumption

In conclusion, Fig. 8 illustrates the performance per watt in each of the devices. Measurements on the DC show a generalized decrease in performance per watt after the optimization process, indicating that the reduction in latency and increase in throughput come at the cost of reduced energy efficiency. On the Jetson device, on the other hand, a widespread improvement in performance per watt is observed, especially with INT8 quantization. In contrast to the previous results, QAT generally achieves better performance per watt than models optimized with PTQ in the Jetson, compensating for its lower performance with better energy efficiency, again with the exception of the Xception network.

The most remarkable aspect is the efficiency of the accelerator, which despite having worse latency and throughput, achieves significantly higher performance per watt than any of the other optimized models executed solely on the GPU, demonstrating greater energy efficiency compared to the graphics processor.

Fig. 8 Inference performance per watt

In summary, Table 5 shows the evolution of the results after optimization. Values are calculated by comparing the optimized model against the original model as follows: for accuracy, the relative error is used; for latency and energy consumption, the reduction ratio; and for throughput and performance per watt, the increase ratio. The values highlighted in bold indicate the most substantial improvements resulting from optimization, that is, the highest values within each row, outperforming their counterparts in that specific measurement. QAT notably leads to improved accuracy, while the DLA significantly outperforms in terms of performance per watt. INT8 precision achieves the most favorable values in terms of both performance and efficiency. Additionally, the DC achieves remarkable acceleration and reduced power consumption, except for the VGG16 network, where the Jetson outperforms it.
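Written out, and denoting original and optimized values with subscripts o and q respectively, the entries are computed as follows (assuming the ratios are taken in the direction described above):

    relative accuracy error                            = |accuracy_q - accuracy_o| / accuracy_o
    reduction ratio (latency, energy)                  = value_o / value_q
    increase ratio (throughput, performance per watt)  = value_q / value_o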

Assessing the distinct impacts of TensorRT optimizations and quantization in isolation presents a challenge due to their interdependent nature, relying heavily on model specifics and the target GPU or accelerator architecture. However, estimating the individual impact of the TensorRT optimizations could involve comparing the original model’s performance with that of the models optimized through PTQ at FP32 precision. This hypothetical impact aligns with the observations made by comparing the ’tf’ and ’ptq_fp32’ columns in Figs. 4, 5, 7, and 8, the acceleration of the ’ptq_fp32’ model depicted in Fig. 6, and the FP32 columns of Table 5. It is important to note that isolating these effects is complex given the limited information provided in the library’s documentation and the intertwined execution within the optimization framework.

Table 5 Summary of optimization impact by model and precision

4.3 Quantization phase

After observing the advantages obtained with optimization of the models in the previous section, this section details the overhead incurred by the optimization process.

The ’optimization time’ corresponds to the total execution time of the optimization process conducted by TensorRT. To take into account variations in the number of training iterations, we calculated the average time per iteration. This allowed us to determine the ratio of increase in training time with QAT to standard training.
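Written per iteration, with T the total training time and E the number of epochs actually run, the reported ratio is

    QAT time increase ratio = (T_QAT / E_QAT) / (T_standard / E_standard),

which is simply a transcription of the calculation described above.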

The time required to optimize the models on each device is shown in Fig. 9. It can be observed that on the DC this time does not exceed 100 s in the worst case, the Xception network, while for the rest it stays under a minute. On the Jetson device, this time increases to around 4 min, again with the Xception network yielding the worst results. Optimization times, especially when using PTQ at INT8, can sometimes exceed the duration of the training phase. However, given the improvements in performance and energy efficiency explained earlier, this cost can be considered acceptable because the optimization only needs to be executed once. Finally, the optimization times with QAT are low, as expected, since the quantization process occurs during training. It is also confirmed that this method performs worse on the ResNet and Xception networks.

Although these running times are not directly related to the performance of the optimized model, this information is useful for assessing whether the potential improvement in model performance justifies the cost involved in the optimization process. For QAT, it is essential to consider both the optimization time and the impact of performing quantization calculations during the training process.

Fig. 9 Optimization time

In Fig. 10, the increase in training time caused by QAT in the training phase is shown. Since training was already slower on the CPU, the increases are smaller, but in the worst case, this phase can take up to 3.5 times longer. On the other hand, the increase in GPU time is significant, primarily due to its short training times, resulting in a ratio increase of more than 25. It is worth noting that this method leads to a minimal increase in GPU time for the Xception and VGG16 networks. However, considering the previously observed poor performance of the Xception network with this method, the VGG16 network emerges as the optimal choice for QAT.

Fig. 10 Time increase ratio in QAT training

5 Conclusion

This study focuses on the usage and optimization of representative convolutional networks for facial recognition, primarily using quantization so that an IoT device can run them effectively. The aim is to analyze performance and energy efficiency in training and inference tasks. Measurements were obtained on the NVIDIA Jetson AGX Orin IoT device, compared to a general-purpose computer equipped with an Intel i7-1260P processor and an NVIDIA RTX 3080 graphics card.

The results obtained after optimizing the models demonstrate the feasibility of conducting training on the IoT device, as well as the potential to achieve performance similar to that of a higher-spec processor while consuming less energy thanks to the optimization process. Post-Training Quantization achieved the best performance values; however, Quantization-Aware Training achieved improvements in accuracy and energy efficiency, except for the Xception network. DLA 2.0 showed lower performance than the device’s GPU due to neural network layer incompatibilities, but excelled in energy efficiency, surpassing the other processing units. The cost of PTQ optimization has proven to be manageable compared to the benefits obtained; however, the increase in QAT training times is significant for certain networks. For future work, we plan to measure the impact on performance and efficiency of varying the input power on the Jetson device, in order to analyze the possibility of improving energy efficiency at the cost of reduced performance, considering environments with strict power requirements. Our agenda also includes analyzing how far the results obtained in this paper generalize to other network models and their feasibility on IoT devices.