1 Introduction

The edge computing paradigm is gaining momentum thanks to the advent of real-time, low-power Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) able to run tiny Machine Learning (ML) models on resource-constrained devices. The paradigm shift of moving intelligence from the cloud towards the edge provides benefits in terms of privacy, latency, and bandwidth [2, 49]. Of particular interest are applications enabled by combining edge computing with object detection and tracking tasks, such as remote sensing imagery [11, 24], video surveillance [1, 18], human–computer interaction [48, 54], and autonomous driving [7, 36], to cite a few.

Object detection and tracking are two distinct computer vision tasks. Initially, those tasks were mainly carried out on powerful cloud servers [12, 37]; subsequently, the idea was to adopt a collaborative approach between the cloud and the edge, where the two tasks are split across the edge computing architecture [6, 33]. A more recent approach envisages the design of efficient machine and deep learning solutions that are lightweight enough to be deployed on resource-constrained edge devices [13, 45]. Three well-known and widely adopted multi-object tracking-by-detection algorithms are Simple Online Realtime Tracker (SORT) [3], DeepSORT [43], and Intersection over Union (IoU) [35]. Many works in the existing literature compare the performance of the SORT and DeepSORT algorithms against other state-of-the-art trackers [20, 31, 41, 44, 51]. The same applies to the IoU algorithm [10, 34, 46, 50]. Despite this rich literature, those works do not investigate the energy efficiency of the tracking algorithms. Only a few papers, e.g., [30], consider skipping some frames for performance improvement. However, the authors consider a fixed number of frames to be skipped and do not explore the energy efficiency of their algorithm.

If detection and tracking algorithms are required to run on low-power edge devices, their power consumption requires attention because of the challenges posed by resource constraints [55, 56]. Despite the high attention paid to energy expenditure, most of the literature only focuses on the impact of the detection phase. When the tracking phase is considered, many works, e.g., [47], only consider the single object tracking case. Our work advances existing literature by proposing an adaptive frame rate strategy for the IoU multi-object tracker to provide a real-time, power-aware, energy-efficient tracking algorithm.

1.1 Our contributions

In this paper, we propose the real-time vision-based virtual sensors paradigm for energy-efficient multi-object tracking on edge devices. We first thoroughly describe our proposed system architecture, with a particular focus on the Dynamic Inference Power Manager (DIPM). We implement and deploy the virtual sensor and the DIPM to perform extensive experimental measurements to prove the effectiveness and efficiency of our proposed methodology. Specifically, we consider the Single Shot Detector (SSD) MobileNet [9] and the Train Adapt Optimize (TAO) TrafficCamNet [28] as object detectors, and the lightweight Intersection over Union (IoU) tracking algorithm [38]. Our testbed uses the NVIDIA Jetson Nano [27] as an edge device platform, and we tested it on well-known benchmarks based on the Multi-Object Tracking (MOT) challenge [25]. Results show that the proposed virtual sensor can achieve a reduction in energy consumption of about 36% in videos with relatively low dynamicity and about 21% in more dynamic video content, while keeping the tracking accuracy loss below 1.2%.

Our contributions are summarized as follows:

  • Real-time vision-based virtual sensors: a family of synthetic sensors that process data from camera sources and extract anonymous numerical information. The virtual sensor boosts privacy by consolidating data processing on the edge device without sending sensitive data to a centralized server.

  • Dynamic Inference Power Manager: enhances the virtual sensors by implementing an adaptive frame rate approach to allow energy savings while preserving tracking accuracy.

  • Deployment on the NVIDIA Jetson Nano: we highlight the advantages of our methodology compared to conventional non-power-aware edge computing approaches.

The rest of the paper is organized as follows: Sect. 2 provides related work on object detection and tracking on edge device platforms. Section 3 describes the design principles of the proposed vision-based virtual sensor and provides a detailed description of the proposed DIPM. In Sect. 4, we provide background on the performance evaluation metrics typically employed to evaluate tracking systems. Section 5 presents the experimental setup and methodology employed to assess the efficacy of our solution. Section 6 discusses the results and findings, while Sect. 7 concludes the paper.

2 Related work

This section reviews the relevant literature on object detection and tracking testbed implementations on edge hardware platforms. We also review the strategies adopted to build energy-efficient tracking systems.

Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), combined with machine and deep learning techniques, provide the opportunity to execute real-time computation-intensive tasks, such as object detection and tracking, at the edge. Many works in the literature have explored the feasibility of running the object detection task on GPUs or TPUs, often evaluating its impact in terms of detection accuracy, training and inference time, and energy efficiency [11, 14, 19, 32]. Note that these works do not investigate the impact of the object tracking task.

A pioneering work that considered the impact of the tracking task (in addition to the detection task) is the one by Casares and Velipasalar [5]. The authors proposed a lightweight algorithm that adaptively determines the smart camera idle duration to maximize energy saving. Their adaptive methodology is based on the speed of tracked objects, and detection is performed in smaller regions instead of the whole frame. Frames are dropped during the entire idle duration. Experiments were performed on a smart camera. It is worth mentioning that the authors investigated a scenario in which only one or, at most, two objects are tracked by the camera.

Zhao et al. deployed a real-time object tracking system on an Internet of Things (IoT) and edge computing testbed [53]. In their demo paper, the authors proposed splitting a You Only Look Once (YOLO) detection network (version YOLOv2tiny) between a Raspberry Pi 3B+, the IoT end device, and the NVIDIA Jetson TX2, the edge server. Their goal was to minimize the energy consumption of the IoT device while meeting the latency required by the user. However, the authors did not report any consumption measurement. Zhang et al. [52] also investigated an IoT and edge computing testbed. The authors proposed an algorithm based on the background-aware correlation filters tracker to improve the discriminative correlation filter tracking algorithm. They measured the accuracy and robustness of their proposed algorithm but did not investigate the energy consumption.

A low-power and real-time deep learning-based multiple object visual tracking system implemented on an NVIDIA Jetson TX2 was proposed by Blanco-Filgueira et al. The authors proposed the integration of a hardware-oriented pixel-based adaptive segmenter detector with the Generic Object Tracking Using Regression Networks (GOTURN) tracking algorithm. GOTURN is a convolutional neural network that leverages deep learning to perform multi-object tracking. The authors performed experimental measurements of the power consumed by the board when varying the number of tracked objects and under different Jetson TX2 operation modes. Despite the lack of a testbed, Inoue et al. also investigated real-time object tracking tailored to energy saving [15, 16]. The authors proposed an algorithm that adaptively adjusts the frame rate based on the target object’s speed. However, the algorithm was not deployed on a hardware device, and the impact of their adaptive frame rate on energy consumption was only analyzed theoretically via simulations. A more recent study explored the capability to run an entire detection and tracking system on a resource-constrained device. In [29], Paissan et al. proposed PhyNets, a backbone sub-network based on a MobileNet. They used YOLOv2 as the object detector and the Simple Online Realtime Tracker (SORT) algorithm as the tracker for localizing, classifying, detecting, and tracking objects on an STM32H743 microcontroller unit. The detector and the tracker are carefully combined to obtain the best performance on the target hardware.

Compared to the existing literature, we propose a real-time vision-based virtual sensor equipped with a dynamic inference power manager system based on an adaptive frame rate approach to allow an energy-efficient tracking system. We perform a thorough set of experiments on an NVIDIA Jetson Nano edge platform to prove the effectiveness and efficiency of our proposed methodology.

3 The vision-based virtual sensor

The proposed vision-based virtual sensor is made of five key components: (i) camera sensor, (ii) object detection artificial neural network, (iii) object tracking algorithm, (iv) metrics extractor, (v) Dynamic Inference Power Manager (DIPM).

The schema in Fig. 1 illustrates the flow of data originating from the sensor camera (blue arrows), the information used by the DIPM (orange arrows), and the signal generated by the DIPM instructing the camera to skip frames (green arrow).

Fig. 1
figure 1

Schematic representation of the pipeline of the vision-based virtual sensor

3.1 Object detector

An object detector is a pre-trained machine learning model that identifies objects of interest in a real-time video stream. In this work, we analyze the characteristics of two widely used models designed for mobile and edge devices: the Single Shot Detector (SSD) MobileNet and the Train Adapt Optimize (TAO) TrafficCamNet.

SSD MobileNet was introduced by Google in 2018 as a specialized machine learning model for mobile and embedded computer vision tasks [9]. The MobileNet v2 architecture comprises a standard fully convolutional layer with 32 filters, followed by 17 residual modules. Each module includes a 1 \(\times \) 1 convolutional layer, a 3 \(\times \) 3 depth-wise separable convolutional layer, and a ReLU6 activation function. The SSD MobileNet v2 extends the network model through the feature pyramid network technology by including the SSD classifier. This network is, therefore, a two-part model: the feature extractor provided by the MobileNet v2 network and the classifier provided by the SSD layers [8].

TrafficCamNet is based on NVIDIA’s DetectNet v2 and leverages ResNet18 as the feature extractor. This model is designed to identify objects falling into four categories: (i) cars, (ii) persons, (iii) two-wheelers, and (iv) road signs from video captured from an elevated viewpoint [28]. The literature also refers to this network architecture as GridBox object detection. Indeed, the bounding box regression technique is employed to partition the input image of size \(960\times 544\) into a grid. The final bounding box coordinates and category labels are derived by clustering algorithms, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) or Non-Maximum Suppression, which post-process the initial detections based on their confidence scores.
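As an illustration of such a post-processing step, a minimal non-maximum suppression pass can be sketched as follows; the box format and function names are our own assumptions for illustration, not NVIDIA’s implementation:

```python
def _iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box in each cluster of overlapping candidates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining candidate that overlaps the kept box too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```

The sketch returns the indices of the surviving detections; production implementations typically vectorize this loop.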

3.2 Object tracker

An object tracker is an algorithm designed to monitor the movement of one or multiple objects over time. It takes as input the output of the object detector network, i.e., a list of detected objects along with their bounding boxes. The primary objective of a tracking algorithm is to associate each object with its previous detection, thereby tracking its movement between successive frames. Moreover, it assigns a unique ID to each newly detected object and removes those no longer visible.

To meet real-time and energy constraints in devices with limited computing capacity, the complexity of the tracking algorithm must be considered. For this reason, various lightweight tracking algorithms have been proposed, such as the Intersection over Union (IOU) tracker.

The IOU tracker is designed for the continuous tracking of multiple objects. Its approach involves associating each new detection with its counterpart from the previous frame whenever their intersection over union exceeds a specified threshold, denoted as \(\sigma _{IOU}\). The IOU measure is defined as:

$$\begin{aligned} IOU(a,b)=\frac{Area(a\cap b)}{Area(a\cup b)} \end{aligned}$$
(1)

Here, a represents the current detection being considered, and b refers to an active track from the previous frames.

Any detection not associated with an existing track creates a new one, while tracks lacking assigned detections are terminated. To improve tracking accuracy, a filtering mechanism based on the parameter \(t_{min}\) eliminates tracks whose duration is shorter than this threshold. Short-duration tracks are often associated with false positives, contributing unwanted noise to the output. The IOU implementation addresses this issue by mandating that each track include at least one high-scoring detection. This procedure ensures genuine associations with objects of interest while still permitting the inclusion of low-scoring detections, maintaining overall track completeness [4].
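The association step described above can be sketched in Python as follows. This is a simplified illustration under our own assumptions (boxes as (x1, y1, x2, y2) tuples, greedy matching); the reference implementation [4] additionally handles detection scores and \(t_{min}\) filtering:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, sigma_iou=0.5):
    """Greedily extend each active track with its best-overlapping detection.

    tracks: list of tracks, each a list of boxes (last entry = latest position).
    Returns the updated tracks and the unmatched detections, which would start
    new tracks; tracks left unextended would be terminated.
    """
    unmatched = list(detections)
    for track in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(track[-1], d))
        if iou(track[-1], best) >= sigma_iou:
            track.append(best)
            unmatched.remove(best)
    return tracks, unmatched
```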

3.3 Metrics extractor

The metrics extractor algorithm takes input data from both the object detector and object tracker, including the object’s category, the unique ID assigned by the tracker, and the bounding box coordinates. This information is used to derive the following metrics:

  • The number of unique objects detected by the network.

  • The mean speed of each tracked object.

  • The mean distance traveled by each tracked object.

To mitigate the impact of large or close objects on speed and distance calculations, the means are normalized by the diagonal of each bounding box.
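The metric computation described above can be sketched as follows; the data layout (per-track lists of (x1, y1, x2, y2) boxes) and function names are our assumptions for illustration:

```python
from math import hypot

def track_metrics(track):
    """track: list of per-frame bounding boxes (x1, y1, x2, y2) of one object."""
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in track]
    diagonals = [hypot(x2 - x1, y2 - y1) for x1, y1, x2, y2 in track]
    mean_diag = sum(diagonals) / len(diagonals)
    # Center displacement between consecutive frames.
    steps = [hypot(cx2 - cx1, cy2 - cy1)
             for (cx1, cy1), (cx2, cy2) in zip(centers, centers[1:])]
    distance = sum(steps) / mean_diag                  # diagonal-normalized distance
    speed = (distance / len(steps)) if steps else 0.0  # normalized mean speed/frame
    return distance, speed

def extract_metrics(tracks):
    """tracks: dict mapping tracker IDs to per-frame box lists."""
    stats = {tid: track_metrics(t) for tid, t in tracks.items()}
    n_objects = len(tracks)
    mean_distance = sum(d for d, _ in stats.values()) / n_objects
    mean_speed = sum(s for _, s in stats.values()) / n_objects
    return n_objects, mean_distance, mean_speed
```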

These metrics serve various purposes within the virtual sensor system, such as measuring a person’s time in specific areas, tallying the number of people in a room or traversing a passage, and determining the speed and trajectory of objects in an environment.

The data collected by the sensor is post-processed and transmitted periodically to the cloud for storage and further analysis. An essential consideration is privacy protection: the data sent to the cloud contains only numerical values derived from the metrics extractor, omitting images or parts of video frames. These cumulative and anonymized data (i) ensure security and confidentiality and (ii) respect individuals’ privacy.

3.4 Dynamic inference power manager

Modern webcams, with their ability to record video streams at exceptionally high frame rates, present a computational challenge for online object-tracking systems. In the conventional approach, each frame from the video source undergoes object detection and tracking, incurring substantial energy costs.

When tracking slow-moving objects, the variation in position between consecutive frames is slight, suggesting that some frames can be skipped without losing track of the object’s position. This offers significant energy savings by selectively bypassing the execution of the detection and tracking algorithms for those frames. However, in the case of rapidly moving objects, being conservative in frame skipping is crucial to avoid losing them or tracking them as different objects. The DIPM aims to address these challenges.

The DIPM dynamically adjusts the frame rate based on the positions and speeds of the objects in the video frames. The problem of predicting the number of frames to skip is complex. At a given frame \(f_t\), the object detector creates a bounding box around each tracked object. Figure 2 displays a rectangle \(A'B'C'D'\) representing the bounding box of width \(w'\) and height \(h'\) centered around the currently tracked object at frame \(f_t\). By computing the object’s speed \((S_x, S_y)\) in pixels per frame, it becomes possible to estimate its future position after N frames, represented by the rectangle \(A''B''C''D''\).

Fig. 2
figure 2

Diagram showing the bounding box in the present frame and the bounding box predicted after N frames

When estimating the number of frames to skip, the DIPM needs to consider the operational dynamics of the tracking algorithm to avoid disruptive interference. The IOU tracker relies on the intersection over union of rectangle areas for re-identifying tracked objects. Let \(f_r\) be the resuming frame, i.e., the frame occurring N frames after \(f_t\), at which object detection and tracking will be recomputed. The following relation must be met:

$$\begin{aligned} IOU(a_{f_t},a_{f_r}) \ge \sigma _{IOU} \end{aligned}$$
(2)

where \(a_{f_t}\) and \(a_{f_r}\) are the bounding boxes of the a-th object at the current and resuming frames, respectively. Predicting the maximum number of frames that can be skipped involves finding the largest N (i.e., the latest \(f_r\)) that satisfies Eq. 2.

Assuming that the tracked objects do not change their dimensions over time, we can estimate the area of the intersection of the two boxes as:

$$\begin{aligned} I = (w'-NS_x)(h'-NS_y) \end{aligned}$$
(3)

The union of the areas of the two rectangles is:

$$\begin{aligned} U = 2w'h' - I \end{aligned}$$
(4)

Substituting these expressions into the IOU formula, we obtain:

$$\begin{aligned} IOU(a_{f_t},a_{f_r}) = \frac{(w'-NS_x)(h'-NS_y)}{2w'h' - (w'-NS_x)(h'-NS_y)} \end{aligned}$$
(5)

For each tracked object, we must find the maximum value of N that satisfies:

$$\begin{aligned} \frac{(w'-NS_x)(h'-NS_y)}{2w'h' - (w'-NS_x)(h'-NS_y)} \ge \sigma _{IOU} \end{aligned}$$
(6)

It is important to note that the value of N is derived from the speed calculated between the last two detections, assuming that the speed remains constant in the future. However, in reality, the speed of the object can change rapidly. If the object accelerates shortly afterward, there is a risk of overestimating N and losing the association with the old track. Similarly, if it decelerates, there is a risk of underestimating N and losing the opportunity to skip more frames.
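Under the constant-speed assumption, the largest admissible N can be found by evaluating Eq. 6 for increasing values of N. A minimal sketch follows; the function names are ours, and using absolute speeds (so that boxes shrink symmetrically regardless of direction) is our own simplifying assumption:

```python
def predicted_iou(w, h, sx, sy, n):
    """Predicted IOU between the current box (w x h) and the same box
    translated for n frames at speed (sx, sy), following Eq. 5."""
    inter = max(0.0, w - n * abs(sx)) * max(0.0, h - n * abs(sy))
    return inter / (2 * w * h - inter)

def max_skippable_frames(w, h, sx, sy, sigma_iou, n_max):
    """Largest N (capped at n_max) that still satisfies Eq. 6."""
    n = 0
    while n < n_max and predicted_iou(w, h, sx, sy, n + 1) >= sigma_iou:
        n += 1
    return n
```

For example, a \(10 \times 10\) box moving one pixel per frame along both axes with \(\sigma _{IOU} = 0.5\) admits only one skippable frame.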

Algorithm 1 presents the pseudo-code for the iterative algorithm serving as the foundation of the proposed DIPM.

Algorithm 1
figure a

Dynamic Inference Power Manager

For each frame, the algorithm takes as input a list of detections paired with the IDs assigned by the tracker, the maximum number of frames to skip \(n_{max}\), the IOU threshold \(\sigma _{IOU}\), and a caution factor represented by the parameter \(\alpha \) used as a speed modulator.

Initially, the number of frames to skip is set to its maximum value. Subsequently, the algorithm iterates through each detection, computing the object’s speed since the previous frame. After scaling the speed estimate by the \(\alpha \) parameter, the algorithm predicts the IOU value in each subsequent frame, starting from the next. This process continues until the predicted IOU falls below the threshold or until the number of frames to skip surpasses the defined maximum value. The number of frames to skip obtained through this method must then be reduced by one, since the loop stops at the first frame violating the condition.

Rows 10 to 14 update the value of f2s, always keeping the minimum value found so far. The maximum number of frames to skip is thus determined by the first object that fails to satisfy \(\sigma _{IOU}\). Notice that the \(\alpha \) parameter is a cautionary factor modulating the speed value: a higher value of \(\alpha \) results in a lower calculated intersection I and, therefore, in fewer skipped frames.
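The per-frame loop of Algorithm 1 can be sketched in Python as follows. The data layout (dictionaries keyed by tracker ID) and the center-based speed estimate are our own assumptions; the parameters \(n_{max}\), \(\sigma _{IOU}\), and \(\alpha \) follow the text:

```python
def dipm_frames_to_skip(prev_boxes, curr_boxes, n_max, sigma_iou, alpha):
    """prev_boxes/curr_boxes: dicts mapping tracker IDs to (x1, y1, x2, y2).
    Returns the number of frames the camera may skip (sketch of Algorithm 1)."""
    f2s = n_max
    for oid, box in curr_boxes.items():
        if oid not in prev_boxes:
            continue  # new object: no speed estimate available yet
        x1, y1, x2, y2 = box
        px1, py1, px2, py2 = prev_boxes[oid]
        w, h = x2 - x1, y2 - y1
        # Per-frame speed of the box center, scaled by the caution factor alpha.
        sx = alpha * abs((x1 + x2) - (px1 + px2)) / 2
        sy = alpha * abs((y1 + y2) - (py1 + py2)) / 2
        n = 0
        while n < f2s:
            inter = max(0.0, w - (n + 1) * sx) * max(0.0, h - (n + 1) * sy)
            if inter / (2 * w * h - inter) < sigma_iou:
                break
            n += 1
        f2s = min(f2s, n)  # the fastest object bounds the skip for all
        if f2s == 0:
            break
    return f2s
```

A higher \(\alpha \) inflates the estimated speed and hence reduces the returned skip count, matching the cautionary behavior described above.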

4 Multi-object tracker performance evaluation metrics

In this section, we provide background on the performance evaluation of multi-object tracking systems, focusing on the metrics typically employed for the purpose.

A critical issue regarding trackers’ performance evaluation is deciding which metrics are helpful to validate an algorithm against others. The issue of finding shared performance metrics and datasets to provide a common evaluation framework to compare different tracking strategies is well known [40]. To overcome this issue, Leal-Taixé et al. proposed Multiple Object Tracking (MOT), a benchmark whose purpose is to provide a common standardized framework to evaluate multiple-object tracking methods [21, 25].

The MOT challenge is an evolving large-scale benchmark tailored to provide a fair comparison between tracking methods [21]. To achieve this goal, its two key components are, first, a highly diverse and challenging set of video sequences and, second, a set of common evaluation metrics. Sequences include dozens of object classes, including pedestrians and vehicles, from static and moving cameras, with low and high image resolution, varying weather conditions, and times of the day. The most common metrics are multi-object tracking accuracy (MOTA), multi-object tracking precision (MOTP), mostly tracked trajectories, mostly lost trajectories, identity switches (IDSW), and number of track fragmentations (Frag). The MOTA measures how many distinct errors (e.g., missed targets, ghost tracks, or IDSW) are made, whereas MOTP measures how well targets are localized. The IDSW measures how many trajectories corresponding to different targets are erroneously merged into a single one. The Frag measures how many ground truth trajectories are lost for any number of frames after successfully being tracked and before tracking is resumed.
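For completeness, the MOTA score combines these per-frame error counts into a single value; it is commonly defined as

$$\begin{aligned} MOTA = 1 - \frac{\sum _t (FN_t + FP_t + IDSW_t)}{\sum _t GT_t} \end{aligned}$$

where \(FN_t\), \(FP_t\), \(IDSW_t\), and \(GT_t\) denote, respectively, the false negatives, false positives, identity switches, and ground truth objects in frame t.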

More recently, Luiten et al. discussed and proposed the higher order tracking accuracy (HOTA) metric [23]. The HOTA metric decomposes the evaluation into a family of sub-metrics that measure different error types independently of each other, thus enabling the analysis of (i) detection, (ii) association, and (iii) localization errors. It is worth mentioning that the authors evaluated the effectiveness of the HOTA metric on the MOTChallenge benchmark. The three error types occur when there is a mismatch between the ground truth and predicted track sets. In particular, a detection error arises when the tracker misses detections contained in the ground truth or predicts detections not contained in the ground truth. An association error arises when the tracker assigns different prediction IDs to two detections that should share the same ID, or assigns the same prediction ID to two detections that have different ground truth IDs. A localization error arises when there are spatial mismatches between predicted and ground truth detections.
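In its common formulation, at a given localization threshold \(\alpha \) (here the metric’s own threshold, not the DIPM caution factor), HOTA is the geometric mean of a detection accuracy score and an association accuracy score:

$$\begin{aligned} HOTA_{\alpha } = \sqrt{DetA_{\alpha } \cdot AssA_{\alpha }} \end{aligned}$$

and the final score is obtained by averaging \(HOTA_{\alpha }\) over a range of localization thresholds [23].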

Inspired by those works, in this paper we consider the following evaluation metrics: (i) HOTA, (ii) association precision (AssPr in Sect. 6), (iii) identity switches (IDSW), and (iv) track fragmentation (Frag).

5 Experimental setup

This section outlines the hardware platform, the development framework, the video benchmarks, and the experimental setup employed for measuring energy consumption.

5.1 Edge device

As a benchmark platform, we chose the Jetson Nano device introduced by NVIDIA in 2019. This single-board computer offers a compact and affordable solution for Artificial Intelligence (AI) and edge computing applications. It has an integrated NVIDIA Maxwell GPU, a quad-core ARM Cortex-A57 MPCore CPU, 4 GB of RAM, and 32 GB of storage [27]. This hardware configuration provides substantial processing power, making it well-suited for a broad spectrum of AI tasks, such as computer vision, robotics, machine learning, and IoT applications. The Jetson Nano’s energy-efficient design, GPIO pins, and diverse connectivity options make it an exceptional choice for exploring and implementing AI at the edge.

5.2 Software framework

The interaction with object detection models and trackers has been achieved via the Jetson Inference framework provided by NVIDIA.

Jetson Inference is a dedicated software platform designed to facilitate the efficient deployment of AI applications on the NVIDIA Jetson platforms. With a comprehensive range of pre-trained deep learning models, tools, and Application Programming Interfaces specifically tailored for edge computing in computer vision [26], this platform simplifies the development process. It enables developers to harness the computational power of Jetson GPUs for real-time inference tasks, including object detection, image classification, and semantic segmentation. The versatility of Jetson Inference extends seamlessly to creating intelligent and responsive applications, especially in fields such as robotics, the Internet of Things (IoT), and autonomous systems. These domains heavily depend on low latency and real-time AI capabilities, which are essential for the successful operation of applications.

5.3 Video benchmark

To evaluate the effectiveness of the proposed virtual sensor in monitoring both vehicle traffic and pedestrian flow, a set of videos was employed to emphasize key aspects of each application. The benchmark videos consist of edited clips sourced from YouTube, tailored to a more manageable length for testing purposes.

A 120-second video with a frame rate of 30 fps was used to evaluate the virtual sensor’s capability to track pedestrian movement. This footage captures pedestrian activity along a sidewalk in the city center of Budapest, Hungary [22]. Conversely, a 210-second video with a frame rate of 30 fps was employed to benchmark the virtual sensor’s performance in monitoring vehicle traffic. This video records cars traveling on Route 28 in West Dennis, Cape Cod, Massachusetts, filmed from the roadside for a comprehensive view [42]. Both videos were filmed using a stationary camera, ensuring a consistent and stable perspective for evaluation.

Fig. 3
figure 3

Log trace reporting the number of objects tracked per frame with and without DIPM. Green vertical bars identify frames skipping points

5.4 Energy measurement setup

To evaluate energy consumption, we supplied power to the edge device using an NGMO2 Rohde & Schwarz dual-channel power supply [39], which maintains a constant voltage. Concurrently, we monitored the voltage drop across a sensing resistor (\(0.06\ \Omega \)) placed in series with the power supply. During the experiment, we sampled the signal using a National Instruments NI-DAQmx PCI-6251 16-channel data acquisition board [17].
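The conversion from sampled shunt voltages to consumed energy can be sketched as follows; the \(0.06\ \Omega \) resistor value comes from the text, while the supply voltage, sampling rate, and function names are illustrative assumptions:

```python
def energy_joules(v_supply, v_sense_samples, r_sense=0.06, fs=1000.0):
    """Estimate energy from voltage-drop samples over a series sensing resistor.

    v_supply: constant supply voltage (V); v_sense_samples: sampled drops (V);
    r_sense: shunt resistance (ohm); fs: sampling rate (Hz).
    """
    dt = 1.0 / fs
    energy = 0.0
    for v_sense in v_sense_samples:
        current = v_sense / r_sense             # Ohm's law on the shunt
        power = (v_supply - v_sense) * current  # power delivered to the device
        energy += power * dt                    # rectangular integration
    return energy
```

For instance, a constant 1 A draw from a 5 V supply over one second of samples yields roughly 4.94 J delivered to the device, the remainder being dissipated in the shunt.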

6 Characterization results

This section presents the characterization results of the experiments conducted with the virtual sensor on several video benchmarks. We characterized how the DIPM affects object detection and tracking in terms of energy savings and accuracy loss, taking as baseline the results obtained by the same detection/tracking algorithms without the DIPM.

6.1 Detection and tracking behavior

To evaluate how the DIPM influences the detection and tracking behavior, we logged the total number of unique objects actively tracked in each frame of the video benchmarks. Figure 3 shows the log trace obtained when the DIPM is off (continuous blue line) and active (dashed red line) with three different DIPM parameter configurations. In particular, the graph at the top of the figure was obtained with a highly aggressive power manager (\(\alpha =1\) and \(n_{max}=6\)), the graph in the middle with intermediate values (\(\alpha =4\) and \(n_{max}=6\)), and the graph at the bottom with a strongly precautionary DIPM configuration (\(\alpha =16\) and \(n_{max}=6\)). Notice that the vertical green bars identify the frames skipped because of the DIPM. Since no information is available about the content of skipped frames, the log trace was reconstructed by repeating the data of the last frame processed before skipping.

The global picture shows that, as the aggressiveness of the DIPM increases, the virtual sensor identifies the objects in the scene with increasing delay. In fact, by eliminating some frames, each new object entering the scene starts to be tracked late. The trace also shows that some objects appearing in the scene for only a few frames (peaks of very low amplitude in the blue trace) are not tracked, because frame skipping shortens their tracks below the minimum duration threshold represented by the \(t_{min}\) parameter of the IOU tracker. On the other hand, as the DIPM becomes more conservative, the corresponding trace approximates the ground truth with ever greater precision.

From the energy point of view, the plot shows that during relatively stationary moments of the video, where few objects are present or they move slowly, the number of skipped frames is high. On the other hand, in the more dynamic phases, the DIPM avoids skipping frames to facilitate the tracker’s task.

6.2 MOT17 results

To quantitatively evaluate the DIPM influence on the tracking system, we executed the MOT17 benchmark using different values of \(\alpha \) and \(n_{max}\). Figure 4 shows four of the most representative MOT metrics calculated when varying the DIPM parameters.

Fig. 4
figure 4

Plots reporting the results obtained on the MOT17 benchmark when changing the values of \(\alpha \) and \(n_{max}\) related to the HOTA (a), association precision (b), identity switches (c), and tracks fragmentation (d)

In particular, both HOTA and AssPr, plots (a) and (b) respectively, rapidly increase for increasing values of \(\alpha \) until reaching the value recorded with the DIPM off (red dashed line). At the same time, the lower the value of \(n_{max}\), the lower the metrics gap. Interestingly, for \(\alpha = 4\), the gap for both metrics is lower than one percentage point regardless of the value of \(n_{max}\). As both HOTA and AssPr deal with the association error, their similar trend with respect to the DIPM parameters suggests that the DIPM introduces errors precisely in the association phase, increasing, for example, the identity switches.

To confirm this, Fig. 4c reports the trend of the number of identity switches (IDSW), which effectively shows higher values for parameters that make the DIPM more aggressive (i.e., forcing skipping more frames). Consequently, the fragmentation of the tracks is also more significant when a greater number of frames are skipped, as reported in Fig. 4d.

Ultimately, the global trend of the MOT17 tests highlights that the DIPM, if appropriately calibrated, does not introduce an excessive distortion in the dynamics of object tracking; indeed, even in the worst cases, the decrease in the metrics remains within 5%. Moreover, for a medium precautionary configuration, the reduction of some key MOT metrics is lower than one percentage point.

6.3 Impact of the DIPM on the metrics extraction

Fig. 5
figure 5

Plots showing the error introduced by the DIPM when changing the values of \(\alpha \) and \(n_{max}\) in the calculation of the number of unique objects (top), normalized distance (middle), and normalized speed (bottom)

To estimate how the DIPM impacts the accuracy of the virtual sensor, we conducted a set of experiments to extract several relevant metrics regarding the objects moving in the video. In particular, we calculated the total number of unique objects and the average normalized speed and distance from the two video benchmarks, both with and without the DIPM. For each metric, we calculated the percentage error with respect to the value obtained when the DIPM was disabled (ground truth). In Fig. 5, the charts show the percentage error introduced in the number of unique objects (top), in the normalized distance (middle), and in the normalized speed (bottom) when varying the values of \(\alpha \) and \(n_{max}\) in the benchmark containing cars.

In all three cases, the error appears more significant for values of \(\alpha \) close to 1, whereas it decreases as \(\alpha \) increases. At the same time, the highest error values are obtained for the highest values of \(n_{max}\). This behavior occurs because the lower the value of \(\alpha \), the higher the number of skipped frames, resulting in a higher probability of losing tracks and switching objects. Consequently, the estimated number of unique objects is spuriously inflated. Likewise, if the probability of losing a track rises, track lengths decrease, leading to an underestimation of the traveled distance. Finally, the average object speed measurement also suffers from a larger number of skipped frames, probably due to the consequent increase in object switches.

Regarding the impact of the \(n_{max}\) parameter, which sets a strict cap on the number of skippable frames, the plots show that, on average, the error tends to increase for higher values. Since this parameter essentially determines the maximum time during which objects remain untracked, its correlation with the metric-estimation error is expected.
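The role of \(n_{max}\) as a hard cap can be sketched with a simple clamp; here `predicted_skip` is a stand-in for whatever number of skippable frames the DIPM prediction step proposes (the function name and interface are illustrative, not the paper's actual implementation):

```python
def frames_to_skip(predicted_skip: int, n_max: int) -> int:
    """Clamp the number of frames the DIPM is allowed to skip.
    n_max bounds the maximum time during which objects go untracked,
    regardless of how many frames the predictor proposes to skip."""
    return max(0, min(predicted_skip, n_max))
```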

In general, even if the influence of the DIPM on the correctness of the extracted metrics is evident, by appropriately sizing the values of \(\alpha \) and \(n_{max}\), it is possible to limit the errors to within a few percentage points. For instance, with \(\alpha = 5\), the error on the extracted metrics never exceeds five percent for any value of \(n_{max}\).

6.4 Energy saving

During testing, we measured the power consumption of the NVIDIA Jetson device using the measurement setup described in Sect. 5.

Fig. 6

Power traces collected during object tracking with DIPM on (top trace) and off (bottom)

Figure 6 shows two power traces collected during object tracking with DIPM enabled (top) and without it, i.e., with the traditional IoU tracking methodology (bottom). Notice that each burst of power consumption, reaching about 5 Watts, corresponds to the inference and tracking of a single frame. The energy saving from each skipped frame is evident from the comparison between the two traces. Indeed, because the GPU is not used (the inference phase is not executed), the power consumption drops below 2 Watts for each skipped frame, a net gain of over 3 Watts. Notice that the residual power consumption accounts for the normal CPU execution, which, in these tests, continues to decode the input frames. A strategy to obtain further energy savings is to turn off image capture by the webcam driver by interacting directly with the operating system while frames are skipped. In this work, however, we avoid implementing operating-system-dependent strategies to demonstrate the general validity of the proposed approach.
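Using the indicative power levels above (about 5 W while processing a frame, below 2 W while skipping), the expected relative energy saving as a function of the fraction of skipped frames can be estimated with a back-of-the-envelope model; the power constants here are illustrative round numbers, not measured values, and frame durations are assumed constant:

```python
def energy_saving_pct(skip_fraction: float,
                      p_active: float = 5.0,
                      p_skip: float = 2.0) -> float:
    """Estimated percentage of energy saved versus processing every
    frame, assuming constant per-frame duration and two power levels:
    p_active while running inference+tracking, p_skip while skipping.
    Power values are illustrative, not the measured traces."""
    avg_power = (1 - skip_fraction) * p_active + skip_fraction * p_skip
    return (1 - avg_power / p_active) * 100.0

# e.g., skipping 60% of frames: average power 0.4*5 + 0.6*2 = 3.2 W,
# i.e., a 36% saving with respect to always-on inference.
saving = energy_saving_pct(0.6)
```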

Fig. 7

Plots showing the energy saving when changing the values of \(\alpha \) and \(n_{max}\) for the video benchmark with cars (a) and people (b)

The energy saving achieved with our DIPM with respect to the traditional IoU approach is further explored in Fig. 7, which reports the percentage of energy saved for different values of the two DIPM parameters (\(\alpha \) and \(n_{max}\)) when tracking cars (a) and people (b). As expected, the lower the precautionary parameter \(\alpha \), the more frames are skipped, resulting in higher energy saving. Increasing \(n_{max}\) further increases the number of skipped frames, which leads to a further energy reduction. Comparing the two video benchmarks, it is clear that the amount of energy saved also depends on the content of the video. Indeed, the video containing people (b), for which the smallest saving is obtained, is more dynamic than the video of the cars (a), and it shows very few frames in which no one appears in the scene.

Finally, to evaluate the trade-off between energy consumption and accuracy, we plotted the Pareto charts showing the energy expenditure ratio (i.e., the ratio between the energy consumed with and without the DIPM) against the percentage error measured on the normalized speed (Fig. 8). A Pareto chart helps visualize the relationship between different cost metrics and provides a visual tool to identify the best trade-off points. For instance, in Fig. 8, the best points, highlighted with red circles, are those that simultaneously minimize the energy expenditure ratio and the error in the speed calculation. On the benchmark containing cars (a), the best trade-off occurs with \(n_{max}\) equal to 10, while for people (b), with \(n_{max}\) equal to 2. In both cases, the corresponding value of \(\alpha \) was 2. At these points, for an accuracy error just above 1%, the proposed virtual sensor saves up to 36% and 21% of energy in the cars and people benchmarks, respectively. The energy saving is much more significant in videos with static scenes and a sporadic presence of objects. However, even in the case of dynamic scenes with many objects, non-negligible energy savings are achieved.
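The Pareto-optimal configurations described above are those not dominated on both axes (energy expenditure ratio and speed error, both to be minimized). A minimal sketch of this selection, on hypothetical (energy ratio, speed error %) pairs, could look as follows:

```python
def pareto_front(points):
    """Return the points not dominated by any other point, where a
    point q dominates p if q is no worse than p on both coordinates
    (here: energy expenditure ratio and speed error %, both minimized)
    and q differs from p."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

# Hypothetical (energy_ratio, speed_error_%) configurations:
pts = [(0.64, 1.1), (0.80, 0.6), (0.55, 3.0), (0.70, 1.5)]
front = pareto_front(pts)  # (0.70, 1.5) is dominated by (0.64, 1.1)
```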

Fig. 8

Pareto charts showing the energy expenditure rate versus the error on the normalized object speed changing the value of \(n_{max}\) for the video benchmark with cars (a) and people (b)

7 Conclusions

The convergence of edge computing devices with GPUs, TPUs, and machine learning has revolutionized computer vision and real-time tracking, enabling tasks to be executed at the edge for minimized latency and enhanced privacy. This research focuses on the design and characterization of real-time vision-based virtual sensors within distributed systems, placing a strong emphasis on energy efficiency. These virtual sensors, which synthesize camera data into anonymous numerical information, report quantities such as the number of people in a given area or their average permanence time.

Through empirical experiments on a real hardware platform, the study’s findings reveal that the proposed virtual sensor can achieve a reduction in energy consumption in the range of 21% to 36%, accompanied by a decrease of less than 1.2% in tracking accuracy. Extensive tests on the MOT17 benchmarks show that the DIPM integrated into the virtual sensor, if properly calibrated, has a limited impact on object tracking. Indeed, for a medium precautionary configuration, the reduction of performance metrics is never above five percent.