1 Introduction

The edge computing paradigm is gaining momentum thanks to the advent of real-time, low-power Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) able to run tiny Machine Learning (ML) models on resource-constrained devices. The paradigm shift of moving intelligence from the cloud towards the edge provides benefits in terms of privacy, latency, and bandwidth [2, 49]. Of particular interest are applications enabled by combining edge computing with object detection and tracking tasks, such as remote sensing imagery [11, 24], video surveillance [1, 18], human–computer interaction [48, 54], and autonomous driving [7, 36], to cite a few.

Object detection and tracking are two distinct computer vision tasks. Initially, those tasks were mainly carried out on powerful cloud servers [12, 37]; subsequently, the idea was to adopt a collaborative approach between the cloud and the edge, where the two tasks are split across the edge computing architecture [6, 33]. A more recent approach envisages the design of efficient machine and deep learning solutions that are lightweight enough to be deployed on resource-constrained edge devices [13, 45]. Three well-known and widely adopted multi-object tracking-by-detection algorithms are Simple Online Realtime Tracker (SORT) [3], DeepSORT [43], and Intersection over Union (IoU) [35]. Many works in the existing literature compare the performance of the SORT and DeepSORT algorithms against other state-of-the-art trackers [20, 31, 41, 44, 51]. The same applies to the IoU algorithm [10, 34, 46, 50]. Despite this rich literature, those works do not investigate the energy efficiency of the tracking algorithms. Only a few papers, e.g., [30], consider skipping some frames for performance improvement. However, the authors consider a fixed number of frames to be skipped and do not explore the energy efficiency of their algorithm.

If detection and tracking algorithms are required to run on low-power edge devices, their power consumption requires attention because of the challenges posed by resource constraints [55, 56]. Despite the high attention paid to energy expenditure, most of the literature only focuses on the impact of the detection phase. When the tracking phase is considered, many works, e.g., [47], only consider the single object tracking case. Our work advances existing literature by proposing an adaptive frame rate strategy for the IoU multi-object tracker to provide a real-time, power-aware, energy-efficient tracking algorithm.

1.1 Our contributions

In this paper, we propose the real-time vision-based virtual sensors paradigm for energy-efficient multi-object tracking on edge devices. We first thoroughly describe our proposed system architecture, with a particular focus on the Dynamic Inference Power Manager (DIPM). We implement and deploy the virtual sensor and the DIPM to perform extensive experimental measurements to prove the effectiveness and efficiency of our proposed methodology. Specifically, we consider the Single Shot Detector (SSD) MobileNet [9] and the Train Adapt Optimize (TAO) TrafficCamNet [28] as object detectors, and the lightweight Intersection over Union (IoU) tracking algorithm [38]. Our testbed uses the NVIDIA Jetson Nano [27] as an edge device platform, and we tested it on well-known benchmarks based on the Multi-Object Tracking (MOT) challenge [25]. Results show that the proposed virtual sensor can achieve a reduction in energy consumption of about 36% in videos with relatively low dynamicity and about 21% in more dynamic video content, while keeping the tracking accuracy loss below 1.2%.

Our contributions are summarized as follows:

  • Real-time vision-based virtual sensors: a family of synthetic sensors that process data from camera sources and extract anonymous numerical information. The virtual sensor boosts privacy by consolidating data processing on the edge device without sending sensitive data to a centralized server.

  • Dynamic Inference Power Manager: enhances the virtual sensors by implementing an adaptive frame rate approach to allow energy savings while preserving tracking accuracy.

  • Deployment on the NVIDIA Jetson Nano: we highlight the advantages of our methodology compared to conventional non-power-aware edge computing approaches.

The rest of the paper is organized as follows: Sect. 2 provides related work on object detection and tracking on edge device platforms. Section 3 describes the design principles of the proposed vision-based virtual sensor and provides a detailed description of the proposed DIPM. In Sect. 4, we provide background on the performance evaluation metrics typically employed to evaluate tracking systems. Section 5 presents the experimental setup and methodology employed to assess the efficacy of our solution. Section 6 discusses the results and findings, while Sect. 7 concludes the paper.

2 Related work

This section reviews the relevant literature on object detection and tracking testbed implementations on edge hardware platforms. We also review the strategies adopted to build energy-efficient tracking systems.

Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), combined with machine and deep learning techniques, provide the opportunity to execute real-time computation-intensive tasks, such as object detection and tracking, at the edge. Many works in the literature have explored the feasibility of running the object detection task on GPUs or TPUs, often evaluating its impact in terms of detection accuracy, training and inference time, and energy efficiency [11, 14, 19, 32]. Note that these works do not investigate the impact of the object tracking task.

A pioneering work that considered the impact of the tracking task (in addition to the detection task) is the one by Casares and Velipasalar [5]. The authors proposed a lightweight algorithm that adaptively determines the smart camera idle duration to maximize energy saving. Their adaptive methodology is based on the speed of tracked objects, and detection is performed in smaller regions instead of the whole frame. Frames are dropped during the entire idle duration. Experiments were performed on a smart camera. It is worth mentioning that the authors investigated a scenario in which only one or, at most, two objects are tracked by the camera.

Zhao et al. deployed a real-time object tracking system on an Internet of Things (IoT) and edge computing testbed [53]. In their demo paper, the authors proposed splitting a You Only Look Once (YOLO) detection network (version YOLOv2tiny) between a Raspberry Pi 3B+, the IoT end device, and the NVIDIA Jetson TX2, the edge server. Their goal was to minimize the energy consumption of the IoT device while meeting the latency required by the user. However, the authors did not report any consumption measurement. Zhang et al. [52] also investigated an IoT and edge computing testbed. The authors proposed an algorithm based on the background-aware correlation filters tracker to improve the discriminative correlation filter tracking algorithm. They measured the accuracy and robustness of their proposed algorithm but did not investigate the energy consumption.

A low-power and real-time deep learning-based multiple object visual tracking system implemented on an NVIDIA Jetson TX2 was proposed by Blanco-Filgueira et al. The authors proposed the integration of a hardware-oriented pixel-based adaptive segmenter detector with the Generic Object Tracking Using Regression Networks (GOTURN) tracking algorithm. GOTURN is a convolutional neural network that leverages deep learning to perform multi-object tracking. The authors performed experimental measurements of the power consumed by the board when varying the number of tracked objects and under different Jetson TX2 operation modes. Despite the lack of a testbed, Inoue et al. also investigated real-time object tracking tailored to energy saving [15, 16]. The authors proposed an algorithm that adaptively adjusts the frame rate based on the target object’s speed. However, the algorithm was not deployed on a hardware device, and the impact of their adaptive frame rate on energy consumption was only analyzed theoretically via simulations. A more recent study explored the capability to run an entire detection and tracking system on a resource-constrained device. In [29], Paissan et al. proposed PhyNets, a backbone sub-network based on a MobileNet. They used YOLOv2 as the object detector and the Simple Online Realtime Tracker (SORT) algorithm as the tracker for localizing, classifying, detecting, and tracking objects on an STM32H743 microcontroller unit. The detector and the tracker are carefully combined to obtain the best performance on the target hardware.

Compared to the existing literature, we propose a real-time vision-based virtual sensor equipped with a dynamic inference power manager system based on an adaptive frame rate approach to allow an energy-efficient tracking system. We perform a thorough set of experiments on an NVIDIA Jetson Nano edge platform to prove the effectiveness and efficiency of our proposed methodology.

3 The vision-based virtual sensor

The proposed vision-based virtual sensor is made of five key components: (i) camera sensor, (ii) object detection artificial neural network, (iii) object tracking algorithm, (iv) metrics extractor, (v) Dynamic Inference Power Manager (DIPM).

The schema in Fig. 1 illustrates the flow of data originating from the sensor camera (blue arrows), the information used by the DIPM (orange arrows), and the signal generated by the DIPM instructing the camera to skip frames (green arrow).

Fig. 1
figure 1

Schematic representation of the pipeline of the vision-based virtual sensor

3.1 Object detector

An object detector is a pre-trained machine learning model that identifies objects of interest in a real-time video stream. In this work, we analyze the characteristics of two widely used models designed for mobile and edge devices: the Single Shot Detector (SSD) MobileNet and the Train Adapt Optimize (TAO) TrafficCamNet.

SSD MobileNet was introduced by Google in 2018 as a specialized machine learning model for mobile and embedded computer vision tasks [9]. The MobileNet v2 architecture comprises a standard fully convolutional layer with 32 filters, followed by 17 residual modules. Each module includes a 1 \(\times \) 1 convolutional layer, a 3 \(\times \) 3 depth-wise separable convolutional layer, and a ReLU6 activation function. The SSD MobileNet v2 extends the network model through the feature pyramid network technology by including the SSD classifier. This network is, therefore, a two-part model: the feature extractor provided by the MobileNet v2 network and the classifier provided by the SSD layers [8].

TrafficCamNet is based on NVIDIA’s DetectNet v2 and leverages ResNet18 as the feature extractor. This model is designed to identify objects falling into four categories: (i) cars, (ii) persons, (iii) two-wheelers, and (iv) road signs from video captured from an elevated viewpoint [28]. The literature also refers to this network architecture as GridBox object detection. Indeed, the bounding box regression technique is employed to partition the input image of size \(960\times 544\) into a grid. The final bounding box coordinates and category labels are derived by clustering algorithms, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) or Non-Maximum Suppression, which post-process the initial detections based on their confidence scores.
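As an illustration of such a post-processing step, a minimal non-maximum suppression pass can be sketched as follows; the box format and function names are our own assumptions for illustration, not NVIDIA’s implementation:

```python
def _iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box in each cluster of overlapping candidates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining candidate that overlaps the kept box too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```

The sketch returns the indices of the surviving detections; production implementations typically vectorize this loop.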

3.2 Object tracker

An object tracker is an algorithm designed to monitor the movement of one or multiple objects over time. It takes as input the output of the object detector network, i.e., a list of detected objects along with their bounding boxes. The primary objective of a tracking algorithm is to associate each object with its previous detection, thereby tracking its movement between successive frames. Moreover, it assigns a unique ID to each newly detected object and removes those no longer visible.

To meet real-time and energy constraints in devices with limited computing capacity, the complexity of the tracking algorithm must be considered. For this reason, various lightweight tracking algorithms have been proposed, such as the Intersection over Union (IOU) tracker.

The IOU tracker is designed for the continuous tracking of multiple objects. Its approach involves associating each new detection with its counterpart from the previous frame whenever their intersection over union exceeds a specified threshold, denoted as \(\sigma _{IOU}\). The IOU measure is defined as:

$$\begin{aligned} IOU(a,b)=\frac{Area(a\cap b)}{Area(a\cup b)} \end{aligned}$$
(1)

Here, a represents the current detection being considered, and b refers to an active track from the previous frames.

Any detection not associated with an existing track creates a new one, while tracks lacking assigned detections are terminated. To improve tracking accuracy, a filtering mechanism based on the parameter \(t_{min}\) eliminates tracks whose duration is shorter than this threshold. Short-duration tracks are often associated with false positives, contributing unwanted noise to the output. The IOU implementation addresses this issue by mandating that each track include at least one high-scoring detection. This procedure ensures genuine associations with objects of interest while still permitting the inclusion of low-scoring detections, maintaining overall track completeness [4].
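The association step described above can be sketched in Python as follows. This is a simplified illustration under our own assumptions (boxes as (x1, y1, x2, y2) tuples, greedy matching); the reference implementation [4] additionally handles detection scores and \(t_{min}\) filtering:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, sigma_iou=0.5):
    """Greedily extend each active track with its best-overlapping detection.

    tracks: list of tracks, each a list of boxes (last entry = latest position).
    Returns the updated tracks and the unmatched detections, which would start
    new tracks; tracks left unextended would be terminated.
    """
    unmatched = list(detections)
    for track in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(track[-1], d))
        if iou(track[-1], best) >= sigma_iou:
            track.append(best)
            unmatched.remove(best)
    return tracks, unmatched
```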

3.3 Metrics extractor

The metrics extractor algorithm takes input data from both the object detector and object tracker, including the object’s category, the unique ID assigned by the tracker, and the bounding box coordinates. This information is used to derive the following metrics:

  • The number of unique objects detected by the network.

  • The mean speed of each tracked object.

  • The mean distance traveled by each tracked object.

To mitigate the impact of large or close objects on speed and distance calculations, the means are normalized by the diagonal of each bounding box.
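The metric computation described above can be sketched as follows; the data layout (per-track lists of (x1, y1, x2, y2) boxes) and function names are our assumptions for illustration:

```python
from math import hypot

def track_metrics(track):
    """track: list of per-frame bounding boxes (x1, y1, x2, y2) of one object."""
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in track]
    diagonals = [hypot(x2 - x1, y2 - y1) for x1, y1, x2, y2 in track]
    mean_diag = sum(diagonals) / len(diagonals)
    # Center displacement between consecutive frames.
    steps = [hypot(cx2 - cx1, cy2 - cy1)
             for (cx1, cy1), (cx2, cy2) in zip(centers, centers[1:])]
    distance = sum(steps) / mean_diag                  # diagonal-normalized distance
    speed = (distance / len(steps)) if steps else 0.0  # normalized mean speed/frame
    return distance, speed

def extract_metrics(tracks):
    """tracks: dict mapping tracker IDs to per-frame box lists."""
    stats = {tid: track_metrics(t) for tid, t in tracks.items()}
    n_objects = len(tracks)
    mean_distance = sum(d for d, _ in stats.values()) / n_objects
    mean_speed = sum(s for _, s in stats.values()) / n_objects
    return n_objects, mean_distance, mean_speed
```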

These metrics serve various purposes within the virtual sensor system, such as measuring a person’s time in specific areas, tallying the number of people in a room or traversing a passage, and determining the speed and trajectory of objects in an environment.

The data collected by the sensor is post-processed and transmitted periodically to the cloud for storage and further analysis. An essential consideration is privacy protection: the data sent to the cloud contains only numerical values derived from the metrics extractor, omitting images or parts of video frames. These cumulative and anonymized data (i) ensure security and confidentiality and (ii) respect individuals’ privacy.

3.4 Dynamic inference power manager

Modern webcams, with their ability to record video streams at exceptionally high frame rates, present a computational challenge for online object-tracking systems. In the conventional approach, each frame from the video source undergoes object detection and tracking, incurring substantial energy costs.

When tracking slow-moving objects, the variation in position between consecutive frames is slight, suggesting that some frames can be skipped without losing track of the object’s position. This offers significant energy savings by selectively bypassing the execution of the detection and tracking algorithms for those frames. However, in the case of rapidly moving objects, being conservative in frame skipping is crucial to avoid losing them or tracking them as different objects. The DIPM aims to address these challenges.

The DIPM dynamically adjusts the frame rate based on the positions and speeds of the objects in the video frames. The problem of predicting the number of frames to skip is complex. At a given frame \(f_t\), the object detector creates a bounding box around each tracked object. Figure 2 displays a rectangle \(A'B'C'D'\) representing the bounding box of width \(w'\) and height \(h'\) centered around the currently tracked object at frame \(f_t\). By computing the object’s speed \((S_x, S_y)\) in pixels per frame, it becomes possible to estimate its future position after N frames, represented by the rectangle \(A''B''C''D''\).

Fig. 2
figure 2

Diagram showing the bounding box in the present frame and the bounding box predicted after N frames

When estimating the number of frames to skip, the DIPM needs to consider the operational dynamics of the tracking algorithm to avoid disruptive interference. The IOU tracker relies on the intersection over union of rectangle areas for re-identifying tracked objects. Let \(f_r\) be the resuming frame, i.e., the frame occurring N frames after \(f_t\), at which object detection and tracking will be recomputed. The following relation must be met:

$$\begin{aligned} IOU(a_{f_t},a_{f_r}) \ge \sigma _{IOU} \end{aligned}$$
(2)

where \(a_{f_t}\) and \(a_{f_r}\) are the bounding boxes of the a-th object at the current and resuming frames, respectively. Predicting the maximum number of frames that can be skipped involves finding the largest N (i.e., the latest \(f_r\)) that satisfies Eq. 2.

Assuming that the tracked objects do not change their dimensions over time, we can estimate the area of the intersection of the two boxes as:

$$\begin{aligned} I = (w'-NS_x)(h'-NS_y) \end{aligned}$$
(3)

The union of the areas of the two rectangles is:

$$\begin{aligned} U = 2w'h' - I \end{aligned}$$
(4)

Substituting these expressions into the IOU formula, we obtain:

$$\begin{aligned} IOU(a_{f_t},a_{f_r}) = \frac{(w'-NS_x)(h'-NS_y)}{2w'h' - (w'-NS_x)(h'-NS_y)} \end{aligned}$$
(5)

For each tracked object, we must find the maximum value of N that satisfies:

$$\begin{aligned} \frac{(w'-NS_x)(h'-NS_y)}{2w'h' - (w'-NS_x)(h'-NS_y)} \ge \sigma _{IOU} \end{aligned}$$
(6)

It is important to note that the value of N is derived from the speed calculated between the last two detections, assuming that the speed remains constant in the future. However, in reality, the speed of the object can change rapidly. If the object accelerates shortly afterward, there is a risk of overestimating N and losing the association with the old track. Similarly, if it decelerates, there is a risk of underestimating N and losing the opportunity to skip more frames.
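Under the constant-speed assumption, the largest admissible N can be found by evaluating Eq. 6 for increasing values of N. A minimal sketch follows; the function names are ours, and using absolute speeds (so that boxes shrink symmetrically regardless of direction) is our own simplifying assumption:

```python
def predicted_iou(w, h, sx, sy, n):
    """Predicted IOU between the current box (w x h) and the same box
    translated for n frames at speed (sx, sy), following Eq. 5."""
    inter = max(0.0, w - n * abs(sx)) * max(0.0, h - n * abs(sy))
    return inter / (2 * w * h - inter)

def max_skippable_frames(w, h, sx, sy, sigma_iou, n_max):
    """Largest N (capped at n_max) that still satisfies Eq. 6."""
    n = 0
    while n < n_max and predicted_iou(w, h, sx, sy, n + 1) >= sigma_iou:
        n += 1
    return n
```

For example, a \(10 \times 10\) box moving one pixel per frame along both axes with \(\sigma _{IOU} = 0.5\) admits only one skippable frame.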

Algorithm 1 presents the pseudo-code for the iterative algorithm serving as the foundation of the proposed DIPM.

Algorithm 1
figure a

Dynamic Inference Power Manager

For each frame, the algorithm takes as input a list of detections paired with the IDs assigned by the tracker, the maximum number of frames to skip \(n_{max}\), the IOU threshold \(\sigma _{IOU}\), and a caution factor represented by the parameter \(\alpha \) used as a speed modulator.

Initially, the number of frames to skip is set to its maximum value. Subsequently, the algorithm iterates through each detection, computing the object’s speed since the previous frame. After scaling the speed estimate by the \(\alpha \) parameter, the algorithm predicts the IOU value in each subsequent frame, starting from the next. This process continues until the predicted IOU falls below the threshold or until the number of frames to skip surpasses the defined maximum value. The number of frames to skip obtained through this method must then be reduced by one, since the loop stops at the first frame violating the condition.

Rows 10 to 14 update the value of f2s, always keeping the minimum value found so far. The maximum number of frames to skip is thus determined by the first object that fails to satisfy \(\sigma _{IOU}\). Notice that the \(\alpha \) parameter is a cautionary factor modulating the speed value: a higher value of \(\alpha \) results in a lower calculated intersection I and, therefore, in fewer skipped frames.
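The per-frame loop of Algorithm 1 can be sketched in Python as follows. The data layout (dictionaries keyed by tracker ID) and the center-based speed estimate are our own assumptions; the parameters \(n_{max}\), \(\sigma _{IOU}\), and \(\alpha \) follow the text:

```python
def dipm_frames_to_skip(prev_boxes, curr_boxes, n_max, sigma_iou, alpha):
    """prev_boxes/curr_boxes: dicts mapping tracker IDs to (x1, y1, x2, y2).
    Returns the number of frames the camera may skip (sketch of Algorithm 1)."""
    f2s = n_max
    for oid, box in curr_boxes.items():
        if oid not in prev_boxes:
            continue  # new object: no speed estimate available yet
        x1, y1, x2, y2 = box
        px1, py1, px2, py2 = prev_boxes[oid]
        w, h = x2 - x1, y2 - y1
        # Per-frame speed of the box center, scaled by the caution factor alpha.
        sx = alpha * abs((x1 + x2) - (px1 + px2)) / 2
        sy = alpha * abs((y1 + y2) - (py1 + py2)) / 2
        n = 0
        while n < f2s:
            inter = max(0.0, w - (n + 1) * sx) * max(0.0, h - (n + 1) * sy)
            if inter / (2 * w * h - inter) < sigma_iou:
                break
            n += 1
        f2s = min(f2s, n)  # the fastest object bounds the skip for all
        if f2s == 0:
            break
    return f2s
```

A higher \(\alpha \) inflates the estimated speed and hence reduces the returned skip count, matching the cautionary behavior described above.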

4 Multi-object tracker performance evaluation metrics

In this section, we provide background on the performance evaluation of multi-object tracking systems, focusing on the metrics typically employed for the purpose.

A critical issue regarding trackers’ performance evaluation is deciding which metrics are helpful to validate an algorithm against others. The issue of finding shared performance metrics and datasets to provide a common evaluation framework to compare different tracking strategies is well known [40]. To overcome this issue, Leal-Taixé et al. proposed Multiple Object Tracking (MOT), a benchmark whose purpose is to provide a common standardized framework to evaluate multiple-object tracking methods [21, 25].

The MOT challenge is an evolving large-scale benchmark tailored to provide a fair comparison between tracking methods [21]. To achieve this goal, its two key components are, first, a highly diverse and challenging set of video sequences and, second, a set of common evaluation metrics. Sequences include dozens of object classes, including pedestrians and vehicles, from static and moving cameras, with low and high image resolution, varying weather conditions, and times of the day. The most common metrics are multi-object tracking accuracy (MOTA), multi-object tracking precision (MOTP), mostly tracked trajectories, mostly lost trajectories, identity switches (IDSW), and number of track fragmentations (Frag). The MOTA measures how many distinct errors (e.g., missed targets, ghost tracks, or IDSW) are made, whereas MOTP measures how well targets are localized. The IDSW measures how many trajectories corresponding to different targets are erroneously merged into a single one. The Frag measures how many ground truth trajectories are lost for any number of frames after successfully being tracked and before tracking is resumed.
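For completeness, the MOTA score combines these per-frame error counts into a single value; it is commonly defined as

$$\begin{aligned} MOTA = 1 - \frac{\sum _t (FN_t + FP_t + IDSW_t)}{\sum _t GT_t} \end{aligned}$$

where \(FN_t\), \(FP_t\), \(IDSW_t\), and \(GT_t\) denote, respectively, the false negatives, false positives, identity switches, and ground truth objects in frame t.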

More recently, Luiten et al. discussed and proposed the higher order tracking accuracy (HOTA) metric [23]. The HOTA metric decomposes the evaluation into a family of sub-metrics that measure different error types independently of each other, thus enabling the analysis of (i) detection, (ii) association, and (iii) localization errors. It is worth mentioning that the authors evaluated the effectiveness of the HOTA metric on the MOTChallenge benchmark. The three error types occur when there is a mismatch between the ground truth and predicted track sets. In particular, a detection error arises when the tracker misses detections contained in the ground truth or predicts detections not contained in the ground truth. An association error arises when the tracker assigns different prediction IDs to two detections that should share the same ID, or assigns the same prediction ID to two detections that have different ground truth IDs. A localization error arises when there are spatial mismatches between predicted and ground truth detections.
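In its common formulation, at a given localization threshold \(\alpha \) (here the metric’s own threshold, not the DIPM caution factor), HOTA is the geometric mean of a detection accuracy score and an association accuracy score:

$$\begin{aligned} HOTA_{\alpha } = \sqrt{DetA_{\alpha } \cdot AssA_{\alpha }} \end{aligned}$$

and the final score is obtained by averaging \(HOTA_{\alpha }\) over a range of localization thresholds [23].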

Inspired by those works, in this paper we consider the following evaluation metrics: (i) HOTA, (ii) association precision (AssPr in Sect. 6), (iii) identity switches (IDSW), and (iv) track fragmentation (Frag).

5 Experimental setup

This section outlines the hardware platform, the development framework, the video benchmarks, and the experimental setup employed for measuring energy consumption.

5.1 Edge device

As a benchmark platform, we chose the Jetson Nano device introduced by NVIDIA in 2019. This single-board computer offers a compact and affordable solution for Artificial Intelligence (AI) and edge computing applications. It has an integrated NVIDIA Maxwell GPU, a quad-core ARM Cortex-A57 MPCore CPU, 4 GB of RAM, and 32 GB of storage [27]. This hardware configuration provides substantial processing power, making it well-suited for a broad spectrum of AI tasks, such as computer vision, robotics, machine learning, and IoT applications. The Jetson Nano’s energy-efficient design, GPIO pins, and diverse connectivity options make it an exceptional choice for exploring and implementing AI at the edge.

5.2 Software framework

The interaction with object detection models and trackers has been achieved via the Jetson Inference framework provided by NVIDIA.

Jetson Inference is a dedicated software platform designed to facilitate the efficient deployment of AI applications on the NVIDIA Jetson platforms. With a comprehensive range of pre-trained deep learning models, tools, and Application Programming Interfaces specifically tailored for edge computing in computer vision [26], this platform simplifies the development process. It enables developers to harness the computational power of Jetson GPUs for real-time inference tasks, including object detection, image classification, and semantic segmentation. The versatility of Jetson Inference extends seamlessly to creating intelligent and responsive applications, especially in fields such as robotics, the Internet of Things (IoT), and autonomous systems. These domains heavily depend on low latency and real-time AI capabilities, which are essential for the successful operation of applications.

5.3 Video benchmark

To evaluate the effectiveness of the proposed virtual sensor in monitoring both vehicle traffic and pedestrian flow, a set of videos was employed to emphasize key aspects of each application. The benchmark videos consist of edited clips sourced from YouTube, tailored to a more manageable length for testing purposes.

A 120-second video with a frame rate of 30 fps was used to evaluate the virtual sensor’s capability to track pedestrian movement. This footage captures pedestrian activity along a sidewalk in the city center of Budapest, Hungary [22]. Conversely, a 210-second video with a frame rate of 30 fps was employed to benchmark the virtual sensor’s performance in monitoring vehicle traffic. This video records cars traveling on Route 28 in West Dennis, Cape Cod, Massachusetts, filmed from the roadside for a comprehensive view [42]. Both videos were filmed using a stationary camera, ensuring a consistent and stable perspective for evaluation.

Fig. 3
figure 3

Log trace reporting the number of objects tracked per frame with and without DIPM. Green vertical bars identify frames skipping points

5.4 Energy measurement setup

To evaluate energy consumption, we supplied power to the edge device using an NGMO2 Rohde & Schwarz dual-channel power supply [39], which maintains a constant voltage. Concurrently, we monitored the voltage drop across a sensing resistor (\(0.06\ \Omega \)) placed in series with the power supply. During the experiment, we sampled the signal using a National Instruments NI-DAQmx PCI-6251 16-channel data acquisition board [17].
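The conversion from sampled shunt voltages to consumed energy can be sketched as follows; the \(0.06\ \Omega \) resistor value comes from the text, while the supply voltage, sampling rate, and function names are illustrative assumptions:

```python
def energy_joules(v_supply, v_sense_samples, r_sense=0.06, fs=1000.0):
    """Estimate energy from voltage-drop samples over a series sensing resistor.

    v_supply: constant supply voltage (V); v_sense_samples: sampled drops (V);
    r_sense: shunt resistance (ohm); fs: sampling rate (Hz).
    """
    dt = 1.0 / fs
    energy = 0.0
    for v_sense in v_sense_samples:
        current = v_sense / r_sense             # Ohm's law on the shunt
        power = (v_supply - v_sense) * current  # power delivered to the device
        energy += power * dt                    # rectangular integration
    return energy
```

For instance, a constant 1 A draw from a 5 V supply over one second of samples yields roughly 4.94 J delivered to the device, the remainder being dissipated in the shunt.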

6 Characterization results

This section presents the characterization results of the experiments conducted with the virtual sensor on several video benchmarks. We characterized how the DIPM affects object detection and tracking in terms of energy savings and accuracy loss, taking as baseline the results obtained by the same detection/tracking algorithms without the DIPM.

6.1 Detection and tracking behavior

To evaluate how the DIPM influences the detection and tracking behavior, we logged the total number of unique objects actively tracked in each frame of the video benchmarks. Figure 3 shows the log trace obtained when the DIPM is off (continuous blue line) and active (dashed red line) with three different DIPM parameter configurations. In particular, the graph at the top of the figure was obtained with a highly aggressive power manager (\(\alpha =1\) and \(n_{max}=6\)), the graph in the middle with intermediate values (\(\alpha =4\) and \(n_{max}=6\)), and the graph at the bottom with a strongly precautionary DIPM configuration (\(\alpha =16\) and \(n_{max}=6\)). Notice that the vertical green bars identify the frames skipped because of the DIPM. Since no information is available about the content of skipped frames, the log trace was reconstructed by repeating the data of the last frame processed before skipping.

The global picture shows that, as the aggressiveness of the DIPM increases, the virtual sensor identifies the objects in the scene with increasing delay. In fact, by eliminating some frames, each new object entering the scene starts to be tracked late. The trace also shows that some objects appearing in the scene for only a few frames (peaks of very low amplitude in the blue trace) are not tracked, because frame skipping shortens their tracks below the minimum duration threshold represented by the \(t_{min}\) parameter of the IOU tracker. On the other hand, as the DIPM becomes more conservative, the corresponding trace approximates the ground truth with ever greater precision.

From the energy point of view, the plot shows that during relatively stationary moments of the video, where few objects are present or they move slowly, the number of skipped frames is high. On the other hand, in the more dynamic phases, the DIPM avoids skipping frames to facilitate the tracker’s task.

6.2 MOT17 results

To quantitatively evaluate the DIPM influence on the tracking system, we executed the MOT17 benchmark using different values of \(\alpha \) and \(n_{max}\). Figure 4 shows four of the most representative MOT metrics calculated when varying the DIPM parameters.

Fig. 4
figure 4

Plots reporting the results obtained on the MOT17 benchmark when changing the values of \(\alpha \) and \(n_{max}\) related to the HOTA (a), association precision (b), identity switches (c), and tracks fragmentation (d)

In particular, both HOTA and AssPr, plots (a) and (b) respectively, rapidly increase for increasing values of \(\alpha \) until reaching the value recorded with the DIPM off (red dashed line). At the same time, the lower the value of \(n_{max}\), the lower the metrics gap. Interestingly, for \(\alpha = 4\), the gap for both metrics is lower than one percentage point regardless of the value of \(n_{max}\). As both HOTA and AssPr deal with the association error, their similar trend with respect to the DIPM parameters suggests that the DIPM introduces errors precisely in the association phase, increasing, for example, the identity switches.

To confirm this, Fig. 4c reports the trend of the number of identity switches (IDSW), which effectively shows higher values for parameters that make the DIPM more aggressive (i.e., forcing skipping more frames). Consequently, the fragmentation of the tracks is also more significant when a greater number of frames are skipped, as reported in Fig. 4d.

Ultimately, the global trend of the MOT17 tests highlights that the DIPM, if appropriately calibrated, does not introduce an excessive distortion in the dynamics of object tracking; indeed, even in the worst cases, the decrease in the metrics remains within 5%. Moreover, for a medium precautionary configuration, the reduction of some key MOT metrics is lower than one percentage point.

6.3 Impact of the DIPM on the metrics extraction

Fig. 5
figure 5

Plots showing the error introduced by the DIPM when changing the values of \(\alpha \) and \(n_{max}\) in the calculation of the number of unique objects (top), normalized distance (middle), and normalized speed (bottom)

To estimate how the DIPM impacts the accuracy of the virtual sensor, we conducted a set of experiments to extract several relevant metrics regarding the objects moving in the video. In particular, we calculated the total number of unique objects and the average normalized speed and distance from the two video benchmarks, both with and without the DIPM. For each metric, we calculated the percentage error with respect to the value obtained when the DIPM was disabled (ground truth). In Fig. 5, the charts show the percentage error introduced in the number of unique objects (top), in the normalized distance (middle), and in the normalized speed (bottom) when varying the values of \(\alpha \) and \(n_{max}\) in the benchmark containing cars.

In all three cases, the error appears more significant for values of \(\alpha \) close to 1, whereas it decreases as \(\alpha \) increases. At the same time, the highest error values are obtained for the highest values of \(n_{max}\). This behavior occurs because the lower the value of \(\alpha \), the higher the number of skipped frames, resulting in a higher probability of losing tracks and switching objects. Consequently, the estimated number of unique objects is spuriously inflated. Likewise, if the probability of losing a track rises, track lengths decrease, leading to an underestimation of the traveled distance. Finally, the average object speed measurement also suffers from a larger number of skipped frames, probably due to the consequent increase in object switches.

Regarding the impact of the \(n_{max}\) parameter, which sets a strict cap on the number of skippable frames, the plots show that, on average, the error tends to increase for higher values. Since this parameter essentially determines the maximum time during which objects remain untracked, its correlation with the metric-estimation error is expected.
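The role of \(n_{max}\) as a hard cap can be sketched with a simple clamp; here `predicted_skip` is a stand-in for whatever number of skippable frames the DIPM prediction step proposes (the function name and interface are illustrative, not the paper's actual implementation):

```python
def frames_to_skip(predicted_skip: int, n_max: int) -> int:
    """Clamp the number of frames the DIPM is allowed to skip.
    n_max bounds the maximum time during which objects go untracked,
    regardless of how many frames the predictor proposes to skip."""
    return max(0, min(predicted_skip, n_max))
```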

In general, even if the influence of the DIPM on the correctness of the extracted metrics is evident, by appropriately sizing the values of \(\alpha \) and \(n_{max}\), it is possible to limit the errors to within a few percentage points. For instance, with \(\alpha = 5\), the error on the extracted metrics never exceeds five percent for any value of \(n_{max}\).

6.4 Energy saving

During testing, we measured the power consumption of the NVIDIA Jetson device using the measurement setup described in Sect. 5.

Fig. 6

Power traces collected during object tracking with DIPM on (top trace) and off (bottom)

Figure 6 shows two power traces collected during object tracking with DIPM enabled (top) and without it, i.e., with the traditional IoU tracking methodology (bottom). Notice that each burst of power consumption, reaching about 5 Watts, corresponds to the inference and tracking of a single frame. The energy saving from each skipped frame is evident from the comparison between the two traces. Indeed, because the GPU is not used (the inference phase is not executed), the power consumption drops below 2 Watts for each skipped frame, a net gain of over 3 Watts. Notice that the residual power consumption accounts for the normal CPU execution, which, in these tests, continues to decode the input frames. A strategy to obtain further energy savings is to turn off image capture by the webcam driver by interacting directly with the operating system while frames are skipped. In this work, however, we avoid implementing operating-system-dependent strategies to demonstrate the general validity of the proposed approach.
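Using the indicative power levels above (about 5 W while processing a frame, below 2 W while skipping), the expected relative energy saving as a function of the fraction of skipped frames can be estimated with a back-of-the-envelope model; the power constants here are illustrative round numbers, not measured values, and frame durations are assumed constant:

```python
def energy_saving_pct(skip_fraction: float,
                      p_active: float = 5.0,
                      p_skip: float = 2.0) -> float:
    """Estimated percentage of energy saved versus processing every
    frame, assuming constant per-frame duration and two power levels:
    p_active while running inference+tracking, p_skip while skipping.
    Power values are illustrative, not the measured traces."""
    avg_power = (1 - skip_fraction) * p_active + skip_fraction * p_skip
    return (1 - avg_power / p_active) * 100.0

# e.g., skipping 60% of frames: average power 0.4*5 + 0.6*2 = 3.2 W,
# i.e., a 36% saving with respect to always-on inference.
saving = energy_saving_pct(0.6)
```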

Fig. 7

Plots showing the energy saving when changing the values of \(\alpha \) and \(n_{max}\) for the video benchmark with cars (a) and people (b)

The energy saving achieved with our DIPM with respect to the traditional IoU approach is further explored in Fig. 7, which reports the percentage of energy saved for different values of the two DIPM parameters (\(\alpha \) and \(n_{max}\)) when tracking cars (a) and people (b). As expected, the lower the precautionary parameter \(\alpha \), the more frames are skipped, resulting in higher energy saving. Increasing \(n_{max}\) further increases the number of skipped frames, which leads to a further energy reduction. Comparing the two video benchmarks, it is clear that the amount of energy saved also depends on the content of the video. Indeed, the video containing people (b), for which the smallest saving is obtained, is more dynamic than the video of the cars (a), and it shows very few frames in which no one appears in the scene.

Finally, to evaluate the trade-off between energy consumption and accuracy, we plotted the Pareto charts showing the energy expenditure ratio (i.e., the ratio between the energy consumed with and without the DIPM) against the percentage error measured on the normalized speed (Fig. 8). A Pareto chart helps visualize the relationship between different cost metrics and provides a visual tool to identify the best trade-off points. For instance, in Fig. 8, the best points, highlighted with red circles, are those that simultaneously minimize the energy expenditure ratio and the error in the speed calculation. On the benchmark containing cars (a), the best trade-off occurs with \(n_{max}\) equal to 10, while for people (b), with \(n_{max}\) equal to 2. In both cases, the corresponding value of \(\alpha \) was 2. At these points, for an accuracy error just above 1%, the proposed virtual sensor saves up to 36% and 21% of energy in the cars and people benchmarks, respectively. The energy saving is much more significant in videos with static scenes and a sporadic presence of objects. However, even in the case of dynamic scenes with many objects, non-negligible energy savings are achieved.
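The Pareto-optimal configurations described above are those not dominated on both axes (energy expenditure ratio and speed error, both to be minimized). A minimal sketch of this selection, on hypothetical (energy ratio, speed error %) pairs, could look as follows:

```python
def pareto_front(points):
    """Return the points not dominated by any other point, where a
    point q dominates p if q is no worse than p on both coordinates
    (here: energy expenditure ratio and speed error %, both minimized)
    and q differs from p."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

# Hypothetical (energy_ratio, speed_error_%) configurations:
pts = [(0.64, 1.1), (0.80, 0.6), (0.55, 3.0), (0.70, 1.5)]
front = pareto_front(pts)  # (0.70, 1.5) is dominated by (0.64, 1.1)
```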

Fig. 8

Pareto charts showing the energy expenditure rate versus the error on the normalized object speed changing the value of \(n_{max}\) for the video benchmark with cars (a) and people (b)

7 Conclusions

The convergence of edge computing devices with GPUs, TPUs, and machine learning has revolutionized computer vision and real-time tracking, enabling tasks to be executed at the edge for minimized latency and enhanced privacy. This research focuses on the design and characterization of real-time vision-based virtual sensors within distributed systems, placing a strong emphasis on energy efficiency. These virtual sensors, which synthesize camera data into anonymous numerical information, report quantities such as the number of people in a given area or their average permanence time.

Through empirical experiments on a real hardware platform, the study’s findings reveal that the proposed virtual sensor can achieve a reduction in energy consumption in the range of 21% to 36%, accompanied by a decrease of less than 1.2% in tracking accuracy. Extensive tests on the MOT17 benchmarks show that the DIPM integrated into the virtual sensor, if properly calibrated, has a limited impact on object tracking. Indeed, for a medium precautionary configuration, the reduction of performance metrics is never above five percent.