1 Introduction

A. Motivations

The ongoing COVID-19 pandemic has had a negative impact on the development of society, the economy, and the environment worldwide [1]. COVID-19 has spread widely around the world, mainly through direct, aerosol, and contact transmission. Direct transmission occurs when droplets are breathed in during close-range interaction; aerosol transmission occurs when droplets mix with air to form an aerosol that is inhaled [2]; and contact transmission occurs when droplets deposited on objects reach the nasal and oral cavities, eyes, or mucous membranes via non-sanitized hands. Recorded symptoms of infection include fever, dry cough, general fatigue, nasal congestion, and, more rarely, hypoxia. In the most severe cases, 50% of patients develop dyspnea after the first week, which can evolve into acute respiratory distress, septic shock, metabolic acidosis, hemorrhage, and coagulation dysfunction. Most patients recover well, but a non-negligible percentage remain in critical condition or even die. Many countries have taken restrictive measures to limit the spread of infection [3, 4], but with relatively little success. Even now, the key elements for ensuring the safety of individuals are technologies that can detect social distance [5,6,7,8], face masks, and body temperature [9,10,11,12]. To this aim, a promising solution comes from AI-based systems.

This paper proposes integrating three parallelized models of YOLOv4, a widely used deep-learning object detector, on an embedded platform. The goal is to increase the level of detail in detecting the behaviors of individuals that often cause the spread of infection.

B. State-of-the-art overview

Convolutional neural network (CNN) models appear to be best suited for applications in image reconstruction and classification [13, 14], object detection [15], and instance segmentation [16]. They are also exploited for their ability to extract features and handle limited or incomplete data sets [17, 18]. YOLO is certainly the most widely used of the CNN-based models due to how readily it integrates into real-time systems [19]. In this work, three YOLOv4-tiny models [20] are proposed to parallelize the detection of distance, face mask presence, and temperature on different cores of the same processor. The proposed research takes inspiration from previous work [8, 21], which is limited to a single feature detection (social distancing only, enhanced with a bird's-eye view for perspective in [21]) and takes input from a thermal camera only.

In contrast, this work addresses real-time multi-feature detection using thermal and visible cameras. Using multiple DL models, the proposed method detects humans and faces with bounding boxes. The detected boxes are then processed to classify whether each individual wears a mask. Meanwhile, the proposed approach is a standalone application that approximates the distance between individuals and measures their facial temperature. DL is used today in different real-time applications to protect people's lives from harm such as fire disasters [22], in health care, and in facial feature analysis by processing images or video surveillance streams. Compared with previous work, in addition to changing the application, we have improved the computational capabilities of the DL models and enhanced the integration flow on the embedded system to ensure real-time throughput, allowing us to run as many as three different YOLO models in parallel on separate cores. Furthermore, several researchers use a combination of ResNet50 [23] and YOLOv3 [24] lightweight neural network architectures with transfer learning techniques to balance resource constraints against object detection accuracy. In recent years, DL object recognition techniques [25] have been exploited extensively in computer vision tasks and can potentially be more effective than shallow models in solving complex problems, as DL recognition models emphasize feature and contextual learning [26]. Object detection architectures [27] are split into two categories: two-stage models such as FPN [28], Mask R-CNN [29], and Faster R-CNN [30], and single-stage models such as YOLO [31], YOLOv2 [32], and YOLOv4 [33]. In two-stage detectors, detection occurs in stages, with proposals being computed in the first stage and object categories being classified in the final stage. In contrast, single-stage models treat prediction as a regression problem, detecting all the objects in the image in a single shot.

In [34], the authors presented an object detection approach trained to predict targeted objects in images for face recognition. Researchers have shown that wearing face masks can significantly reduce COVID-19 spread in public areas. In addition, deep learning models can be trained to recognize whether people wear face masks.

Face mask detection has been used widely in transportation systems [35]. Face mask detection and social distance measurement can be accomplished using deep learning models [36]: a video camera can be utilized, and a DL algorithm can detect face masks and people violating social distancing, while also performing effective feature extraction from the images. The authors in [37] proposed a framework for face mask detection and social distance monitoring to reduce COVID-19 spread between individuals; they implemented their work on a Raspberry Pi 4, which can perform multiple activities simultaneously. Embedded-system-based deep learning algorithms are gaining increasing attention for different object detection and tracking applications [38]. The authors in [39] proposed a system that performs face mask detection, temperature measurement, and social distance measurement to protect individuals from COVID-19; they presented an integrated approach, including an Arduino Uno and Raspberry Pi-based IoT system. In [40], the authors proposed a non-real-time detection system for identifying COVID-19 by applying DL models to chest X-ray images.

It proved to be very accurate and hence quite beneficial for radiologists in the prompt detection of COVID-19. Artificial intelligence-enabled technology solutions, such as self-explanatory digital solutions, are needed to deal with the post-pandemic situation in society and industry, and will provide strong support in minimizing the impact of COVID-19 on adverse economic circumstances [41, 42]. A previous study performed randomized social distancing and mask detection trials and found that an inexpensive intervention can help interrupt respiratory virus transmission in society [43]. Recent studies have addressed handling community gatherings using different methods to minimize the spread of COVID-19 among individuals, such as social distancing, mask usage, and temperature measurement, the latter also being an essential tool to detect symptoms of the virus. These studies applied one technique or a combination of two to prevent the spread of COVID-19. However, they hold several limitations from a conceptual framework point of view. The explored literature indicates the need for an efficient method that strengthens deep learning technology to respond effectively to the outbreak. In this paper, we propose an integrated approach that incorporates all three technologies (mask detection, social distancing, and temperature measurement), which can provide numerous advantages in controlling the spread of infectious diseases. It can help identify individuals who may be infected but asymptomatic and provide real-time data on compliance with public health guidelines. Furthermore, an integrated approach can help to overcome the limitations of using each technology individually. Table 1 shows a summary of existing studies.

Table 1 Comparison of existing studies using different techniques (one method alone or a combination of two) to prevent COVID-19 spread
C. Contributions

Our goal in this research is to enrich COVID-19 prevention systems and compare the integrated algorithm against other state-of-the-art methodologies. AI-enabled technology can improve the overall situation by minimizing lockdown phases: surveillance, detection, and monitoring systems can be implemented using DL models and IoT-embedded devices as the core of the required solution to the ongoing pandemic. The contributions of this work are summarized as follows:

  • The integrated approach can help prevent the spread of COVID-19 by combining social distancing monitoring, face mask detection, and facial temperature measurement, employing a fusion of three different YOLOv4-tiny object detectors to monitor and detect these features simultaneously in real-time.

  • The proposed YOLOv4-tiny can perform object detection and tracking much faster than other state-of-the-art deep learning models. Despite its smaller size, YOLOv4-tiny still achieves high accuracy in detecting objects for real-time applications.

  • Executing the proposed models on NVIDIA boards (Jetson Nano and Xavier AGX) showcases their potential scalability and efficiency, paving the way for real-world applications in various scenarios with different trade-offs between cost and performance.

  • A thermal screening system has been developed that uses a single thermal camera to measure the facial temperature of more than one person at a time, while the same camera continues to monitor social distancing between pedestrians.

The aim of YOLOv4-tiny in this research is to detect objects in video frames. Given an input frame, the model processes it through its convolutional neural network to generate bounding box predictions and associated class probabilities. Specifically, we integrated three different YOLOv4-tiny object detectors into the system, each serving a specific task: social distancing monitoring, mask detection, and facial temperature measurement. YOLOv4-tiny is a deep learning model known for its efficiency and suitability for real-time applications, making it a suitable choice for this edge-AI algorithm. The proposed models were trained on different data sets for people detection, mask detection, and facial temperature measurement. These data sets contain a diverse range of samples to ensure robustness and accuracy in different scenarios.
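
For illustration, the following minimal sketch shows how one of the three YOLOv4-tiny detectors could be loaded and run on a frame with OpenCV's DNN module; the configuration and weight file names are placeholders, not files published with this work.

```python
import cv2

# Placeholder file names: the trained weights are not distributed with this paper.
net = cv2.dnn.readNetFromDarknet("yolov4-tiny-mask.cfg", "yolov4-tiny-mask.weights")
model = cv2.dnn_DetectionModel(net)
# 416x416 matches the input size used later for the camera streams.
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("sample_frame.jpg")
# detect() returns class ids, confidence scores, and [x, y, w, h] boxes.
classes, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
for cls, score, (x, y, w, h) in zip(classes, scores, boxes):
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```

The same pattern would be repeated for the person and face detectors, each with its own configuration and weights.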

The rest of the paper is organized as follows: Section 2 presents the proposed methodology; Section 3 presents the obtained results and the discussion; Section 4 describes the real-time implementation on edge NVIDIA platforms. Finally, conclusions are drawn in Section 5.

2 Proposed algorithm design methodology

In this work, we implemented the proposed method for multiple tasks, including social distancing monitoring and facial temperature measurement, together with face mask detection algorithms. This approach provides an automated surveillance system that uses video cameras to warn authorities and help them ensure that individuals comply with social distancing regulations and face mask norms, while measuring their facial temperature, to reduce virus spread. Three YOLOv4-tiny models are utilized for the tasks described above. The proposed approach starts with collecting the data sets for the three tasks. Then, we trained and tested the YOLOv4-tiny models to evaluate their performance and robustness. The final prototype runs as a standalone application on an embedded system (Jetson Nano or Xavier AGX) connected to the monitoring system. We used a visible video camera for face mask detection and a thermal camera for social distance classification and facial temperature measurement. The visible and thermal cameras are installed, operated simultaneously, and processed on the NVIDIA devices. Figure 1 shows the integrated approach for face mask detection, social distance measurement, and facial temperature video measurement.

Fig. 1 Integrated approach for face mask detection, social distance, and facial temperature video measurement system

A. Face mask detection

The face mask images were gathered from various sources on the internet. We selected people of different ages in indoor and outdoor public places; 900 images were used for this experiment. The selected images include single faces and crowded groups of individuals appearing from different angles. We selected different types of masks in different colors, see Fig. 2. A data annotation tool was used to label the targeted faces in the images. There are various annotation types, such as image and video annotations, key-point annotations, and polygonal segmentation annotations. Here, LabelImg was utilized to label object bounding boxes on the images; this tool allows saving annotations in different formats. The YOLOv4-tiny model has been designed and trained for face mask detection. Figure 3 shows the workflow for designing and training YOLOv4-tiny for face mask detection. The proposed approach aims to build a custom real-time model for face mask detection.

Fig. 2 Sample masks of different colors

Fig. 3 Workflow structure for mask detection

B. Social distancing

In this research, the YOLOv4-tiny model is used for human detection. 2000 thermal images were collected from various sources. This data set consists of thermal images of people acquired in different realistic indoor and outdoor environments. The thermal images contain natural scenes of human activity, including walking, talking, standing, and sitting. A custom annotation tool was utilized to label persons with bounding boxes. We computed the centroid of each detected bounding box and used the Euclidean formula to measure the distance between centroids. In this work, the safe Euclidean distance is set to 6 feet, and two violation thresholds are assigned to the detected persons: a warning level, marked in yellow, and a dangerous level, marked in red. If the distance between detected people is less than or equal to 5 feet, the bounding box color is set to red. The bounding box color changes to yellow when the distance between the detected bounding boxes is less than or equal to 6 feet but more than 5 feet. When the distance between detected persons is more than 6 feet, the bounding box color is set to green, meaning social distancing is maintained safely. A minimal sketch of this threshold logic follows.
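
The sketch below illustrates the pairwise-distance classification just described; the helper name and the pixels-per-foot calibration constant are our assumptions (in the actual system, the scale comes from the bird's-eye transform described next).

```python
import itertools
import numpy as np

PIXELS_PER_FOOT = 30  # assumed calibration constant for this sketch

def classify_distances(centroids):
    """Color each detection from pairwise Euclidean distances (in feet)."""
    colors = ["green"] * len(centroids)        # > 6 ft: safe
    for i, j in itertools.combinations(range(len(centroids)), 2):
        d = np.linalg.norm(np.subtract(centroids[i], centroids[j])) / PIXELS_PER_FOOT
        if d <= 5:                             # dangerous threshold
            colors[i] = colors[j] = "red"
        elif d <= 6:                           # warning threshold
            for k in (i, j):
                if colors[k] != "red":         # red takes priority over yellow
                    colors[k] = "yellow"
    return colors

print(classify_distances([(0, 0), (120, 0), (400, 0)]))  # ['red', 'red', 'green']
```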

The proposed approach has been implemented with a bird's-eye view to eliminate the perspective distortion of the video camera. The top-down view improves the scalability of the social distancing estimation system: the video camera does not have to be set up in a specific way, and neither the camera's height nor its inclination angle needs to be determined. Instead, the user clicks four points on the captured video image that become the corner points of a plane, transforming the targeted classes into a top-down view. These points must form a rectangle with at least two opposite sides parallel. If this system is turned into a product, it can be adopted effectively; a minimal sketch of the transform follows.
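
As a sketch of the transform under the stated assumptions (four user-clicked ground-plane points; the coordinates here are made up for illustration), OpenCV's perspective functions can map detected centroids into the top-down plane:

```python
import cv2
import numpy as np

# Four user-clicked ground-plane points (illustrative values; order: tl, tr, br, bl).
src = np.float32([[420, 180], [860, 190], [980, 640], [300, 630]])
# Corners of the corresponding rectangle in the bird's-eye plane.
dst = np.float32([[0, 0], [400, 0], [400, 600], [0, 600]])

M = cv2.getPerspectiveTransform(src, dst)

def to_top_down(centroids):
    """Map bounding-box centroids from the camera view to the bird's-eye plane."""
    pts = np.float32(centroids).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, M).reshape(-1, 2)

print(to_top_down([(640, 400)]))  # centroid expressed in top-down coordinates
```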

C. Facial temperature measurement

Facial images were taken from [50], see Fig. 4. Most of the facial thermal data sets were collected in indoor and outdoor environments. The images were acquired with a thermal video camera from different scenes, including people in different body positions and with different facial expressions. 9,982 images were utilized for this work. The thermal images were inverted to obtain negative images, and gamma correction was applied to these negatives to improve their visibility, enhancing the brightness of the facial features. The proposed system calculates the average temperature of individuals' faces based on pixel interpolation from a given image frame, determining the average temperature for each person's face within the frame. The process begins with the function get_person_temperature, which takes a list of bounding boxes (boxs) and an input image frame (frame). The code loops over each face bounding box in the frame and extracts the region of interest (ROI) corresponding to that person's face. Python and appropriate libraries (e.g., OpenCV or PyTorch) are utilized to read the image and extract raw pixel values. By analyzing the pixels in the ROI, the code calculates the average value, which is then mapped to the 36–38 °C temperature range using a custom map_function, allowing for better representation and visualization. The map_function is instrumental in this process, as it transforms the calculated average value from its original range. Finally, the resulting values are converted into integers. A hedged reconstruction of this routine is sketched below.
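
The following is a reconstruction of the routine described above, not the authors' published code: the linear form of map_function and the 0–255 input range are assumptions.

```python
import numpy as np

def map_function(value, in_min=0.0, in_max=255.0, out_min=36.0, out_max=38.0):
    # Linearly rescale an average pixel value into the 36-38 deg C display range
    # (the linear form and the 8-bit input range are assumptions).
    return out_min + (value - in_min) * (out_max - out_min) / (in_max - in_min)

def get_person_temperature(boxs, frame):
    temperatures = []
    for (x, y, w, h) in boxs:
        roi = frame[y:y + h, x:x + w]           # face region of interest
        avg = float(np.mean(roi))               # average raw pixel intensity
        temperatures.append(int(round(map_function(avg))))  # final integer value
    return temperatures

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)  # dummy thermal frame
print(get_person_temperature([(100, 100, 60, 80)], frame))
```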

Fig. 4 Sampled images used for facial temperature measurement [50]

D. Model building and training

YOLOv4-tiny is a deep convolutional neural network designed for object detection and recognition. It is a smaller and faster version of the original YOLOv4 model that still maintains high accuracy and precision in detecting objects in images and videos. The lightweight nature of YOLOv4-tiny also makes it suitable for mobile and embedded devices, which are becoming increasingly popular for real-time applications. With the rise of the Internet of Things (IoT), there is a growing need for low-power, low-cost devices that can perform real-time object detection; YOLOv4-tiny is well suited for this task, as it can run on devices with limited processing power and memory. In YOLOv4-tiny, the classification model is typically based on the CSPDarknet53 architecture, a custom deep neural network designed specifically for the YOLO models. CSPDarknet53 is a convolutional backbone for object detection based on Darknet-53. It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merge them through a cross-stage hierarchy; this split-and-merge strategy allows more gradient flow through the network. Figure 5 shows the structure of the YOLOv4-tiny model. The convolutional layers have been compressed to 29 layers to achieve fast detection; as a result, YOLOv4-tiny reaches up to 371 fps, which meets the requirements of real-time applications. The YOLOv4-tiny model utilizes the CSPDarknet53-tiny network as its backbone, substituting the CSPDarknet53 network used in the YOLOv4 architecture. The CSPDarknet53-tiny network adopts the CSP-Block architecture of the cross-stage partial model, replacing the Res-Block architecture of the residual network. The CSP-Block divides the feature map into two segments, creating two separate gradient paths through the network. The CSP-Block can enhance the learning of the CNN compared with the Res-Block, although the improved accuracy comes at the cost of increased computation; the tiny variant therefore eliminates the computational bottlenecks with the highest overhead in the CSP-Block architecture, enhancing the performance of the YOLOv4-tiny model at roughly constant accuracy by reducing the computation. To speed up computation, the Leaky-ReLU function is used as the activation function in the YOLOv4-tiny model instead of the mixed activation functions used in the YOLOv4 architecture, see Eq. (1). The Leaky-ReLU function is

$$y_{i}=\begin{cases}\dfrac{x_{i}}{a_{i}}, & \text{if } x_{i}<0\\ x_{i}, & \text{if } x_{i}\ge 0,\end{cases}$$
(1)

where \({a}_{i}\in (1,+\infty )\) is a constant value.
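
For concreteness, Eq. (1) translates directly into a short NumPy function (a sketch; the slope constant chosen here is arbitrary):

```python
import numpy as np

def leaky_relu(x, a=10.0):
    # Eq. (1): y = x / a for x < 0, y = x otherwise, with constant a > 1.
    return np.where(x < 0, x / a, x)

print(leaky_relu(np.array([-5.0, 2.0])))  # -> [-0.5  2. ]
```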

Fig. 5 Architecture of YOLOv4-tiny [20]

The final stage of the YOLOv4-tiny model is the YOLO head, which performs dense prediction. The outcome of dense prediction is a vector containing the box center coordinates and dimensions {xcenter, ycenter, w, h} for the targeted objects. The data set of images was split into training, validation, and testing sets of 70%, 20%, and 10%, respectively (a simple split sketch follows). This split balances training the model on enough data against preventing overfitting. A separate testing set is essential to evaluate the model's performance on unseen data; the results obtained on the testing set are considered the final evaluation of the model.
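
A minimal sketch of the 70/20/10 split (the shuffling and the seed are implementation details we assume, not specified in the text):

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split into 70% training, 20% validation, 10% testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.7 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled[:n_train],                   # training set
            shuffled[n_train:n_train + n_val],    # validation set
            shuffled[n_train + n_val:])           # testing set

train, val, test = split_dataset(list(range(900)))  # e.g. the 900 mask images
print(len(train), len(val), len(test))               # 630 180 90
```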

A graphical image labelling tool was utilized to annotate the bounding boxes for the targeted classes. Anchor box sizes are usually defined based on the aspect ratios and scales of the objects present in the data set; Table 2 provides a simplified example of the anchor box sizes for each task. In practice, these sizes are determined through experimentation and fine-tuning to achieve optimal performance for the specific data set and model architecture used in the research. To fine-tune the YOLOv4-tiny models, several hyperparameters were adjusted based on the specifications in Table 3. The chosen training method was "sgdm" (stochastic gradient descent with momentum). The training process ran for a total of 80 epochs, allowing the model to iterate over the entire data set 80 times and improve its performance over time. To prevent overfitting, L2 regularization was applied with a coefficient of 0.05. L2 regularization is used during training of the proposed models to prevent overfitting and improve the generalization performance on unseen data: it adds a penalty term to the loss function based on the L2 norm of the model's weights, encouraging the optimizer to minimize the weights. As a result, the optimizer penalizes large weight values, reducing their impact on the final predictions; smaller weights lead to a simpler model that is less likely to overfit and generalizes better to new data. The batch size used during training was 16, where batch size refers to the number of samples processed together in each iteration of the training process. The learning rate was tuned through iterative experimentation, evaluating the model's response to error during training: different learning rates were applied, and the model's performance was monitored on a validation set. A learning rate of 0.001 was eventually selected, as it demonstrated the best balance between stable convergence and good performance on the specific task. In this way, each YOLOv4-tiny model was optimized for the task and data set it was trained on, and the selection of these hyperparameters played a crucial role in shaping the model's ability to detect objects and produce meaningful predictions during inference. Figure 6 shows the training loss curves of the three YOLOv4-tiny models, one per task. All training experiments were carried out on Google Colab with an NVIDIA Tesla K80 GPU.
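
To make the hyperparameter description concrete, the sketch below maps the reported settings (SGDM, learning rate 0.001, L2 coefficient 0.05) onto a standard PyTorch optimizer; this is an illustration of the concepts, not the training code used in this work, whose exact framework is not specified beyond Google Colab.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the YOLOv4-tiny network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,           # learning rate selected after iterative tuning
    momentum=0.9,       # assumed momentum value for SGDM (not reported)
    weight_decay=0.05,  # weight_decay applies the L2 penalty described above
)
```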

Table 2 Anchor box sizes for the annotated bounding boxes of the three tasks' data sets
Table 3 Hyperparameters for tuning the YOLOv4-tiny models
Fig. 6 Loss functions of the three YOLOv4-tiny models during the training phase

3 Experiment results and discussion

A. Evaluation metrics

In this research, we used the following confusion-matrix-based performance criteria [51, 52] to evaluate the proposed framework: accuracy, recall, and precision, see Eq. (2), where \(TP\) (true positives), \(TN\) (true negatives), \(FP\) (false positives), and \(FN\) (false negatives) are calculated from the confusion matrix. Accuracy is defined as the number of correct predictions divided by the total number of predictions. Precision is the percentage of correct positive predictions, indicating how many of the selected predictions are relevant. Finally, recall is the ability of the model to find all the relevant cases within the given data set (a small sketch computing these metrics follows Eq. (2)):

$$\begin{aligned} \text{Accuracy} &= \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \\ \text{Precision} &= \frac{\text{TP}}{\text{TP}+\text{FP}} \\ \text{Recall} &= \frac{\text{TP}}{\text{TP}+\text{FN}}. \end{aligned}$$
(2)
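
A direct transcription of Eq. (2) into code (the counts in the usage line are made-up values for illustration):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts (Eq. 2)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Example with made-up counts:
print(confusion_metrics(tp=90, tn=85, fp=5, fn=10))  # (0.921..., 0.947..., 0.9)
```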

B. Results of the proposed system

This subsection gives a full description of the experimental results obtained in this study. The proposed system operates automatically for social distancing, face mask detection, and facial temperature measurement. The simulation was performed on the testing data sets for the three tasks, with images acquired from different realistic situations, including indoor and outdoor environments. In addition, we implemented other DL models (YOLO, YOLOv2, YOLOv3-tiny, and Faster R-CNN) to assess the proposed YOLOv4-tiny against these object detection architectures on the same training/testing data sets. According to the experimental results, see Fig. 7, YOLOv4-tiny outperforms the other DL models on all three tasks (person detection, mask detection, and facial detection for temperature measurement). The first YOLOv4-tiny model, for person detection, was assessed on thermal videos and showed promising results within the social distancing classification algorithm.

Fig. 7 Performance of YOLOv4-tiny: (a) social distancing, (b) facial temperature measurement, (c) mask detection

The key challenge for social distancing is accurately measuring the actual distance between the detected individuals in the thermal videos. The top-down view approach corrects the perspective and is used to transform the video images from a 2-D camera view to a bird's-eye view. As a result, the centroids of the detected bounding boxes are transformed from the input image onto the top-down view, and the social distance classification is then performed. In addition, the social distance violation threshold is highlighted, which can also be read off from the bounding box colors assigned to the individuals. Simultaneously, the second YOLOv4-tiny model performs facial detection to measure individual temperatures: the average pixel values within the face bounding boxes, which are drawn in blue, are mapped to temperatures and displayed as numbers, see Fig. 8.

Fig. 8 (a) First YOLOv4-tiny model for social distancing with a green bounding box (safe condition) and second YOLOv4-tiny model for facial detection with a blue bounding box for measuring temperature. (b) Person detection points with top-down view

The third YOLOv4-tiny model detects whether people are wearing respirator face masks. Green indicates people wearing face masks, while red is used for those not wearing them. In addition, one of two labels (Mask or No Mask) is assigned on top of each detected bounding box, see Fig. 9 (some false negatives and false positives were noted in this experiment). Overall, the proposed model achieved promising results in detecting real-time interactions among individuals.

The proposed work for social distancing achieved better results than the method in [8], which utilized two data sets of thermal images and a customized YOLOv2 lightweight architecture for object detection. YOLOv4-tiny represents a significant improvement over YOLOv2 in various respects: it has a more powerful backbone network, CSPDarknet53, leading to enhanced feature extraction and better object detection performance. The proposed techniques are compared to other methodologies for measuring social distance and detecting face masks to assess performance based on accuracy [53,54,55,56,57,58]; these methods utilized different data sets for social distancing and mask detection than this work. The proposed approach achieved an accuracy of 96.2% for social distance measurement, 95.1% for face mask detection, and 96% for the facial temperature model. Furthermore, YOLOv4-tiny utilizes anchor boxes to detect objects at different scales and aspect ratios, enabling faster and more accurate object detection than the MobileNet single-shot detector (SSD) utilized in [54]. Regarding robustness to occlusion and small objects, YOLOv4-tiny is more robust and detects small objects better than the CV-and-IoT algorithm utilized in [53], because it uses a better feature extractor that captures more detailed object features from the images. Nagrath et al. [56] utilize MobileNetV2 for face mask detection; this convolutional neural network architecture has gained popularity due to its lightweight and efficient design, making it a suitable choice for mobile and embedded devices. However, despite its advantages, the MobileNetV2 architecture still has drawbacks and limitations: it lacks the residual connections present in other deep learning models, such as the ResNet-style connections in YOLOv4-tiny. These connections allow information to flow directly from one layer to another, facilitating the training of deeper networks; without them, the model may suffer from the vanishing gradient problem, making it harder to train. Tables 4 and 5 compare the model's accuracy with the other social distancing and mask detection methods. YOLOv4-tiny thus made it possible to monitor COVID-19 countermeasures in terms of respecting social distancing, detecting face masks, and measuring facial temperatures among individuals.

Fig. 9 Experiment results for Mask/No Mask detection

Table 4 YOLOv4-tiny vs. other methods for social distancing
Table 5 YOLOv4-tiny vs. other face mask/no mask detection methods

4 Real-time edge implementation

The final designed models were executed in real-time on resource-constrained edge NVIDIA platforms. We utilized the Jetson Xavier and Jetson Nano to execute the proposed architectures. Table 6 compares the two NVIDIA platforms. The Jetson Nano features a 128-core Maxwell GPU and a quad-core ARM A57 CPU, delivering 472 GFLOPs of AI performance; it is equipped with 4 GB of 64-bit LPDDR4 RAM and a MicroSD card slot for storage, offering a maximum resolution of 4K @ 30 fps. Supported AI frameworks include TensorFlow, PyTorch, and Caffe. In contrast, the Jetson Xavier boasts a 512-core GPU and an 8-core ARMv8.2 CPU, providing 30 TOPs of AI performance; it comes with 16 GB of 256-bit LPDDR4x RAM and 16 GB of eMMC flash storage, supports 2 × 4K @ 30 fps resolution, and supports various AI frameworks such as TensorFlow, PyTorch, Caffe, cuDNN, and CUDA, among others. However, the Jetson Xavier consumes more power, ranging from 10 to 30 W, while the Jetson Nano's power consumption lies between 5 and 10 W. This research activity integrated the face mask detection approach with social distancing and facial temperature measurement, examining the execution of multiple DL models on a single NVIDIA board. Different cameras were utilized in this work: a Raspberry Pi camera module 2.1 and a See3CAM camera as visible cameras for face mask detection, and Lepton 3.5 and FLIR Boson cameras for social distancing and facial temperature measurement. The Lepton and Raspberry Pi cameras were connected to the Jetson Nano, while the Boson and See3CAM cameras were connected to the Jetson Xavier AGX. The thermal cameras provide radiometric measurements for every pixel in the image; therefore, the color map in the image is converted to an array of temperature values that can be read as numbers, thanks to OpenCV and its supporting libraries.

Table 6 Specifications of the proposed NVIDIA platforms (Jetson Nano & Jetson Xavier)

We adjusted the frame height and width of each camera output to 416 × 416 (a short preprocessing sketch follows). The proposed integrated approach was executed on both edge NVIDIA platforms. Based on the experimental results, the two cameras simultaneously produced face mask detection, facial temperature, and social distancing classification on the centralized monitoring system, see Fig. 10.
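
The following preprocessing sketch shows the resize step and one plausible radiometric conversion; the centikelvin formula reflects the Lepton 3.5's TLinear radiometric output format and is our assumption, not a detail given in the text.

```python
import cv2
import numpy as np

def preprocess(raw_frame):
    """Resize a camera frame to the 416x416 network input size."""
    return cv2.resize(raw_frame, (416, 416))

def radiometric_to_celsius(raw):
    # Assumption: a radiometric Lepton 3.5 in TLinear mode outputs 16-bit
    # centikelvin values, so Celsius = raw / 100 - 273.15.
    return raw / 100.0 - 273.15

raw = np.full((120, 160), 31015, dtype=np.uint16)   # dummy radiometric frame
print(radiometric_to_celsius(raw[0, 0]))            # -> 37.0 deg C
print(preprocess(raw).shape)                        # -> (416, 416)
```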

Fig. 10 Experiment results for the proposed integrated approach: (a) pedestrian detection points with a top-down view, (b) thermal camera for social distancing detection with green bounding boxes and facial temperature measurement with blue bounding boxes, and (c) visible camera for mask detection

We recorded the real-time detection performance and power consumption on both edge NVIDIA platforms to assess the proposed techniques, which include social distancing (SD), mask detection (MD), and facial temperature measurement (FTM), under different algorithm running scenarios, see Tables 7 and 8. It was observed that when the three models run together, real-time detection performance decreases due to the increased computational cost. Furthermore, regarding temperature variation, the temperature of the Jetson Nano is higher than that of the Jetson Xavier when the proposed approach runs the three tasks simultaneously, which triggers an over-temperature alarm and degrades the performance of the Jetson Nano, see Fig. 11b. This temperature difference is attributed to the increased workload on the Jetson Nano as it struggles to handle the simultaneous execution of the three tasks, leading to elevated heat generation and potentially impacting overall performance.

Table 7 Real-time detection for the proposed method on Jetson Nano
Table 8 Real-time detection for the proposed method on Jetson Xavier AGX
Fig. 11 Temperature measurement for (a) Jetson Nano and (b) Jetson Xavier AGX

This research compares the proposed approach to other methodologies, including pre-trained neural network models. The advantage of the integrated technique is its small on-disk size: 22.9 MB for the social distancing YOLOv4-tiny, 22.8 MB for the facial temperature YOLOv4-tiny, and 23 MB for the mask detection YOLOv4-tiny, since these architectures have few learnable parameters. This makes them executable on low-cost IoT devices. Other methodologies, in contrast, utilize pre-trained CNN layers that require large disk storage, such as the ResNet50 model [42]; in addition, the performance of these pre-trained models is very low for real-time applications on low-cost embedded devices, which impairs the performance of the targeted deep learning models on videos and images. The proposed DL algorithms for the three tasks utilize lightweight and efficient deep learning models specifically designed to run on resource-constrained devices. Techniques such as model quantization, pruning, and knowledge distillation are applied to reduce the model's size and computational complexity while preserving its accuracy to the extent possible. The NVIDIA devices (Jetson Nano & Jetson Xavier) integrate specialized hardware accelerators such as GPUs (graphics processing units), which are optimized for matrix operations and other machine learning tasks, significantly speeding up the computations required for AI processing. To process multiple features simultaneously, the NVIDIA devices leverage parallel computing techniques: the workload is split across the multiple cores or threads of the device's processor, allowing the algorithm to handle multiple inputs and outputs concurrently, as sketched below.
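
As an illustration of this parallel execution pattern (a sketch only: the queue-based structure, task names, and placeholder inference are ours, not the authors' implementation), three detector processes can each own one model and consume frames independently:

```python
import multiprocessing as mp

def run_detector(task, frame_q, result_q):
    # Each worker process would load its own YOLOv4-tiny model here (omitted).
    while True:
        frame = frame_q.get()
        if frame is None:                    # poison pill: stop the worker
            break
        result_q.put((task, f"processed frame {frame}"))  # placeholder inference

if __name__ == "__main__":
    tasks = ["social_distance", "mask_detection", "face_temperature"]
    result_q = mp.Queue()
    frame_qs = {t: mp.Queue() for t in tasks}
    workers = [mp.Process(target=run_detector, args=(t, frame_qs[t], result_q))
               for t in tasks]
    for w in workers:
        w.start()
    for q in frame_qs.values():              # broadcast one dummy frame per task
        q.put(0)
    for _ in tasks:
        print(result_q.get())
    for q in frame_qs.values():              # shut all workers down
        q.put(None)
    for w in workers:
        w.join()
```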

5 Conclusion

This research presented social distancing, mask detection, and facial temperature measurement algorithms as an integrated approach executed in real-time on a single NVIDIA board, assessing the ability of low-cost embedded systems to run multiple deep-learning models simultaneously. The proposed vision-based system can be utilized in any indoor/outdoor environment, such as public areas, train stations, streets, shopping centers, and smart cities, wherever the performance is suitable for the purpose. The proposed work helps ensure safe conditions between individuals. In addition, the developed deep learning models were validated through multiple experiments and achieved promising results. Jetson boards offer low power consumption relative to their computing power. We performed different experiments on the Jetson Nano and Jetson Xavier AGX with different algorithm scenarios. The highest real-time performance was obtained on the Jetson Xavier AGX, which achieved 18 fps from the thermal camera and 62 fps from the visible camera with the three YOLOv4-tiny-based models executing at the same time. We note that the improved real-time detection on the Jetson Xavier AGX comes with increased power demands compared to the Jetson Nano, because the GPU architecture of the Jetson Xavier AGX continuously consumes a large amount of energy. As a further step, the recently released YOLOv7 will be considered for the integrated approach.