1 Introduction

According to reports published by the World Health Organization (WHO) [1], the coronavirus disease (COVID-19) has infected millions of people globally and caused deaths on a large scale. For a long time, countries suspended the daily activities on which their livelihoods depended. They have since resumed daily life by applying the health protocols prescribed by the WHO. One of these measures is to prevent people from crowding in indoor spaces. To achieve this, the maximum number of people that an indoor area can accommodate is determined according to its size. When the number of people per square meter exceeds a certain threshold, action can be taken accordingly.

In many places, such as banks, train stations, shopping malls, and schools, it is difficult to check manually whether the maximum number of people has been reached. Computers can therefore provide this control in indoor spaces monitored by cameras. For this reason, in this study, a region within the indoor area shown in a video recording was selected and its size was estimated. A threshold value of one person per square meter was set, and the people in this region were counted. If there are more people than the capacity of the region allows, the region is marked in red; otherwise, it is marked in green.

Deep learning is one of the most popular topics in machine learning. Deep learning algorithms have been applied successfully across different modalities and fields, including image processing [2], machine translation [3], and natural language processing [4]. Beyond these, deep learning algorithms have in recent years also been used for sound classification [5], language recognition [6], and cancer diagnosis [7].

There are two types of methods for person detection: Region of Interest (ROI) and Line of Interest (LOI). ROI-based methods estimate the number of people within a bounded region, whereas LOI-based methods estimate the number of people crossing a predetermined line [8]. Deep learning-based models are commonly used in the ROI-based approach. Because deep learning performs well in object detection and classification tasks, researchers have been encouraged to apply it to people-counting problems as well. In this study, deep learning methods were used with an ROI-based approach.

Object detection locates objects within an image and classifies each of them: every detected object is enclosed in a bounding box, and the class of that object is estimated. YOLO models use deep learning for object detection; the first version was developed by Redmon et al. [9].

Detecting and counting people in indoor spaces has long been one of the important tasks of video surveillance systems [10], and the number of studies on it has grown in recent years. YOLO models are frequently used to detect people, count them, and measure the distance between them. For example, Punn et al. [11] used the Deepsort algorithm [12] together with the YOLO v3 model [13] for pedestrian detection and distance measurement, and demonstrated the success of this combination compared with other object detection algorithms. Rahim et al. [14] used YOLO v4 [15] for person detection in low-light conditions and monitored social distance with stationary time-of-flight (ToF) cameras. Person detection from the top view was investigated in [16]; in that study, the YOLO v3 model detected and counted people from the top view with an accuracy of up to 95%.

The YOLO v5 model [17] has been used for crowd detection [18], where it was observed to give good results in counting applications without dense crowds. Ren et al. [19] proposed the YOLO-PC algorithm, an extended version of YOLO, and used it to develop a people-counting system with high accuracy rates. Menon et al. [20] counted pedestrians with the YOLO v3 model on the INRIA and ShanghaiTech datasets; tests on the INRIA dataset reached an accuracy of 96.1%, while the ShanghaiTech dataset yielded a lower accuracy of 87.3%.

In addition to these studies, crowd detection has been performed using image and sound features together, since feature extraction is difficult in low-quality images [21]. In that study, crowd counting is treated as a regression problem: a Convolutional Neural Network (CNN) extracts the image features, and a VGG network architecture extracts the audio features. In another study [22], quantum feature extraction was added to the classical CNN architecture for crowd feature extraction.

In this study, YOLO models were used (1) to detect and count people and (2) to estimate the area of a specified region of an indoor space. An algorithm is proposed to estimate the area of a specified region where people are present and to determine the maximum number of people that can occupy this region. For this algorithm, the YOLO v3, YOLO v4, YOLO v5s, YOLO v3-Tiny, and YOLO v4-Tiny models were used and their performance was analyzed. The YOLO v3, YOLO v5s, and YOLO v4 models detected persons within the specified region with accuracy rates of 96.89%, 96.12%, and 94.57%, respectively.

2 Materials and methods

This study focuses on calculating the area of a region with predetermined borders in an indoor space and determining the maximum number of people who can occupy this region. For this purpose, cameras that observe indoor spaces are used, and YOLO models detect and count the people inside the specified region. This section introduces the YOLO models and the dataset used.

2.1 YOLO models

YOLO (You Only Look Once) is a family of models that perform object detection in real time, quickly and with high accuracy. YOLO is a single-stage Convolutional Neural Network (CNN)-based model; in a single-stage model, objects are detected without the need for a separate region-proposal stage. SSD [23] and the YOLO models are examples of single-stage detectors. YOLO applies a single CNN to the image and splits the image into grid cells. Each cell predicts bounding boxes and their associated confidence scores, and the class of the object in each bounding box is determined from these scores.
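
For illustration, the following minimal sketch runs a pre-trained Darknet-format YOLO network through OpenCV's DNN module and prints the class, confidence, and bounding box of each detection; the file names yolov3.cfg, yolov3.weights, and sample.jpg are placeholders.

```python
import cv2
import numpy as np

# Placeholder file names: any Darknet-format YOLO config/weights pair can be used
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
image = cv2.imread("sample.jpg")
h, w = image.shape[:2]

# YOLO expects a square, normalized blob; 416 x 416 matches the setup in Sect. 4
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for output in outputs:
    for detection in output:
        scores = detection[5:]                      # per-class confidence scores
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.4:                        # 0.4 threshold, as in Sect. 4
            # Box center and size are relative to the image; convert to pixels
            cx, cy, bw, bh = detection[:4] * np.array([w, h, w, h])
            print(class_id, round(confidence, 2),
                  int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
```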

YOLO v3 uses a new network to perform feature extraction. This network builds on Darknet-19, the backbone used in YOLO v2 [13]. YOLO v3, whose classifier network is more successful than those of earlier YOLO versions, is based on the 53-layer Darknet-53 network pre-trained on the ImageNet dataset [24] [13]. The architecture consists of 53 convolutional layers and is slower than YOLO v2 because it has more layers. YOLO v3 contains five residual blocks, each consisting of multiple residual units; these blocks, which are absent from YOLO v2, reflect a hybrid approach that mixes the Darknet design with residual network (ResNet) elements.

This approach arranges the network into consecutive 1 × 1 and then 3 × 3 convolutional layers to perform feature extraction. A further 53 layers are added on top of this for the detection task, resulting in a 106-layer fully convolutional architecture for YOLO v3.
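
A Darknet-style residual unit of the kind described above can be sketched in PyTorch as follows; the 1 × 1 then 3 × 3 channel pattern follows the description in the text, and the exact layer hyperparameters are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One residual unit following the 1x1 -> 3x3 pattern of Darknet-53 (a sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),  # 1x1 reduce
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),  # 3x3 restore
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection around the two convolutions

# Shape check: the unit preserves both channels and spatial size
print(DarknetResidual(64)(torch.randn(1, 64, 52, 52)).shape)  # torch.Size([1, 64, 52, 52])
```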

Bochkovskiy et al. [15] proposed the YOLO v4 algorithm, an advanced version of YOLO v3 that improves on it in terms of both speed and accuracy. The CSPDarknet53 feature extraction model has been observed to give better results in YOLO v4 [15]. The YOLO v4 architecture is given in Fig. 1; it consists of three blocks: backbone, neck, and dense prediction. Figure 2 shows the main structure of the YOLO v4 architecture.

Fig. 1
figure 1

YOLO v4 architecture [15]

Fig. 2
figure 2

Main structure of YOLO v4 architecture [28]

The pipeline consists of three parts: backbone, neck, and head. CSPDarkNet53 [25], used as the backbone, performs feature extraction and is one of the factors that increase the accuracy of the system [26]. The neck is an intermediate layer between the backbone and the head, added to obtain more information when estimating objects. The head predicts the classes and locations of objects (e.g., person); it also calculates the size (width and height) and coordinates of the bounding boxes [27].

The backbone contains a CSPDarknet53 network, which takes an image or video frame as input and is responsible for extracting its features. Figure 3 compares the standard DenseNet design with the CSPDenseNet design used here: in CSPDenseNet, the output of the first layer is split in two, with one part going to the dense block and the other directly to the next transition layer, as shown in Fig. 3b.

Fig. 3
figure 3

Structure comparison between a Standard DenseNet and b CSPDenseNet [29]

The dense block includes Batch Normalization, a Rectified Linear Unit (ReLU), and a convolution layer. Each layer of the dense block takes the feature maps of all previous layers as input, which helps to find complex features of an image. YOLO v4 has CSPDarknet53 as its backbone, whereas YOLO v3 has Darknet53. YOLO v3 uses a Feature Pyramid Network (FPN), while YOLO v4 replaces the FPN with a Spatial Attention Module (SAM) and a Path Aggregation Network (PAN). In YOLO v4, however, there is no max pooling or average pooling [15].

YOLO v5 was developed by Jocher et al. [17] and is the latest and fastest of the YOLO models [17]. It differs from previous versions in that it is a PyTorch [30] implementation. As in YOLO v4, Cross-Stage Partial networks (CSP) are used in the backbone and a Path Aggregation Network (PANet) in the neck [31], and the head is the same as the one used in YOLO v4. In YOLO v5, Leaky ReLU (LReLU) is used in the hidden layers and the sigmoid activation function is used for object detection in the last layer. The default optimization algorithm for training is Stochastic Gradient Descent (SGD) [32].

The network structure of YOLO v5 is divided into three parts: backbone, neck, and output. The backbone extracts features from the input images, the neck combines the extracted features into feature maps, and objects are detected from these feature maps in the output part [33]. There are two types of CSP in YOLO v5: one in the backbone and one in the neck network. While the CSP network in the backbone consists of one or more residual units, the CSP network in the neck replaces the residual units with CBL modules (Conv2D, Batch Normalization, and Leaky ReLU). CSP connects the front and back layers of the network, increasing inference speed by reducing the size of the model [34]. The Spatial Pyramid Pooling (SPP) layer aggregates object-related features and passes fixed-length vectors to subsequent layers; in other words, it collects information about the object at a deeper stage of the network hierarchy (between the convolutional and fully connected layers) without cropping the object [35]. By adjusting the width and depth of the model, four variants with different parameter counts are obtained: YOLO v5s, YOLO v5m, YOLO v5l, and YOLO v5x.
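
As a rough sketch of the SPP idea in the YOLO v5 style, the module below pools the same feature map at several receptive fields and concatenates the results; the kernel sizes 5/9/13 follow the common YOLO configuration and are an assumption here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling block in the YOLO v5 style (illustrative sketch)."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, kernel_size=1)  # 1x1 channel reduction
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels]
        )
        self.fuse = nn.Conv2d(c_hidden * (len(kernels) + 1), c_out, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        # Pool at several receptive fields and concatenate with the unpooled input
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

# Shape check: spatial size is preserved, channels are fused to c_out
print(SPP(1024, 512)(torch.randn(1, 1024, 13, 13)).shape)  # torch.Size([1, 512, 13, 13])
```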

The YOLO v3 and YOLO v4 models have also been minimized and optimized into Tiny versions, in which the network size and the number of convolutional layers in the CSP backbone are significantly reduced. YOLO v5 itself comes in differently sized variants (s, m, l, x). The YOLO v3, YOLO v3-Tiny, YOLO v4, YOLO v4-Tiny, and YOLO v5s models were used in this study; summary information about them is given in Table 1 [17]. YOLO v5s has 224 layers and 7.2 million trainable parameters. In Table 1, B indicates how many anchor boxes are used for each detection and C indicates the number of classes.

Table 1 Characteristics of YOLO models (v3, v4, v5s, and Tiny) [17]

2.2 Dataset

The MS COCO dataset, a large-scale object detection, segmentation, and image captioning dataset [36], was used in this study. It contains a total of 1.5 million object instances across 80 object categories, including the person category. Since this study concerns only person detection, only the images in the person category of MS COCO were used.
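
In practice, this restriction amounts to keeping only detections whose class index corresponds to person. A minimal sketch is given below; with the standard coco.names class list shipped alongside Darknet models, person is index 0, but this should be checked against the names file actually used.

```python
def keep_persons(detections, names_path="coco.names"):
    """Filter (box, class_id) pairs, keeping only the 'person' class.

    detections: iterable of (box, class_id) pairs produced by a YOLO model.
    """
    with open(names_path) as f:
        classes = [line.strip() for line in f]
    person_id = classes.index("person")  # index 0 in the standard file
    return [box for box, class_id in detections if class_id == person_id]
```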

3 Our proposed approach: area calculation algorithm for a specified region in a video

The aim of this study is to estimate the area of a specified region (Fig. 4) within the space seen in a video. The area of such a region can be calculated in pixels with the formula in Eq. 1. However, owing to factors such as image resolution, camera angle, and video quality, the size in pixels does not represent the actual size of the area, which should be expressed in square meters; a pixel-to-square-meter conversion is therefore required. Estimating the actual size (m²) from an image is a challenging problem, and a common way to overcome it is to take an object of known size in the image as a reference [37]. In this study, the reference object is a person.

Fig. 4
figure 4

Determination of a region as desired

If the area occupied by an average person (Eq. 2) is known, the area of the region in which these people stand can be estimated.

The method proposed in this study is based on the bounding boxes of the persons detected by the YOLO models. A YOLO model generates a rectangular bounding box for each person it detects in the image and provides the height and width of each box in pixels. The areas of these rectangles are calculated according to Eq. 2 and averaged. Dividing this average area (px²) by the area occupied by an average person (m²) yields a conversion coefficient from pixels to square meters, with which the area of any region in the image can be estimated.

Data such as height and shoulder width vary by geography. For example, according to a study [38] of anthropometric measurements in Turkey, the average height was 1708 mm for men and 1598 mm for women (avg. ≈ 1.65 m), and the average shoulder width was 475 mm for men and 366 mm for women (avg. ≈ 0.4 m). In this study, the area covered by an average person is therefore taken as 0.66 m² (Eq. 2).

It is then checked whether the persons detected in the video are within the predetermined borders. Using the heights (h) and widths (w) of the persons detected in the region, the area of each person (Eq. 2) is expanded to approximately 1 m² (Eq. 3). The area of the specified region in square meters is then estimated (Eq. 5), which also determines the maximum number of people allowed in this region. The area calculation algorithm for the specified region is shown in Fig. 5.

Fig. 5
figure 5

Area calculation algorithm for a specified region in a video

$$A=\frac{1}{2}\left|\left({x}_{1}{y}_{2}-{x}_{2}{y}_{1}\right)+\left({x}_{2}{y}_{3}-{x}_{3}{y}_{2}\right)+\dots +\left({x}_{n}{y}_{1}-{x}_{1}{y}_{n}\right)\right|.$$
(1)

With the help of Eq. 1, the area of the specified region is found in pixels as \({R}_{x}\) px²; the \(x\) and \(y\) values denote the vertex coordinates of the specified region (polygon). If a person's area is denoted \({P}_{1}\), with height \(h\) and shoulder width \(w\), the rough area of a person is given by Eq. 2. With \(h \approx 1.65\) m and \(w \approx 0.4\) m, Eq. 2 yields an area of \(1.65 \times 0.4 = 0.66\) m² per person

$${P}_{1}=h\times w.$$
(2)

Since one person per square meter is taken as the threshold value, expanding the area occupied by a person to 1 m² (Eq. 3) makes it possible to express the area of the specified region in m². For this, 3/5 of a person's area must be added to it, since \(0.66 + 0.66 \times 3/5 \approx 1.06\) m² (Eq. 3)

$$\begin{gathered} P_{2} = P_{1} + P_{1} \times \frac{3}{5} \hfill \\ P_{2} \cong 1 m^{2} . \hfill \\ \end{gathered}$$
(3)

As a result of Eq. 3, the area of a person (\({P}_{2}\)) becomes approximately 1 m². The widths and heights of the bounding boxes of the persons detected in the specified region are known in pixels. Using Eq. 2, the average area \({B}_{{x}_{1}}\) of these bounding boxes is calculated, and the same transformation as in Eq. 3 is applied to it to obtain \({B}_{{x}_{2}}\) (Eq. 4).

To calculate the area of the specified region in m², its area in px² is divided by \({B}_{{x}_{2}}\). The resulting value \({R}_{m}\) gives the area of the specified region in square meters and, at one person per square meter, also indicates the person capacity of the region

$${B}_{{x}_{2}}={B}_{{x}_{1}}+{B}_{{x}_{1}}\times \frac{3}{5}$$
(4)
$${R}_{m}={R}_{x}/{B}_{{x}_{2}}.$$
(5)
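
The sketch below pulls Eqs. 1–5 together in Python. It is a minimal illustration under the assumptions of this section (0.66 m² per person, expanded by 3/5 toward the one-person-per-m² threshold); the polygon vertices and bounding boxes at the end are hypothetical example values.

```python
def polygon_area_px2(vertices):
    """Shoelace formula (Eq. 1): area in px2 of a polygon given its vertices."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def region_capacity(vertices, boxes):
    """Region area in m2, which is also its capacity at 1 person/m2 (Eqs. 2-5).

    boxes: (w, h) bounding-box sizes in pixels of persons inside the region.
    """
    r_x = polygon_area_px2(vertices)                  # Eq. 1: region area in px2
    b_x1 = sum(w * h for w, h in boxes) / len(boxes)  # Eq. 2: mean person area in px2
    b_x2 = b_x1 + b_x1 * 3 / 5                        # Eq. 4: expand to ~1 m2 in px2
    return r_x / b_x2                                 # Eq. 5: R_m, area in m2

# Hypothetical example: a quadrilateral region and three detected persons
region = [(100, 100), (500, 120), (480, 400), (90, 380)]
persons = [(40, 110), (38, 105), (42, 115)]
print(round(region_capacity(region, persons), 1), "m2")
```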

3.1 Advantages of the proposed algorithm

The area of the specified region is calculated from the heights and widths of the detected persons. Since these heights and widths scale with the camera resolution, changing the resolution does not adversely affect the operation of the algorithm: as long as the region's area in px² remains the same, realistic results are obtained even when the video resolution changes, because the detected heights and widths of persons vary with the resolution in the same way.

3.2 Disadvantages

The heights and widths of persons are determined by their bounding boxes. When the video resolution is poor or persons are not clearly visible, a bounding box may turn out larger or smaller than it should be, and in such cases the calculated area may deviate.

4 Experimental study

In this study, the Python programming language and the OpenCV library were used together, and the Deepsort algorithm [12] was used for person tracking. Deepsort uses a CNN model for object classification: the CNN determines the most distinctive features of the object to be classified, and each detected object is passed through the network to obtain a vector, called the appearance feature vector. Two objects are associated by comparing these vectors. The algorithm therefore uses information from the previous and current frames to make its estimate for the current frame, without needing to process the entire video at once. The CNN continues to be trained until satisfactory performance is achieved.
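
As a rough illustration of the appearance-based association described above (not Deepsort's full matching cascade, which also uses Kalman-filter motion gating), the sketch below greedily matches detections to tracks by cosine distance between appearance vectors; the embeddings are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two appearance feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def associate(track_vecs, det_vecs, max_dist=0.2):
    """Match each detection to the nearest track by appearance distance."""
    matches = {}
    for j, d in enumerate(det_vecs):
        dists = [cosine_distance(t, d) for t in track_vecs]
        i = int(np.argmin(dists))
        if dists[i] < max_dist:
            matches[j] = i  # detection j continues track i
    return matches

# Hypothetical 4-dimensional embeddings for two tracks and two new detections
tracks = [np.array([0.9, 0.1, 0.0, 0.1]), np.array([0.1, 0.8, 0.2, 0.0])]
dets = [np.array([0.88, 0.12, 0.01, 0.09]), np.array([0.12, 0.79, 0.18, 0.02])]
print(associate(tracks, dets))  # {0: 0, 1: 1}
```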

Before object tracking can be carried out, object detection must first be performed. Object detection locates objects with bounding boxes and determines the location, type, and class of each detected object. The class label used in this study is person. After learning the characteristics of the person class, YOLO divides the image into cells called grids and predicts bounding boxes wherever a cell contains a person. If more than one bounding box is produced per person, non-maximum suppression (NMS) reduces them to one per person. Finally, the number of people is obtained from the number of bounding boxes in the image or video.

To count people in this study, a region with defined borders must first be selected on the video; the specified region can be a triangle, a square, or any polygon (Fig. 4). In each frame, the persons inside the specified region are counted as follows. All object classes except “person” in the MS COCO dataset are discarded, and persons are detected in the images using pre-trained YOLO models. However, only the persons within the specified region, rather than all persons in the image, are taken into account. A simple counter is used: if the center point of a bounding box produced by YOLO lies within the specified region, the counter is incremented by 1; otherwise, that person is not counted. In this way, the counter increases by 1 for each person detected in the region in each frame. The area of the specified region is then calculated in pixels, and its capacity is computed so that the maximum number of persons corresponds to one person per square meter. Examples from the real-time experimental environment are shown in Fig. 7; the image resolutions there are 800 × 600 and 1024 × 768 pixels.
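
A minimal sketch of this center-in-region test, using OpenCV's pointPolygonTest, is given below; the region vertices and detections are placeholder values.

```python
import cv2
import numpy as np

# Placeholder region (polygon vertices) and YOLO detections as (x, y, w, h) boxes
region = np.array([(100, 100), (500, 120), (480, 400), (90, 380)],
                  dtype=np.int32).reshape(-1, 1, 2)
boxes = [(120, 150, 40, 110), (700, 200, 38, 105), (300, 250, 42, 115)]

count = 0
for (x, y, w, h) in boxes:
    cx, cy = x + w / 2.0, y + h / 2.0  # center point of the bounding box
    # pointPolygonTest returns >= 0 when the point is inside or on the border
    if cv2.pointPolygonTest(region, (cx, cy), False) >= 0:
        count += 1

print("persons in region:", count)  # 2 of the 3 example centers fall inside
```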

In this study, all parameters other than the YOLO model were kept the same, for consistency of results and a fair comparison. The characteristics of the video used (also used in [40]), such as the number of frames, duration, and resolution, are given in Table 2.

Table 2 Characteristics of the video used [39] in the testing phase

For example, all experiments used frames of the same resolution. Using the video described in Table 2, the results of detecting persons, counting people, calculating the area of the specified region, and determining the maximum number of persons that can occupy this area are given in Table 3. The steps followed to find the maximum number of people the specified region can hold are shown in Fig. 6. The YOLO-based algorithms were run with input images of 416 × 416 pixels, and an Intersection over Union (IoU, Eq. 8) threshold of 0.4 and a confidence threshold of 0.4 were kept constant across all YOLO models.

Table 3 Characteristics of the specified regions given in Fig. 7 and calculation results
Fig. 6
figure 6

The proposed method to estimate the number of people in a specified region

The YOLO v5 model is not directly supported by OpenCV, as it is implemented in PyTorch. The YOLO v5s model was therefore converted to the ONNX format, which OpenCV supports, and run in that form. Tests were performed at resolutions of 1024 × 768 pixels and 800 × 600 pixels on the images in Fig. 7a and b, respectively, and the persons in the region were counted. If more people are detected than the capacity of the specified region, the region is marked in red; otherwise, it is marked in green (Fig. 8).
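
For reference, the conversion can be performed with the export script shipped in the YOLO v5 repository and the resulting file loaded through OpenCV's DNN module; the exact command-line flags below follow recent versions of that repository and may differ for older ones, and sample.jpg is a placeholder.

```python
# Export (run inside the yolov5 repository; flags may vary between versions):
#   python export.py --weights yolov5s.pt --include onnx --imgsz 416

import cv2

# Load the exported model with OpenCV's ONNX importer
net = cv2.dnn.readNetFromONNX("yolov5s.onnx")
# The blob size must match the size the model was exported with (416 here)
blob = cv2.dnn.blobFromImage(cv2.imread("sample.jpg"), 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
predictions = net.forward()  # rows of (cx, cy, w, h, objectness, class scores)
print(predictions.shape)
```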

Fig. 7
figure 7

Samples taken from the experimental environment: a 1024 × 768 pixels and b 800 × 600 pixels

Fig. 8
figure 8

a Number of persons detected in the region \(<\) the maximum capacity. b Number of persons detected in the region \(>\) the maximum capacity. Note that the region at the top (a) is larger than the region at the bottom (b).

4.1 Computing environment

Deep learning-based applications often require high-end graphics processing units (GPUs), which leads to high hardware costs; deep learning models such as YOLO and Mask R-CNN run at only 1–3 FPS on a CPU. To overcome this problem, NVIDIA GPUs were used together with the OpenCV library. In this study, a computer with an Intel Xeon CPU @ 2.20 GHz with 6 MB of cache memory, an NVIDIA Tesla K80 GPU, and 16 GB of memory was used. CUDA and cuDNN were installed so that the models could run on the GPU.

4.2 Performance metrics

Several metrics are used in this study to measure the performance of the proposed method for counting the people in a specified region of the video and to estimate its accuracy.

4.2.1 Accuracy rate (Acc)

Accuracy shows how close the estimates of the developed method are to the real values; higher values are better. The accuracy rate formula is given in Eq. 6 (TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative)

$$\mathrm{Acc}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}.$$
(6)

4.2.2 Mean average precision

A common evaluation metric, Mean Average Precision (mAP), provides a single number: the mean of the Average Precision (AP) values over all classes. This makes it possible to assess a model's performance with just one figure, which is why mAP is the evaluation metric most frequently used for object detection algorithms. It is calculated as

$$\mathrm{mAP}=\frac{{\sum }_{q=1}^{Q}\mathrm{Ave}P(q)}{Q},$$
(7)

where Q represents the total number of queries in the set and AveP(q) is the average precision for query q.

Since there is only a “person” class in our study, the number of classes is 1. mAP (0.5) means that the IoU threshold (Eq. 8) is set to 0.5 when calculating AP: if IoU ≥ 0.5, the prediction is considered correct

$$\mathrm{IoU}=\frac{\mathrm{Area\; of\; Overlap}}{\mathrm{Area\; of\; Union}}.$$
(8)
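
A small sketch of the IoU computation in Eq. 8 for axis-aligned boxes given as (x, y, w, h); the two example boxes are arbitrary.

```python
def iou(box_a, box_b):
    """Intersection over Union (Eq. 8) for boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 50, 100, 100)))  # 2500 / 17500 ~= 0.143
```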

4.2.3 FPS

Inference speed (in milliseconds, ms) and frames per second (FPS) are the most popular metrics for real-time applications. Inference means using a trained deep learning model to generate predictions, and FPS describes how many images can be processed in a second. Real-time applications need low inference times, i.e., high FPS values.
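
FPS can be estimated by timing inference over a sequence of frames; in the sketch below, process_frame is a placeholder for the detection step.

```python
import time

def measure_fps(frames, process_frame):
    """Average frames per second achieved by process_frame over the given frames."""
    start = time.time()
    for frame in frames:
        process_frame(frame)
    elapsed = time.time() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```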

5 Results and discussion

The areas of the specified regions in Fig. 7a and b are 72,600 and 72,500 px², respectively. Regions of the same size were selected in the same video frame at different resolutions, which makes it possible to determine to what extent the algorithm gives consistent results. The areas calculated by YOLO v5 for the specified regions in Fig. 7a and b are shown in Table 3.

Based on the data in Table 3, the area of the region whose boundaries are marked in Fig. 7a was estimated as 18.8 m². Since the calculation assumes one person per square meter, a maximum of 19 persons can fit inside this region; in the video frame from which the image was taken, the total number of persons in this region was 8. For the region whose borders are marked in Fig. 7b, a maximum of 28 persons can fit.

The widths and heights of persons appear larger in a 1024 × 768 pixels image than in an 800 × 600 pixels image. Therefore, if a region of the same pixel size is selected in both figures, the capacity of the region in the higher-resolution image will be lower.

Table 4 gives the activation and loss functions used by the YOLO models, together with the estimated accuracy, video processing time, and FPS values. YOLO v3 read the video at 9.91 FPS, while YOLO v5s read it at 20.06 FPS. Although YOLO v3 had the lowest FPS value, it reached the highest estimated accuracy rate, 88.39%. The YOLO v3 algorithm processed the 14.3-s video in 36.09 s on the NVIDIA CUDA-capable GPU, so it clearly has a speed issue. The YOLO v5s model can be used to eliminate this issue: it read the 14.3-s video in 17.84 s, reaching 20.06 FPS.

Table 4 Performance comparisons of YOLO models for detecting persons in specified regions

The fact that the video processing time and the FPS value of the YOLO v5s algorithm are close to each other shows the algorithm's balance in terms of time and speed. YOLO v5s correctly predicted the persons in the specified area of the video with an accuracy rate of 78.14%. Despite the reduced size of its pre-trained weights, this result is satisfactory; in terms of accuracy, YOLO v5s is more successful than all the other models except YOLO v3.

Whether the persons in the specified regions were correctly detected was analyzed for each frame of the videos, and the results are given in Table 5. The total number of persons in the specified region is the number of persons, summed over all frames of the video, inside the region whose borders were determined; the total number of persons detected in the specified region is the number of those persons detected by the YOLO algorithm used. The failure of the YOLO v4 and YOLO v5s algorithms to detect some persons is due to the IoU and confidence thresholds being set to 0.4. These persons are detected when the threshold values are lowered, but values below 0.3 are not recommended.

Table 5 Metrics for detecting persons in specified region (mAP 0.5)

In the Tiny models, and especially YOLO v3-Tiny, some frames still yield no detections even when the IoU and confidence thresholds are set to 0.3 or below. This is due to the significant reduction in the number of layers of the YOLO v3-Tiny model, which has the fewest layers of all the models (Table 1). The YOLO v3 algorithm detected the persons that should be in the specified area with an accuracy rate of 96.89%, compared with 96.12% for YOLO v5s. Likewise, YOLO v3 gave the highest mAP (0.5) value, 98.8%. The fact that YOLO v3 produces higher values than YOLO v5s can be explained by the YOLO v3 model reaching more frames during video processing. The YOLO v3-Tiny model gave the lowest accuracy and mAP (0.5) values, 58.13% and 75.9%, respectively.

Since people are more prominent in high-resolution video, or when they are close to the camera, setting the IoU and confidence thresholds to 0.75 poses no problem in such cases. In low-resolution images, however, many people are not recognized when the IoU and confidence thresholds are 0.75. Table 6 shows the results for people in the specified region according to the mAP (0.75) metric, by which YOLO v3 and YOLO v5s outperformed the other models, while the Tiny models scored quite low. Between mAP (0.50) and mAP (0.75), the difference in the number of people recognized is 26 for YOLO v3, 34 for YOLO v4, and 28 for YOLO v5s; in the Tiny models, the difference is 40 for v3-Tiny and 46 for v4-Tiny.

Table 6 Metrics for detecting persons in specified region (mAP 0.75)

6 Conclusion

The COVID-19 pandemic introduced rules that changed our lives. One of them limits the number of persons allowed in indoor spaces: the capacity of an indoor area is bounded and should not be exceeded. Enforcing this requires calculating the size of the indoor area and counting the persons in it. In this study, using the Python programming language, the OpenCV library, and YOLO models, the area of a region that we specified in a video was estimated in square meters, and the approximate capacity of the region was then calculated. The proposed algorithm for calculating the area of the specified region is based on the bounding boxes that the YOLO models produce when detecting persons. Since only persons were considered in the study, the pixel-to-square-meter conversion was based on the real-life dimensions of persons. The properties of the bounding boxes produced when detecting persons vary from model to model, both in the sizes of the boxes and in the prediction confidences attached to them. Therefore, five YOLO models were used in this study and their performances were compared.

According to the results of the study (Table 5), the YOLO v3, YOLO v5s, and YOLO v4 models detected the persons within the specified region with accuracies of 96.89%, 96.12%, and 94.57%, respectively. After testing on a number of videos, the YOLO v3 and YOLO v5s models gave much better results than the others. The results also indicate a direct relationship between the number of persons in the specified area and detection performance. In the test videos, YOLO v5s ran faster than the YOLO v3 and YOLO v4 models in terms of frames per second (FPS). In real-time video processing, analysis may be slower once camera properties, computer performance, and Internet speed come into play, and results may differ from image to image; in high-resolution images, the results are closer to reality. Restricting detection to the person class, instead of scanning for all object classes in the MS COCO dataset, increases processing speed; in this way, the models can be tailored to what is needed.

The YOLO v3-Tiny and YOLO v4-Tiny models have lower object recognition rates than YOLO v3, YOLO v4, and YOLO v5s, as their numbers of layers and weights are significantly reduced. However, they are sufficient in terms of FPS and speed and do not strain the computer system, so the YOLO v4-Tiny model can be used on low-performance devices. The real-time object detection performance of the YOLO v3 and YOLO v4 models decreases on poorly equipped computers, so they are not recommended for such systems. The YOLO v5s model produced satisfactory results in terms of accuracy, video read time, and FPS despite its small weight size and high number of layers; it can therefore be used on both low- and high-performance devices.

In future studies, we plan to measure the distance between people, inspired by the approach developed here, in order to investigate whether the social distance between people stays within reasonable values. In addition, the application and comparison of one-stage (YOLO, SSD, RetinaNet, etc.) and two-stage (R-CNN, Fast R-CNN, Faster R-CNN, etc.) object detection architectures on the problems discussed here will be considered.