1 Introduction

Computer vision is a highly researched field, with efforts directed toward enabling machines to comprehend and interpret complex visual content. Object detection, which involves identifying and locating objects of interest in images or videos, is a significant challenge in this domain. Deep learning, a subfield of machine learning and AI, gained prominence in the early 2000s after artificial neural networks, multilayer perceptrons, and support vector machines became popular. Initially, however, deep learning faced scalability issues and high computing-power requirements, which limited its adoption; the availability of large datasets and powerful computers since 2006 has contributed significantly to its widespread popularity.

Object detection is a method in computer vision that involves identifying and localizing objects within images or videos. The main objective is to precisely detect objects' existence, location, and dimensions in an image or video and label them with an appropriate class label. The underlying deep learning techniques have a wide range of applications, including but not limited to prediction of stock values [1], recognition of speech [2], object detection [3], recognition of characters [4], intrusion detection [5], detection of landslides [6], time series problems [7], classification of text [8], gene expression [9], micro-blogs [10], data handling [11], irregular data with fault classification [12], captioning of text from images [13, 14], aspect-based sentiment analysis [15], and generation of captions from videos [16]. Object detection models utilize a range of algorithms and deep learning architectures to detect and classify objects in real-world scenarios.

Object detection can be performed on different forms of data, i.e., images, video, and audio. It refers to the ability of computer and software systems to find and identify individual items within an image or scene. Object detection in video operates much as it does in images: such a tool enables the computer to find, recognize, and categorize the things visible in the provided moving images. Object detectors can also identify objects based on various sounds.

Object detection using machine learning models refers to a set of algorithms that can automatically identify and locate objects in images or videos. These models employ feature extraction, feature selection, and classification techniques to recognize objects in visual data. To train these models, labeled images are provided in which each object of interest is labeled with its corresponding class. The model then utilizes these labeled images to learn features specific to each class of objects. Several machine learning models are available for object detection, including support vector machines (SVM), decision trees, and random forests [17, 18]. These models differ in their feature extraction and classification approaches and may perform differently depending on the task and data at hand. Some of these models require manual feature engineering, while others can automatically learn features from the input data.

Deep learning models refer to a class of neural networks that can automatically identify and locate objects in images or videos. These models utilize multiple layers of processing units to extract complex features from the input data, which makes them effective for object detection tasks. Examples include CNNs, R-CNNs, SSDs, and You Only Look Once (YOLO) models, which can recognize objects accurately and detect multiple objects in a single image or video. Training a deep learning model for object detection involves providing a large dataset of labeled images or videos, with each object labeled by class and bounding-box coordinates. The model learns to identify and locate objects by minimizing a loss function that measures the difference between predicted and ground-truth labels and bounding boxes. These models are used in applications such as autonomous driving, surveillance, and robotics.

Classification, localization, and segmentation (Fig. 1) are three crucial tasks that object detection models aim to accomplish. Classification refers to identifying the object's category in an image or video by assigning a class label to the entire image or a specific region of interest. For example, a model can identify a car, a pedestrian, or a traffic sign in an image. Localization refers to identifying the object's location in an image or video by drawing a bounding box around it, which provides the coordinates of the object's position within the image, enabling the model to locate the object accurately. Segmentation refers to identifying the pixels that belong to an object in an image or video, enabling the model to create a pixel-level mask that outlines the object's shape. Each segment usually shares similar color, texture, and pixel intensity. This technique is more precise than bounding-box localization and can be useful in scenarios where precise object boundaries are necessary, such as medical imaging or satellite imagery analysis. Object detection models typically aim to perform all three tasks simultaneously to comprehensively understand the objects in an image or video.

Fig. 1 Image classification and object detection

The key implementation steps for object detection are:

  1. 1.

    **Data collection and annotation**: Collect a large dataset of images or videos with labeled objects. The labels should include the class of each object and its corresponding bounding-box coordinates.

  2. 2.

    **Pre-processing of data**: Prepare the data for training by performing tasks such as resizing, normalization, and cleaning of annotations.

  3. 3.

    **Selecting a model**: Choose a suitable object detection model based on the specific requirements of the task, such as accuracy, speed, and computational resources.

  4. 4.

    **Training the model**: Train the selected model on the labeled dataset using a suitable training algorithm and optimization technique. This involves adjusting the weights and biases of the model to minimize the loss function.

  5. 5.

    **Validation and testing**: Validate the model on a separate dataset to check its performance and fine-tune the hyperparameters if necessary. Test the final model on a held-out test set to evaluate its generalization ability.

  6. 6.

    **Deployment**: Deploy the trained model on a production environment or integrate it into a larger system for real-world use. This involves optimizing the model for inference speed and memory usage and ensuring its compatibility with the target hardware and software platform.
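The workflow above can be sketched in a few lines of Python. The following is a minimal, illustrative sketch using the ultralytics package (the YOLOv8 family); the dataset file name `my_dataset.yaml` and the chosen hyperparameters are placeholders, not values prescribed by any particular study.

```python
from ultralytics import YOLO  # pip install ultralytics

# Select a model (step 3): start from a small pre-trained checkpoint.
model = YOLO("yolov8n.pt")

# Train (step 4) on an annotated dataset described by a YAML file (placeholder
# name); images and bounding-box labels are expected in the YOLO format.
model.train(data="my_dataset.yaml", epochs=50, imgsz=640, batch=16)

# Validate and test (step 5): compute mAP and related metrics on the val split.
metrics = model.val()

# Deploy (step 6): export to a format suited to the target platform.
model.export(format="onnx")
```

The same steps apply to other detectors; only the model selection and export target change with the deployment constraints.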

Training techniques for object detection involve the methods and strategies used to train models to accurately detect and locate objects in images or videos. Here are some common training techniques for object detection:

  1. 1.

    Supervised learning: This is the most common training technique for object detection. Each object in the dataset is annotated with its corresponding class label and bounding-box coordinates. The model learns to detect and localize objects in the input data by optimizing a loss function.

  2. 2.

    Transfer learning: This technique involves starting from a model pre-trained on a large dataset and fine-tuning it on the target task. This can save time and computing resources, as the model has already learned general features useful for object detection.

  3. 3.

    Augmentation of data: New training samples are generated by applying transformations such as flipping, cropping, scaling, and color changes to existing images (see the sketch after this list). It can help improve the model's generalization ability to new and unseen data.

  4. 4.

    One-shot learning: This technique involves training the model to detect objects with only one or a few examples of each class. It can be useful in scenarios where obtaining a large labeled dataset is difficult.

  5. 5.

    Active learning: Involves selecting the most informative and uncertain samples from a pool of unlabeled data and presenting them to a human annotator for labeling. The labeled data is then used to train the object detection model, and the process is repeated iteratively to improve the model’s performance.

  6. 6.

    Reinforcement learning: This involves training the object detection model using a reward-based system, where the model learns to maximize a reward signal by detecting objects accurately. Reinforcement learning can be useful in scenarios where the object detection task involves complex and dynamic environments, such as robotics or autonomous driving.
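As a concrete illustration of the data augmentation point above, the following sketch uses the albumentations library to apply random flips and color jitter while keeping the bounding boxes consistent with the transformed image; the image, box coordinates, and labels are made-up example values.

```python
import numpy as np
import albumentations as A

# Detection-aware augmentations: boxes are transformed together with the image.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
        A.Resize(height=640, width=640),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy image
boxes = [[100, 150, 300, 400]]                    # [x_min, y_min, x_max, y_max]
labels = ["car"]

augmented = transform(image=image, bboxes=boxes, class_labels=labels)
print(augmented["bboxes"], augmented["class_labels"])
```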

Object detection is a crucial component of computer vision, with commercial applications in video surveillance, healthcare, and in-vehicle sensing. Although a challenging problem, it has advanced considerably over the past decade, and each year the research community sets a new standard for excellence. Deep neural networks and the massive processing capacity of NVIDIA graphics processing units made this possible. There have been two distinct periods in the development of object detection:

  • Until around 2012, conventional computer vision methods were in use.

  • When AlexNet triumphed in the ImageNet Large Scale Visual Recognition Challenge in 2012, a new era for convolutional neural networks began.

In Fig. 2, the development of object detection algorithms is depicted. Early object detection methods, such as Viola-Jones, Histogram of Oriented Gradients, and Deformable Parts Model, relied on manual feature extraction from the image, such as edges, corners, and gradients, and traditional machine learning algorithms.

Fig. 2 Evolution of object detection algorithms

After that, cutting-edge image classification architectures were adopted as feature extractors in object detection. The two problems are connected, since both depend on discovering reliable high-level features. The paper "Rich feature hierarchies for accurate object detection and semantic segmentation" introduced R-CNN and demonstrated how convolutional features could be employed for object detection. Recent years have seen tremendous advancement in object detection. Deep learning detection techniques can be divided into two categories.

  • Two-stage object detection: object region proposal is the first step in a two-stage process, followed by object classification from the region proposals and bounding-box regression. Although slower than other detectors, these detectors have the highest accuracy. They include the R-CNN, Faster R-CNN, and Mask R-CNN algorithms.

  • One-stage object detection eliminates the region proposal step and predicts bounding boxes directly from images. These detectors are much faster than two-stage detectors, but they have trouble picking up small objects. Their quick inference speed makes single-stage detectors suitable for practical applications. Single-stage detectors such as SSD, YOLO, and EfficientDet belong to this second category.

This section has provided an overview of computer vision and deep learning, object detection and related terminology, key implementation steps, a timeline of how object detection algorithms have developed, and the review's main contributions. Our analysis focuses on an in-depth examination of the design of YOLO and its architectural successors, the optimizations brought to each successor, and the fierce competition with various two-stage object detectors.

The paper is organized as follows. In Sect. 2, the study looks at a few survey papers on YOLO architectures. In Sect. 3, we review the different YOLO versions, YOLO's design concepts, and the many pre-trained models used in them. In Sect. 4, we review the datasets and evaluation metrics of YOLO. Section 5 presents a comparative analysis of YOLO versions regarding performance, architectures, and input size, providing some statistics on their relative effectiveness. Section 6 provides a detailed analysis of challenges and future research directions. Finally, we wrap up the paper with the conclusion.

2 Related Work

2.1 Prior Analysis in YOLO Algorithms

Only a few survey studies have been published, but they provide a solid overview of the history of YOLO algorithms. The authors in [19] presented a review of two-stage and one-stage techniques, an architectural overview of YOLO versions, and a comparison analysis among them. In [20], the author focused on an overview of the YOLO versions through public data.

2.2 Novelty and Contributions

Most evaluations and reviews cover both one-stage and two-stage object detection techniques. To the best of our knowledge, this assessment addresses single-stage techniques specifically through the YOLO family of algorithms. Here, we thoroughly analyze YOLO algorithms based on fundamental architectures, benefits and drawbacks, comparative and incremental approaches in this field, well-known datasets, outcomes, and potential future applications. The contributions include the following:

  1. (a)

    Highlight each stage's difficulties and significance in the object detection process.

  2. (b)

    The necessity of single-stage object detectors and a thorough analysis of YOLO's incremental architectural features, suggested optimization methods, and YOLO-based applications.

  3. (c)

    Illustration of comparisons made between several versions of YOLO in terms of performance and outcomes, as well as discussion of potential directions for future study in single-stage object detectors.

3 Evolution of YOLO Algorithms

This section presents the basic principles, designs, and incremental methods of the various YOLO algorithms, as summarized in Fig. 3.

Fig. 3 Timeline of YOLO algorithms

The basic terms related to YOLO architecture are briefed below.

CNN: Object detection is a crucial task in computer vision, and CNNs have played a significant role in advancing this field. CNNs can extract relevant features from images and use them to classify and locate objects within the image, making them well suited for this task. The Region-based CNN (R-CNN) family of algorithms is a popular approach for object detection using CNNs. These algorithms generate a set of region proposals, use a CNN to extract features from each proposal, and use these features to classify and refine the object's location within each proposal. Advances in CNN-based object detection have greatly improved both its speed and accuracy.

Convolutional layer: DenseNet-169 is a layered architecture used for classification that incorporates convolutional layers. When an input is fed into a convolutional layer, a filter is applied to activate it. This process generates a feature map that shows the relative importance of different features within the data. The activation function, ReLU, is then applied to the feature map. A dot-product operation is used to compute the convolutional layer output. In the DenseNet-169 architecture, a convolutional layer with a filter of dimensions \(d \times d\) is applied after a square neuron layer of size \(S \times S\), resulting in an output of size \(\left( {S - d + 1} \right) \times \left( {S - d + 1} \right)\). Equation (1) computes the non-linear input to component \(ij\) by incorporating input from all the cells in the previous layer

$$s_{ij}^{l} = \mathop \sum \limits_{x = 0}^{d - 1} \mathop \sum \limits_{y = 0}^{d - 1} \mu_{xy} {\mathcal{L}}_{{\left( {i + x} \right)\left( {j + y} \right)}}^{l - 1} .$$
(1)

The non-linearity of the model is assessed through Eq. (2)

$${\mathcal{L}}_{ij}^{l} = \lambda (s_{ij}^{l} ).$$
(2)

Max-pooling layer: Max pooling is a technique that subsamples a tensor's spatial dimensions while preserving its depth. Overlapping max pooling uses a stride smaller than the pooling window, so adjacent windows overlap. To improve convergence and generalization while avoiding scaling issues, it is recommended to include a max-pooling layer; this layer can follow every convolution layer or only a subset of them. Equation (3) gives the output size of a max-pooling layer \(M_{p}\) with filter size k applied to an input of dimensions \(k_{x} \times k_{y} \times k_{z}\) with stride s

$$M_{p} = \frac{{\left( {k_{x} - k + 1} \right)}}{s} \times \frac{{\left( {k_{y} - k + 1} \right)}}{s} \times k_{z} .$$
(3)

Global average pooling: The global average pooling layer, which has no trainable parameters, can replace the flattening layer typically placed after the last pooling layer in a convolutional neural network. This technique significantly reduces the input and prepares the network for the subsequent classification layer. In fully connected layers, overfitting is a concern that can be addressed using dropout, and the global average pooling layer can help with this. Global average pooling performs an even more extensive form of dimensionality reduction by reducing a tensor with original dimensions of \(l \times b \times h\) to dimensions of \(1 \times 1 \times h\). For each of the \(h\) feature maps, the layer produces a single value by taking the mean of all \(l \times b\) entries.

Fully connected layer (FCL): A fully connected neural network layer establishes a linear connection between input and output neurons. The information learned in lower layers can then be used to classify data at the FCL. An advantage of FCLs is that they can handle input data with no structural assumptions. To interpret the activation at a given layer with dimensions \(l_{1} \times l_{2} \times l_{3}\), a multilayer perceptron function (MPF) is constructed to produce a class probability distribution. The final layer of the MPF-based multilayer perceptron has \(1 \times 1 \times d\) output neurons, where d is the number of output classes. Equation (4) is used to compute the MPF

$$p_{i}^{l^{\prime}} = f\left( {\mathop \sum \limits_{j = 1}^{{x_{i}^{l} }} w_{i,j}^{l} \times p_{j}^{l} } \right).$$
(4)

The purpose of the fully connected structure would be to provide a probability interpretation of each category by altering the weight parameters \(w_{i,j}^{l} { }\) based on the feature map produced by the linear combination of the convolutional, non-linearity, rectification, and pooling layers.

Softmax layer: The softmax function takes a vector of real numbers as input, where each element can be positive, negative, or zero, and maps it to a probability distribution: large inputs receive high probabilities, while small or negative inputs receive probabilities close to zero. The normalization factor in the denominator ensures that the outputs sum to one.
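A minimal numerical sketch of the softmax normalization described above (plain NumPy, not tied to any particular YOLO implementation):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtracting the maximum improves numerical stability; it cancels out
    # in the ratio, so the resulting distribution is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
print(softmax(logits))   # approx. [0.705, 0.259, 0.035]; the values sum to 1
```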

3.1 YOLO (V1)

On June 8th, 2015, YOLO (V1) [21] was introduced. It employs a convolutional neural network that involves two main processes: early convolutional layers extract image features, and fully connected layers predict output probabilities and coordinates. The model's architecture is inspired by the GoogLeNet framework and was trained and evaluated on the PASCAL VOC 2007 and 2012 datasets using the Darknet framework. In YOLO (V1), the GoogLeNet inception modules are replaced with (1 × 1) convolutional filters followed by (3 × 3) filters, except for the first layer, which uses a (7 × 7) filter. Figure 4 illustrates that YOLO (V1) has 24 convolution layers and two fully connected layers. Max-pooling layers follow only four of the convolutional layers. This version of the method highlights the use of (1 × 1) convolution and global average pooling.

Fig. 4 Localization and detection of objects based on YOLO architecture

The authors spent around a week pre-training the model on the ImageNet 2012 dataset, using the first 20 convolutional layers followed by an average pooling layer and a fully connected layer. Four more convolutional layers and two fully connected layers with random initialization are then added to fine-tune the model for object detection. Large localization errors and limited recall are two key issues with this implementation of YOLO compared to two-stage object detectors.

A Fast YOLO variant of YOLO (V1) with a simpler model is proposed for quicker object detection; it has nine convolutional layers with fewer filters. YOLO-LITE [22] is a different version of YOLO designed specifically for real-time object detection on non-GPU machines. The authors show that shallower networks may detect objects without explicitly requiring accelerators. Additionally, they show that batch normalization hinders a shallow network's object detection ability. Table 1 summarizes the features of YOLO (V1).

Table 1 Features of YOLO (V1)

3.2 YOLO (V2)

The "YOLO9000: Better, Faster, Stronger" [23] paper was released by Redmon and Farhadi at the CVPR 2017 conference. In this study, the authors offered two cutting-edge YOLO variations, YOLOv2 and YOLO9000, which were identical but had different training approaches. YOLO (V2), the successor to YOLO (V1), can be extended to detect over 9000 categories. Most object detection methods in use at the time could only classify objects into a small number of categories, due to a lack of labeled object data. The authors therefore experimented with scaling the object detection task to more categories; combining the COCO dataset with ImageNet produced more than 9418 object categories.

The architecture of YOLO (V2) is influenced by VGG and Network-in-Network. As shown in Table 2, it employs the Darknet-19 structure, which consists of 19 convolutional layers and max-pooling layers. In contrast to the base version, it adds considerably more functionality. Various data-augmentation methods, including random crops and rotations, are used for model training; nevertheless, this version has trouble detecting smaller objects. In addition to using pre-existing features such as global average pooling and (1 × 1) convolution, the authors also introduced new optimization approaches. Table 3 summarizes the key features of YOLO (V2).

Table 2 An architectural framework for Darknet-19 based on layer-wise operations
Table 3 Summarizing the key features of YOLO (V2)

3.3 YOLO (V3)

The third iteration, YOLO (V3), was introduced in Joseph Redmon and Ali Farhadi's paper "YOLOv3: An Incremental Improvement" [24] in 2018. Although slightly larger than the prior models, it remained adequate in speed and accuracy. YOLO (V3) is an enhanced version of YOLO (V1) and YOLO (V2); it recognizes objects in real time in videos, live streams, or still photographs. While the first version of YOLO had localization issues, the second version had difficulty detecting smaller items. Trained on the COCO dataset [24], the third iteration addresses these problems and provides a quick and easy way to find objects. This version excels at handling smaller objects but struggles with medium and large objects.

The YOLO (V3) design is built on the Darknet-53 framework, a network with 53 convolutional layers that employs 3 × 3 and 1 × 1 convolutional filters together with shortcut connections. It is twice as fast as ResNet-152 without sacrificing performance. Figure 5 illustrates the general architecture that underpins YOLO (V3).

Fig. 5 The architecture of YOLO (V3)

The Feature Pyramid Network (FPN) served as an inspiration for YOLO (V3). It uses FPN-like up-sampling, skip connections, and strategies like residual blocks. Like FPN, YOLO (V3) detects objects using feature maps and (1 × 1) convolutions, producing feature maps at three different scales with the input down-sampled by factors of 32, 16, and 8. The first detection is made after the initial 81 layers, where a (1 × 1) detection kernel produces a (13 × 13) feature map. The second detection is made at the 94th layer with a stride of 16: the feature map from the 79th layer is passed through a few convolutions, up-sampled by 2×, and concatenated with the 61st layer, yielding a (26 × 26) feature map. The third detection is made at the 106th layer after applying a stride of 8, producing a (52 × 52) feature map.
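For a 416 × 416 input and the 80 COCO classes, the three detection scales described above yield the tensor shapes sketched below (each grid cell predicts 3 boxes, each with 4 coordinates, 1 objectness score, and the class scores); this is an illustrative calculation, not code from the official implementation.

```python
num_classes = 80      # COCO classes
boxes_per_cell = 3    # anchor boxes per scale
input_size = 416

for stride in (32, 16, 8):                 # coarse to fine detection scales
    grid = input_size // stride
    depth = boxes_per_cell * (5 + num_classes)
    print((grid, grid, depth))             # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```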

Fine-grained features are extracted by concatenating down-sampled and up-sampled feature maps to detect tiny objects. The three feature maps \(\left( { 52 \times 52,\; 26 \times 26,\; 13 \times 13 } \right)\) are employed to detect small, medium-sized, and large objects, respectively. Table 4 summarizes the key features of YOLO (V3).

Table 4 Summarizing the key features of YOLO (V3)

3.4 YOLO (V4)

The YOLO (V4) architecture is the result of a series of experiments and studies aimed at improving the accuracy and speed of the convolutional neural network. The paper "YOLOv4: Optimal Speed and Accuracy of Object Detection" was published in 2020; it aims to create an object detector suitable for production systems. YOLO (V4) surpassed all previous versions in terms of both speed and accuracy. Figure 6 presents the key features of YOLO (V4).

Fig. 6 YOLO (V4) possible attributes

To create the YOLO (V4) architecture, the authors compared CSP-ResNeXt50, CSP-Darknet53, and EfficientNetB3. They chose CSP-Darknet53, which has 29 convolutional layers with 3 × 3 filters and about 27.6 M parameters, as the backbone network, because it outperforms the other architectures. Figure 7 shows that CSPNets provide rich gradient combinations at low computational cost.

Fig. 7 Comparison of the CSPNet and DenseNet architectures used in YOLO (V4)

Classification on COCO [25] is accomplished with an ImageNet pre-trained model. This study employed spatial pyramid pooling (SPP), a method also utilized in the R-CNN family. Normally, the fully connected layers after a CNN fix the input and output sizes; SPP instead pools feature maps of arbitrary size into a fixed-length representation before the fully connected layers, so the input does not need to be resized or cropped. The Path Aggregation Network (PANet), which uses adaptive feature pooling and bottom-up path augmentation, is preferred over the Feature Pyramid Network. Table 5 summarizes the key features of YOLO (V4).

Table 5 Summarizing the key features of YOLO (V4)

3.5 Scaled YOLO V4

The authors presented a paper named "Scaled-YOLOv4: Scaling Cross Stage Partial Network" [26]. By effectively scaling the network's design and structure, Scaled-YOLOv4 improves on the Google Research Brain team's EfficientDet model.

Regarding speed and accuracy, the suggested detection network, which is based on the Cross-Stage Partial approach, outperforms previous benchmarks across both small and large object detection models. The scaling technique allows the network's depth, width, resolution, and structure to be modified.

On the other hand, the simplified variant known as Scaled-YOLOv4-tiny uses TensorRT optimization (batch size = 4) to reach 22.0% AP at a rate of approximately 443 FPS. Scaled-YOLOv4 differs from YOLO (V4) in the following aspects:

  • Optimized network scaling techniques are used in ScaledYOLOv4.

  • Increased network training speed with modified activations for width and height.

  • CSP connections and MISH activation are used in the Neck (Path-Aggregation Network) as part of improved network architecture.

The YOLOv4 network was trained on multiple resolutions using a single network rather than training a network for each resolution. Table 6 summarizes the key features of ScaledYOLO (V4).

Table 6 Key features of Scaled YOLO (V4)

3.6 PP-YOLO

In August 2020, researchers published the paper "PP-YOLO: An Effective and Efficient Implementation of Object Detector" [27]. The PP-YOLO object detector is constructed on the YOLO (V3) architecture. Previous YOLO versions were implemented in the Darknet and PyTorch frameworks, whereas PP-YOLO is built in PaddlePaddle.

The main objective is a PP-YOLO object detector that can be used directly in real-world application scenarios with a good balance of efficacy and efficiency, an objective that aligns with the motive of the PaddleDetection development kit. Combining a range of tricks and tactics makes the detector more effective and efficient, and the paper shows how each step improves performance.

Like YOLO (V4), the PP-YOLO model combines various existing tricks to reduce model parameters and FLOPs while improving detector accuracy and keeping the detection speed almost unchanged. Unlike YOLO (V4), PP-YOLO did not examine backbones such as Darknet-53 or ResNeXt50, nor did it apply Neural Architecture Search to find model hyperparameters. Table 7 summarizes the key features of PP-YOLO.

Table 7 Key features of PP-YOLO

With all these tricks and techniques combined, PP-YOLO achieved 45.2% mAP at 72.9 FPS when evaluated on a Volta V100 GPU with a batch size of one. This detector surpasses YOLO (V4), EfficientDet, and RetinaNet in efficiency and effectiveness. The PP-YOLO detector consists of three sections:

  • Backbone: the proposed model uses ResNet50-vd-dcn as the backbone. Fully convolutional networks serve as the object detector's backbone, extracting feature maps from the input image, and share many characteristics with a trained image classification model. In the final stage of the proposed backbone, the 3 × 3 convolution layers are replaced with deformable convolutions. ResNet50-vd has significantly fewer parameters and FLOPs than Darknet-53; with it, an mAP of 39.1 was achieved, which is better than YOLO (V3).

  • Detection neck: the Feature Pyramid Network (FPN) constructs a pyramid of features.

  • Detection head: the detection head, the last step in the detection pipeline, predicts the box coordinates of objects. The PP-YOLO head is identical to the YOLO (V3) head: a 3 × 3 convolution layer followed by a 1 × 1 convolution layer produces the output.

3.7 YOLO (V5)

Glenn Jocher, CEO of Ultralytics, posted YOLO (V5) on GitHub in 2020, 2 months after YOLO (V4). A collection of object detection architectures pre-trained on the MS-COCO dataset is available in YOLO (V5). Its debut followed those of EfficientDet and YOLOv4. The fact that this is the only YOLO object detector without a research paper caused some controversy initially, but the controversy subsided as soon as its capabilities were demonstrated. The important features of YOLO (V5) are represented in Fig. 8.

Fig. 8 YOLO (V5) possible attributes

A cutting-edge entry in the YOLO object detection series, YOLO (V5) raised the bar for object detection models through constant effort and 58 open-source contributors. YOLO (V5) is a set of compound-scaled object detection models developed using the COCO dataset. The model has several useful features, such as the ability to perform test-time augmentation, model ensembling, hyperparameter evolution, and export to various formats including ONNX, CoreML, and TFLite.

Although YOLO (V5) is not a direct replacement for YOLO (V4), its structural architecture is the same. The following are its components:

  • Input: an image, patch, or other piece of data is presented to the system as input.

  • Backbone: the neural network backbone is what learns the features. YOLOv5's Cross-Stage Partial (CSP) networks form its skeleton.

  • Neck: feature pyramids are built using the neck. Before being transmitted for prediction, it has layers that mix and combine visual features. PANet serves as YOLO (V5)'s neck.

  • Head: The head receives the output from the neck and uses it to generate predictions for both classes and boxes. The head might be either one or two stages for dense or sparse prediction.

YOLO (V5) processes images with a single neural network and then separates them into various portions. Using an automatic anchoring technique, each part receives its own anchor box, which increases accuracy. The entire process is automated: if the default anchor boxes are inaccurate, new anchor boxes are computed. With this, the system analyses the input and predicts the outcome. Table 8 summarizes the key features of YOLO (V5).

Table 8 Key features of YOLO (V5)

3.8 YOLO-X

"YOLOX: Exceeding YOLO Series in 2021" [28] was published by the authors in 2021. Among the earlier versions only YOLO (V1) is anchor-free; YOLOX is anchor-free as well. Decoupled heads, strong data-augmentation approaches, and SimOTA label assignment are used to obtain state-of-the-art results. As part of CVPR 2021's Workshop on Autonomous Driving, YOLOX came in first with its YOLOX-L model. On the MS-COCO dataset, YOLOX-Nano achieved 25.3% AP, exceeding NanoDet by 1.8% AP. The COCO accuracy increased from 44.3 to 47.3% after adding various modifications to YOLOv3. At 68.9 frames per second on a Tesla V100, the YOLOX-L model achieved 50.0% average precision on COCO, exceeding YOLOv5-L by 1.8% AP.

Developers and researchers can readily use YOLO-X, since it was implemented in the PyTorch framework. ONNX, TensorRT, and OpenVINO deployment versions are also available for YOLOX. Table 9 summarizes the key features of YOLO-X.

Table 9 Key features of YOLO-X

3.9 YOLO-R

YOLO-R, unlike YOLO (V1)–YOLO (V5), takes a different approach in terms of authorship, design, and model infrastructure, specifically for object detection. While YOLO stands for "You Only Look Once", YOLO-R stands for "You Only Learn One Representation". The YOLO-R network incorporates both implicit information and explicit knowledge, both of which are considered beneficial for learning from data and input. Similar to mammalian brains, YOLO-R is based on the co-encoding of implicit and explicit knowledge, creating a unified network that can represent multiple tasks simultaneously. This is achieved through a convolutional neural network with multi-task learning, which performs three notable procedures: kernel space alignment, prediction refinement, and multi-task learning within a single model. Figure 9 demonstrates that a neural network already trained with explicit knowledge performs better when implicit knowledge is added.

Fig. 9 The YOLO-R architecture

YOLO-R achieved object-detection precision comparable to Scaled-YOLOv4 while increasing inference speed by 88%. According to the paper, YOLO-R's mean average precision is 3.8% greater than that of PP-YOLO (V2). Table 10 provides a summary of YOLO-R.

Table 10 Key features of YOLO-R

3.10 PP-YOLOV2

"PP-YOLO (V2): A Practical Object Detector" was released by Baidu in 2021 and made a significant impact in the object detection field. This project aimed to create a fast and accurate object detector, building on the success of the original PP-YOLO. The authors used a strategy of assembling different methods and procedures and emphasized ablation studies to create a well-balanced and effective detector. PP-YOLO(V2) incorporated several enhancements that significantly improved performance, increasing mean average precision from 45.9 to 49.5% on the MS-COCO 2017 test dataset. Moreover, it achieved a high frame rate of 68.9 FPS at 640 × 640 image resolution. Unlike PP-YOLO, which used only the ResNet-50 backbone architecture, PP-YOLO(V2) used two different backbone architectures.

When the detector's ResNet50 backbone was replaced with ResNet101, PP-YOLO(V2) reached 50.3% mAP, matching YOLOv5x's performance while being approximately 16% faster. Table 11 summarizes the key features of PP-YOLO(V2).

Table 11 Key features of PP-YOLO(V2)

3.11 YOLO (V6)

YOLO (V6) aims to address practical issues that arise in industrial applications. Meituan's Visual Intelligence Department developed the object detection framework MT-YOLOv6 [29] in 2022.

YOLO (V6) is a single-stage object detection framework with a hardware-friendly architecture. Its detection precision and inference speed are superior to those of YOLO (V5). The main attributes of YOLO (V6) are shown in Fig. 10.

Fig. 10 Possible attributes of YOLO (V6)

The YOLOv6 architecture focuses on the following primary advancements.

EfficientRep backbone: This backbone architecture is different from YOLOv5's CSP backbone and has been designed to provide powerful representational ability while making good use of hardware processing resources.

Rep-PAN neck: A more effective feature fusion network was created for YOLO (V6), following hardware-aware neural network design ideas. It was designed to improve hardware utilization and better balance accuracy and speed.

Decoupled head: YOLO (V6) uses a decoupled head structure with a simplified design, balancing the representational capacity of the relevant operators against the computational burden on the hardware.

Effective training strategies: The anchor-free paradigm, the SimOTA label-assignment strategy, and the SIoU bounding-box regression loss are used by YOLOv6 to increase detection accuracy.

Model deployment is made significantly simpler by YOLO (V6)’s support for a variety of deployment techniques. YOLO (V6) has 2 × faster inference time and greater mean Average Precision (mAP) than V5. Table 12 summarizes the key features of YOLO (V6).

Table 12 Key features of YOLO (V6)

3.12 YOLO (V7)

YOLO (V7) [30] is an object detector whose outstanding features transformed the computer vision market in 2022. The official YOLO (V7) offers remarkable speed and accuracy compared with its earlier iterations. No pre-trained weights are employed; instead, YOLO (V7) weights are trained from scratch on Microsoft's COCO dataset. The main attributes of YOLO (V7) are shown in Fig. 11.

Fig. 11 Possible attributes of YOLO (V7)

The YOLO (V7) architecture's primary focal features include the following:

"Extended Efficient Layer Aggregation Network (E-ELAN)": E-ELAN mainly concentrates on computational density and model architectural characteristics. By regulating the gradient path, E-ELAN improves the learning and convergence capabilities of deeper networks.

“Model Scaling for Concatenation-Based Models”: Concatenation-based model scaling involves scaling calculation block depth and transmission layer width.

"Planned re-parameterized convolution": a RepConvN layer, i.e., RepConv without identity connections, can take the place of RepConv.

"Coarse for auxiliary and fine for lead loss": this label assigner uses ground-truth labels together with the lead head's predictions to create coarse labels for training the auxiliary head and fine labels for the lead head.

At its release, YOLO (V7) was the most recent entry in the YOLO series. Building on prior work, it considerably enhances detection speed and accuracy. The research recommends E-ELAN as the overall design and explains how expand, shuffle, and merge cardinality operations continuously improve the learning capacity of the network. E-ELAN can direct different groups of computational blocks to learn diverse features.

YOLO (V7) is a young algorithm that is still under development; the difficulties its developers are tackling leave considerable room for advancement. Once widely adopted, the algorithm will be very helpful in resolving many computer vision problems. Table 13 summarizes the key features of YOLO (V7).

Table 13 Key features of YOLO (V7)

3.13 YOLO (V8)

Ultralytics, the company behind the development of YOLO (V5), released YOLO (V8) in January 2023. While there are no published papers on this version yet, it has been noted that YOLOv8 follows the recent trend of anchor-free models, resulting in fewer box predictions and faster non-maximum suppression (NMS). Additionally, YOLO (V8) uses mosaic augmentation during training. However, it has been observed that using this technique throughout the entire training process can be harmful, so it has been disabled for the last ten epochs. YOLO (V8) is available both as a command line interface (CLI) tool and as a PIP package, and it includes various integrations for labeling, training, and deployment.

YOLOv8x was tested on the MS-COCO dataset using the test-dev 2017 split and achieved an average precision (AP) of 53.9% when evaluated on images of size 640 pixels; in comparison, YOLOv5 achieved an AP of 50.7% at the same input size. Table 14 summarizes the key features of YOLO (V8).

Table 14 Key features of YOLO (V8)

4 Training Parameters, Datasets, and Evaluation Metrics

4.1 Training Parameters

Here are some common training parameters used in YOLO and its variants:

  1. 1.

    Batch size: The batch size determines the number of images that are processed in a single forward/backward pass of the neural network during training. A larger batch size can improve training speed, but it also requires more memory.

  2. 2.

    Learning rate: The learning rate controls how much the model’s parameters are adjusted with each update during training. A higher learning rate can lead to faster training but may also cause the model to converge on a suboptimal solution. A lower learning rate may result in slower training, but the model is more likely to converge to a better solution.

  3. 3.

    Number of epochs: The number of epochs is the number of times the entire training dataset is processed during training. A higher number of epochs can lead to overfitting, while too few epochs can result in underfitting.

  4. 4.

    Augmentation of data: It refers to the process of creating new training data from existing data by applying transformations, such as rotation, scaling, and flipping. Data augmentation can help improve the model's ability to generalize to new data and reduce overfitting.

  5. 5.

    Objectness threshold: The objectness threshold is the minimum score required for an object to be considered a positive detection. Increasing the objectness threshold can reduce false positives, but it can also increase false negatives.

  6. 6.

    Intersection over Union (IoU) threshold: The IoU threshold is used to determine whether a predicted bounding box overlaps sufficiently with a ground-truth bounding box. Increasing the IoU threshold can increase the accuracy of the model but can also result in fewer positive detections.

4.1.1 Multi-scale Training in YOLO

It is a technique used to enhance the performance of the YOLO model in detecting objects. Unlike traditional training, where the model is trained on a fixed input image size, multi-scale training involves training the model on multiple scales of input images. During training, input images are randomly resized to different scales, and the model is trained on batches of images with different scales. The YOLO model is updated with the gradients computed from the loss function for each image scale, allowing it to effectively detect objects at different scales. This technique is helpful in scenarios where objects of various sizes may be present.
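A minimal sketch of multi-scale training, assuming a PyTorch training loop: every batch is resized to a size drawn at random from a set of multiples of 32 (the candidate sizes and the use of normalized box coordinates are assumptions made for illustration).

```python
import random
import torch
import torch.nn.functional as F

SCALES = [320 + 32 * i for i in range(10)]   # 320, 352, ..., 608

def resize_batch(images: torch.Tensor) -> torch.Tensor:
    """Resize a batch (N, C, H, W) to one randomly chosen square size."""
    size = random.choice(SCALES)
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)

# Inside the training loop (sketch): if box coordinates are stored normalized
# to [0, 1], they need no adjustment; absolute pixel coordinates must be rescaled.
batch = torch.randn(8, 3, 416, 416)
print(resize_batch(batch).shape)   # e.g. torch.Size([8, 3, 512, 512])
```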

4.1.2 Attention Mechanisms in YOLO

Attention mechanisms are used in various computer vision tasks, including object detection. In YOLO, attention mechanisms can help concentrate the model's attention on the image regions most critical for object detection. One technique for utilizing attention in YOLO is spatial attention, which weights the network's feature maps based on their relevance to the detection task. These attention weights are then used to adjust the feature maps before the final object detection step.

Another approach is channel-wise attention, which involves weighing the feature maps based on their relevance to the detection task across channels. This can be achieved by calculating a channel-wise attention vector based on the feature map's global statistics, such as mean and variance. The channel-wise attention vector is then applied to re-weight the feature maps before the final object detection step. YOLO (V4) introduced Spatial Pyramid Pooling (SPP) attention, a new mechanism that uses a spatial pyramid pooling layer to extract multi-scale features from the image. A convolutional block is then utilized to apply various attention mechanisms to the feature maps. Overall, utilizing attention mechanisms in YOLO can enhance the model's accuracy and speed by focusing on the most relevant parts of the image.
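The following is a minimal PyTorch sketch of a squeeze-and-excitation style channel-wise attention block of the kind described above; it is illustrative only and is not the exact module used in any released YOLO version.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Re-weights feature maps using channel statistics (global average pooling)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))      # squeeze: (n, c)
        return x * weights.view(n, c, 1, 1)        # excite: re-weight each channel

features = torch.randn(2, 64, 52, 52)
print(ChannelAttention(64)(features).shape)        # torch.Size([2, 64, 52, 52])
```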

4.1.3 Non-maximum Suppression

It is a technique used in object detection models to improve their accuracy by removing redundant bounding boxes. Since object detection models tend to generate multiple bounding boxes with varying confidence scores for the same object, NMS helps to filter out those boxes that are irrelevant or redundant, and retains only the most precise ones. Figure 12 illustrates the effect of NMS on an object detection model’s output by reducing the number of overlapping bounding boxes.

Fig. 12 Results before and after processing through NMS
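A compact NumPy sketch of the greedy NMS procedure described above (box coordinates are [x1, y1, x2, y2]; the IoU threshold of 0.5 is an illustrative default):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedily keep the highest-scoring boxes and drop overlapping ones."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the kept box with the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou < iou_thresh]
    return keep

boxes = np.array([[0, 0, 100, 100], [5, 5, 105, 105], [200, 200, 300, 300]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed
```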

4.1.4 Activation Functions

Activation functions play a crucial role in deep learning models, including YOLO and its variants, by introducing non-linearity to the output of each layer. Here are some commonly used activation functions in YOLO and its variants:

  1. 1.

    Rectified Linear Unit (ReLU): A simple and widely used activation function that returns the input if it is positive and 0 otherwise. It is defined as \(f\left( x \right) = {\text{max}}\left( {0,x} \right)\).

  2. 2.

    LeakyReLU: A variant of ReLU that adds a small slope to the negative values to avoid dying neurons. It is defined as \(f\left( x \right) = {\text{max}}\left( {\alpha x,x} \right)\), where \(\alpha\) is a small positive constant.

  3. 3.

    Swish: A relatively new activation function that is a smoothed version of ReLU, defined as \(f\left( x \right) = x \times {\text{sigmoid}}\left( x \right)\).

  4. 4.

    Mish: Another novel activation function similar to Swish, but with a more gradual transition from linear to non-linear behavior. It is defined as \(f\left( x \right) = x \times {\text{tanh}}\left( {{\text{softplus}}\left( x \right)} \right)\).

  5. 5.

    Hardswish: A faster and more memory-efficient variant of Swish that uses the thresholded linear function instead of the sigmoid function.

  6. 6.

    Sigmoid: A commonly used activation function for binary classification tasks, defined as \(f\left( x \right) = 1/\left( {1 + e^{ - x} } \right)\).

  7. 7.

    Softmax: A function used to convert a vector of real numbers into a probability distribution over several classes, often used in the final layer of a classification network.

The choice of activation function can significantly impact the performance and convergence speed of a deep learning model. Different activation functions may be more suitable for different tasks and architectures, so their selection should be carefully evaluated.
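To make the definitions above concrete, here is a small NumPy sketch of several of these activations; the test values are arbitrary.

```python
import numpy as np

def relu(x):                  return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.1): return np.maximum(alpha * x, x)
def sigmoid(x):               return 1.0 / (1.0 + np.exp(-x))
def swish(x):                 return x * sigmoid(x)
def softplus(x):              return np.log1p(np.exp(x))
def mish(x):                  return x * np.tanh(softplus(x))

x = np.array([-2.0, -0.5, 0.0, 1.5])
for name, fn in [("ReLU", relu), ("LeakyReLU", leaky_relu), ("Swish", swish), ("Mish", mish)]:
    print(name, fn(x))
```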

4.2 Datasets

One of the most commonly used and recognized datasets for computer vision applications is the MS-COCO dataset, as illustrated in Fig. 13a and b. The dataset contains relatively few categories, but each category has many instances. It has 91 distinct categories of items, such as people, dogs, trains, and other everyday objects. Many occurrences are observed in each category [31], along with various attributes per image.

Fig. 13 a Sample images from the MS-COCO dataset. b Classes in the MS-COCO dataset

The Pascal Visual Object Classes (VOC) dataset [32] is another dataset for object categorization, segmentation, and detection. Growing from 4 classes in 2005 to 20 classes in 2007, the dataset community consistently made contributions, keeping it on par with more recent developments. Figure 14a and b lists the many classes of Pascal VOC. The training dataset consists of approximately 11,530 pictures, 27,540 regions of interest, and 6929 segmentations.

Fig. 14 a Sample images of Pascal VOC. b Classes of Pascal VOC

Several other datasets are commonly used as follows:

  • ImageNet [33]: This dataset includes over 1 million labeled images of objects from 1000 different categories. Although ImageNet is commonly used for image classification tasks, it has also been used as a pre-training dataset for object detection models.

  • KITTI [34]: This dataset includes images and videos captured from a car driving around urban environments, with annotations for various objects, such as cars, pedestrians, and cyclists. KITTI is often used to test object detection models designed for use in autonomous vehicles.

  • Open Images [25]: This dataset includes millions of images with annotations for various objects, including some rare and unusual classes. Open Images is a large and diverse dataset for training and testing object detection models.

  • Visual Genome [35]: This dataset includes images with rich annotations describing the objects, attributes, and relationships in the scene. Visual Genome has been used to train object detection models that can reason about the context and relationships between objects in the scene.

4.3 Evaluation Metrics

Several evaluation metrics are used in YOLO and its variants for object detection. Here are some of the most commonly used ones:

  1. 1.

    Average precision (AP): Average precision (AP) is a widely used metric in object detection that measures the model's accuracy in detecting objects at different levels of precision. It calculates the area under the precision–recall curve (AUC-PR) over different thresholds. The formula for calculating AP is shown in Eq. (5), where \(P_{i}\) and \(R_{i}\) denote the precision and recall at the i-th threshold

    $${\text{AP}} = \mathop \sum \limits_{i = 1}^{n} \left( {R_{i} - R_{i - 1} } \right)P_{i} .$$
    (5)
  2. 2.

    Intersection over Union (IoU): Intersection over Union (IoU) is a metric that measures the overlap between the predicted bounding box and the ground-truth bounding box. It is calculated as the ratio of the intersection area to the union area of the two boxes (see the numerical sketch after this list). The formula for calculating IoU is shown in Eq. (6)

    $${\text{IoU}} = \frac{{\text{ Area of Overlap}}}{{\text{Area of Union}}}.$$
    (6)
  3. 3.

    Mean Average Precision (mAP): Mean Average Precision (mAP) is the average of the AP’s calculated at different levels of precision. It is used to measure the model’s overall performance across all classes. The mAP can be calculated as shown in Eq. (7)

    $${\text{mAP}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} {\text{AP}}_{i} .$$
    (7)
  4. 4.

    False-Positive Rate (FPR): The False-Positive Rate (FPR) is the proportion of negative samples incorrectly classified as positive. It is used to measure the model's performance in detecting false positives. The formula for calculating FPR is shown in Eq. (8)

    $${\text{FPR}} = \frac{{\text{False Positive}}}{{{\text{False Positive}} + {\text{True Negative}}}}.$$
    (8)
  5. 5.

    Recall: The recall is the proportion of positive samples the model correctly identifies. It is used to measure the model's performance in detecting true positives. The formula for calculating Recall is shown in Eq. (9)

    $${\text{Recall}} = \frac{{\text{True Positive}}}{{{\text{True Positive}} + {\text{False Negative}}}}.$$
    (9)
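The metrics above can be illustrated with a short NumPy sketch: an IoU function matching Eq. (6) and the discrete AP sum of Eq. (5). The box coordinates and precision/recall values are made-up examples.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2] (Eq. 6)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recalls, precisions):
    """Discrete AP of Eq. (5): sum of (R_i - R_{i-1}) * P_i, with R_0 = 0."""
    r = np.concatenate(([0.0], np.asarray(recalls)))
    p = np.asarray(precisions)
    return float(np.sum((r[1:] - r[:-1]) * p))

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))            # 25 / 175 ≈ 0.143
print(average_precision([0.5, 1.0], [1.0, 0.67]))     # 0.5*1.0 + 0.5*0.67 = 0.835
```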

5 Comparison Analysis of YOLO in Different Aspects

This section presents a comparative analysis of different YOLO models from several aspects. The comparison of YOLO (V7) with respect to other models is shown in Table 15.

Table 15 Comparison analysis of YOLO (V7) with other models

Table 16 compares YOLO versions. Most YOLO versions are implemented in Darknet. As mentioned before, each version introduces several optimizations. Multi-scale training improves the YOLO (V2) model's performance. YOLO(V3) introduced the FPN architecture to improve performance in detecting objects at different scales. YOLO(V4) and YOLO(V5) improved the architecture using the CSP (Cross-Stage Partial) approach. YOLO-X, on the other hand, introduced a decoupled head and backbone architecture to achieve better performance with fewer parameters. YOLO-Tiny uses a lightweight architecture with few layers to reduce the computational cost, making it suitable for deployment on mobile devices with limited computational resources. YOLO (V6) addresses practical issues relating to industrial applications. YOLO (V7) offers impressive speed and accuracy.

Table 16 Comparative analysis in terms of architecture

Table 17 and Fig. 15 show YOLO version performance in terms of frames per second (FPS), mean average precision (mAP), and average precision (AP). A single- or two-stage technique may be utilized depending on the application and dataset.

Table 17 Performance results for different versions of YOLOs
Fig. 15 Plot of performance results for different versions of YOLO

Table 18 analyzes YOLO's performance with varied input sizes. Performance results of YOLO’s concerning different parameters and flops are analyzed and shown in Table 19. Note that the AP50 (COCO) metric is the Average Precision (AP) at 50% IoU threshold on the COCO validation dataset.

Table 18 Performance results of YOLO’s concerning different input sizes
Table 19 Performance results of YOLO’s concerning different parameters and flops

Table 20 summarizes the activation function, optimizer, momentum, weight decay, and learning rate used in different versions of the YOLO object detection algorithm.

Table 20 Performance results of YOLO’s concerning different hyperparameters

Table 21 shows that YOLOv5 has the highest mAP@0.5 on the COCO dataset among the listed algorithms, followed by PP-YOLOv2 and YOLOv3. Scaled YOLOv4, PP-YOLO, and YOLOv4 also have high mAP@0.5 scores but lower FPS compared to YOLOv5 and PP-YOLOv2. Table 22 compares YOLO-based algorithms on the Open Images dataset and other popular object detection datasets.

Table 21 Performance results of YOLO’s concerning COCO dataset
Table 22 Performance results of YOLO’s concerning Open Images dataset

Table 23 compares YOLO-based algorithms on the KITTI dataset and other popular object detection datasets.

Table 23 Performance results of YOLO’s concerning KITTI dataset

Table 24 compares YOLO-based algorithms on the Visual Genome dataset and other popular object detection datasets.

Table 24 Performance results of YOLO’s concerning Visual Genome dataset

Finally, Table 25 lists some YOLO-based detection and recognition applications.

Table 25 YOLOs-based detection and recognition applications

The tables summarize the work regarding illustrative comparisons, empirical findings, and practical implications.

6 Challenges and Future Directions

Here are some challenges in object detection:

  1. 1.

    Variability in object appearance: Objects in images can have different shapes, sizes, and colors, which makes it difficult to detect them accurately.

  2. 2.

    Occlusion: Objects in real-world scenarios can be partially or fully occluded by other objects or environmental factors such as shadows or reflections, making it challenging for the detection model to locate them.

  3. 3.

    Scale variation: Objects can appear at different scales in an image or video, and detecting them at all scales is computationally expensive.

  4. 4.

    Illumination changes: Changes in lighting conditions can affect the appearance of objects, making it difficult for the model to recognize them accurately.

  5. 5.

    Limited training data: Training an accurate object detection model requires a large amount of labeled data, which can be time-consuming and expensive to collect and annotate.

  6. 6.

    Computational complexity: Object detection models can be computationally expensive, requiring powerful hardware such as GPUs to train and deploy.

Here are some potential future directions for object detection models:

  1. 1.

    One potential development area for object detection models is improving their speed and efficiency. While many current models are highly accurate, they can be computationally expensive and time-consuming, especially in real-time applications. Future research could focus on developing more lightweight and efficient models without sacrificing too much in terms of accuracy.

  2. 2.

    Another potential development area is improving the robustness of object detection models. Current models are often highly dependent on the training data and can struggle to generalize to new and different environments. Future research could focus on developing more adaptable and flexible models that can perform well even in highly variable and dynamic environments.

  3. 3.

    Developing more specialized and task-specific object detection models is another potential direction. While many current models are general purpose and can be used for various applications, there are often specific use cases where a more specialized model would be more effective. For example, a model designed specifically for detecting objects in medical images might be more effective than a general-purpose model.

  4. 4.

    Finally, another potential development area is integrating object detection models with other models and technologies, such as natural language processing or augmented reality. By combining object detection with other technologies, it may be possible to create more sophisticated and powerful applications to understand and interact with the world in new and exciting ways.

7 Conclusion

This study provides a detailed understanding of the YOLO architecture and its variants, along with their strengths and weaknesses, making it a valuable resource for anyone interested in object detection with YOLO. This research paper presents a detailed analysis of the latest progress in object detection using YOLO and its various variants. The paper discusses the evolution of the YOLO architecture and the improvements made in each version. It also discusses various techniques used to improve the performance of YOLO and its variants, including multi-scale training, feature pyramid networks, and attention mechanisms. Additionally, the paper compares the performance of YOLO and its variants with other state-of-the-art object detection algorithms on various benchmark datasets. Overall, the paper concludes that YOLO and its variants have achieved state-of-the-art performance on various benchmark datasets regarding accuracy, speed, and memory consumption. The paper also highlights the limitations of YOLO and its variants, such as their inability to detect small objects and their sensitivity to object aspect ratios. The paper suggests that future research can address the limitations of YOLO and its variants, explore new architectures, and develop techniques to improve their accuracy and speed further. Additionally, the paper highlights the potential applications of object detection algorithms in various domains, such as autonomous driving, robotics, and surveillance systems.