1 Introduction and Positioning

The reliability of the transmission network is of paramount importance, especially due to electrification and the increase in global energy demand. The failure of insulators in this system can have serious economic consequences, so special attention should be paid to them. Insulators of different types and materials have different failure mechanisms. This paper focuses on the detection of strain-type porcelain long rod insulators, which are typically used as overhead insulation in high-voltage transmission lines. The tension they accommodate is influenced by temperature and wind load and can also be influenced by ice load. Long rod insulators are considered unpuncturable, but they can crack or break under thermal stress (Macey et al., 2004). To help detect these power system faults in time, unmanned aerial vehicles (UAVs) and computer vision (CV) algorithms can be applied.

Generally, in fault detection there are methods based on signal diagnosis, such as fuzzy methods and neural networks. Janarthanan et al. (2021) propose a technique using fuzzy logic systems and artificial neural networks for photovoltaic (PV) failure analysis. The study finds that both methods have similar detection performance and can detect specific fault categories such as damaged PV modules and partially shaded PV units. Gao et al. (2018) propose a distributed filtering scheme using interval Takagi-Sugeno fuzzy models to deal with fault detection in nonlinear stochastic systems with wireless sensor networks. Their approach utilizes a novel type of fuzzy distributed fault detection filter for each sensor node, adopts a fault reference model, and formulates the overall fault detection system in a fuzzy model framework, which is validated through simulation. Zhou and Tang (2021) develop a fuzzy classification approach to address the limited availability of training data for diagnosing and prognosing gear systems. Their fuzzy neural network classifies an unseen fault scenario into the nearest fault class with a probability, enabling effective diagnosis under limited data, and is validated through systematic case studies using experimental data from a laboratory-scale gear system.

However, insulator fault detection is typically done based on drone videos and CV, since the drone-based approach does not require any physical contact with the insulator or the transmission lines, making it a non-intrusive method for fault detection. This is particularly advantageous for high-voltage transmission lines, where physical contact can be dangerous and costly. Furthermore, CV algorithms have been shown to be highly accurate in detecting insulator faults and can be trained to identify different types of faults as well.

Object detection is one of the most fundamental challenges of CV. There are basically three approaches that can be used to cope with this task. The oldest methods are traditional solutions like Viola-Jones (2001) or the deformable part-based model algorithm (Felzenszwalb et al., 2008), which are nowhere near as robust as their deep learning successors. Methods using deep neural networks are basically classified into two groups (Zou, 2019; Zou et al., 2019): the one-stage and two-stage groups, in other words the You Only Look Once (YOLO) / single shot detector (SSD) type and the region-based type. Several works show that YOLO achieves a good trade-off between speed and accuracy, as opposed to region-based methods, which are not fast enough. Lee and Kim (2020) draw this conclusion for airplane detection, Kim et al. (2020) make a similar finding for vehicle detection, and Li et al. (2020) and Sumit et al. (2020) report the same when detecting agricultural greenhouses and human figures. Narayanswamy et al. (2022) demonstrate that YOLO had a waste detection accuracy of 88%, only 3% lower than Faster R-CNN, while having faster processing times. Kurdthongmee et al. (2022) achieve real-time pupil detection without requiring a GPU accelerator by utilizing the tiny architecture of YOLOv3, resulting in a detection time 2.8 times faster than the state-of-the-art approach. Thus, it can be assumed that a one-stage architecture may be a better choice for real-time recordings made by UAVs. Besides the few examples highlighted, there are many other applications based on versions of the YOLO model (Al-qaness et al., 2021; Chen et al., 2021a; Cheng et al., 2021b; Dewi et al., 2022b; Hu et al., 2021; Junos et al., 2022; Karaci, 2022; Kumar et al., 2021; Wageeh et al., 2021; Xianbao et al., 2021; Zhao et al., 2022; Zhou et al., 2021), but in some cases even more optimized architectures can be developed for special applications, such as the one developed by Akhtar (2022) for soldering fault detection in printed circuit boards.

In the field of insulator and defective insulator detection, the use of YOLO methods is not exclusive; there are other algorithms as well. Among the works from 10 to 12 years ago, there are solutions based on traditional image processing algorithms. Examples include the use of the Hough transform (Zhang et al., 2010) or the support vector machine (SVM) algorithm (Li et al., 2012). Today, deep learning algorithms are used for these tasks due to their robustness (Van et al., 2018). There are again one-stage and two-stage solutions. Among the region-based methods, Zhang et al. (2022) use Faster R-CNN to locate insulators and extract them from the image, followed by a segmentation step; this work also notes that YOLO is a faster but less accurate method. Hu et al. (2022) achieve a 1.63% increase in \(\hbox {mAP}_{50}\) by replacing the VGG16 backbone of the network with ResNet50 and adding a channel attention module to the architecture. Zhao et al. (2019) improve the anchor generation method of the R-CNN network and achieve an 81.8% average precision, which would be 27.81% lower without the improvement. In addition to those listed here, other region-based solutions can be found (Ma et al., 2017; Tomaszewski et al., 2020; Wang et al., 2020b). The weak point of these approaches is their inability to run in real time. Given the limited battery capacity of UAVs and the length of lines in transmission networks, it is definitely desirable for UAVs not to spend a lot of time on an insulator. But it is undeniable that these are currently the most accurate algorithms for object detection.

Fig. 1 Flowchart of the detection scheme

One-stage insulator detection can be separated into YOLO and SSD methods. Typically, SSD methods provide the fastest solution but can be very inaccurate when detecting small objects. The issue of small objects is particularly important, as this paper presents a UAV video recording-based application. It should also be emphasized that, due to strong electromagnetic fields, UAVs cannot easily get very close to the insulators. Thus, one of the YOLO architectures may be more suitable for this application, especially since significant progress has been made since YOLOv3 to detect small objects directly (Liu et al., 2020). Over the past year or two, several researchers have addressed this topic. Chen et al. (2022a) propose a YOLOv4-based method specifically for insulator detection. This model, called INSU-YOLO, has slightly more parameters than the original YOLOv4 and performs better by 4.59%, although it has a slightly lower frames per second (FPS) value. Liu et al. (2021c) improve the YOLOv3 architecture and achieve good performance for insulator fault detection in aerial images with diverse backgrounds. In another work, the same research group (Liu et al., 2021b) increases the feature reuse in the low-resolution feature layers of the YOLOv3 and YOLOv4 architectures and also improves the loss function. Liu et al. (2021a) apply similar feature reuse to YOLOv2- and YOLOv3-based architectures and achieve 94.47% precision with YOLOv3-dense. YOLOv5, the newest of the YOLO generations, is used in many new works. Feng et al. (2021) try four different versions of YOLOv5 and achieve 95.5% \(\hbox {mAP}_{50}\). Gao et al. (2021) improve the feature extraction by establishing a certain spatial relationship between the residual transformation and the rotation operation and reach 93.4% \(\hbox {mAP}_{50}\). Feng and Jiang (2021) also use different versions of YOLOv5, but for recognizing station post insulators.

To summarize the problem: the detection of faults in power transmission lines is crucial due to the economic consequences of insulator failure, and detecting insulator faults using UAVs and CV algorithms is a non-intrusive and effective method. YOLO deep learning architectures are commonly used for object detection and have shown good performance in insulator detection, although small object detection and real-time processing remain challenging. Some of the applications listed here only perform insulator detection, while others detect faults as well; the ultimate goal of all this research is the same: to help the work of transmission system operators by filtering out individual insulator faults semi-automatically or fully automatically. YOLO architectures can be a good solution to this problem, and several versions of them are used in several studies. However, benchmarking of the different versions of the YOLOv3, YOLOv4, and YOLOv5 architectures is still poorly covered in the field of insulator detection.

The main contribution of this work is the comparison of the YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, YOLOv5n6, YOLOv5s6, YOLOv5m6, and YOLOv5l6 (Ultralytics, 2020b), YOLOv4, YOLOv4-tiny, and YOLOv4-csp (Ultralytics, 2020c), and YOLOv3, YOLOv3-tiny, and YOLOv3-spp (Ultralytics, 2020a) architectures. The main aspects of comparison are Precision, Recall, \(\hbox {mAP}_{50}\), box loss, object loss, and class loss. This work is the author’s first publication in the field of insulator detection, so there is no overlap or similarity with his previous publications.

The rest of the paper is organized as follows. Section 2 introduces the methods and data used for the work. Evaluation and results are presented and discussed in Sect. 3, and finally, conclusions are drawn in Sect. 4.

2 Methods and Data

In this section, the study’s design, implementation, and analysis are presented. First, an overview of the YOLO algorithm and its different versions is provided in Sect. 2.1. In Sect. 2.2, the main new features in YOLOv3 are highlighted, while Sect. 2.3 presents the effective new features in YOLOv4. The YOLOv5 architecture by Ultralytics (2020b) is discussed in Sect. 2.4. The hyperparameters used for the YOLO models are then presented in Sect. 2.5, with an emphasis on how they were selected through extensive experimentation. Finally, the data management techniques used to create the dataset are described in Sect. 2.6, highlighting the use of a comprehensive data augmentation pipeline and the U-Net (Ronneberger et al., 2015) convolutional neural network for insulator segmentation, which together ensure that the synthesized images in the dataset closely resemble real defective insulators. This makes the dataset representative of real-world scenarios, thereby ensuring reliable and valid results. The complete detection scheme flowchart is shown in Fig. 1.

2.1 The Basics of the YOLO Algorithm

Here, a brief overview of the basic concepts of the YOLO algorithm is presented; this describes how YOLO works up to version v2. The basic operation of YOLO is shown in Fig. 2.

Fig. 2 YOLO detection system (Redmon et al., 2016)

The essence of the YOLO algorithm is that it does not use region proposals, as, say, R-CNN does, but generates a large number of bounding box outputs and selects the appropriate ones. YOLO is a fully convolutional network (FCN) architecture that compresses an image into an \(S \cdot S\) grid of N-dimensional cells thanks to its different stride values. Bounding boxes can be estimated from these grid cells. If B bounding boxes are estimated from each grid cell, a total of \(S \cdot S \cdot B\) bounding boxes are available by the end of the process. Importantly, each object is estimated from the grid cell where its center falls after the stride operations. So-called non-maximum suppression is also used: if a cell yields several significantly overlapping bounding boxes associated with an object, the bounding box with the highest confidence is chosen. The bounding box estimation process is as follows. B bounding boxes are estimated from each grid cell with the same parameters, which gives \((5 + NC) \cdot B\) outputs from each cell, where NC is the number of classes and the other five parameters are x, y, w, h, and confidence. In YOLO, x and y determine the center of the object to be detected relative to the width and height of the entire image, and w and h define the width and height of the object the same way.
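
To make this output layout concrete, the following minimal Python sketch decodes such a grid tensor into candidate boxes, following the \((5 + NC) \cdot B\) per-cell layout described above. The tensor shape, the threshold value, and the random test input are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def decode_yolo_output(output, S, B, NC, conf_threshold=0.5):
    """Collect candidate boxes from an S x S x (B*(5+NC)) YOLO output tensor.

    Each cell predicts B boxes of (x, y, w, h, confidence) plus NC
    class-conditional probabilities; x, y, w, h are assumed to be
    already normalized to the whole image, as described above.
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col].reshape(B, 5 + NC)
            for x, y, w, h, conf, *class_probs in cell:
                if conf >= conf_threshold:
                    class_id = int(np.argmax(class_probs))
                    boxes.append((x, y, w, h, conf, class_id))
    return boxes

# S*S*B candidate boxes in total; e.g. S=7, B=2 gives 98 before filtering.
preds = np.random.rand(7, 7, 2 * (5 + 2)).astype(np.float32)
print(len(decode_yolo_output(preds, S=7, B=2, NC=2, conf_threshold=0.0)))  # 98
```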

An important part of the operation of YOLO is the use of anchor boxes, introduced in version v2. These are obtained by clustering all w and h values from the training database, and from the B bounding box outputs, selection is done by similarity to the anchor boxes (Redmon & Farhadi, 2017).
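
A minimal sketch of this clustering idea is given below, assuming the YOLOv2-style k-means with a \(1 - \text{IoU}\) distance between origin-aligned boxes; the function name, iteration count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def kmeans_anchors(wh, k, iterations=50, seed=0):
    """Cluster (w, h) pairs from the training labels into k anchor boxes.

    Uses the YOLOv2 idea of d = 1 - IoU as the distance, where IoU is
    computed between boxes that share the same top-left corner.
    """
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iterations):
        # IoU between every labeled box and every anchor (origin-aligned).
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
                 * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1]
                 + anchors[None, :, 0] * anchors[None, :, 1] - inter)
        nearest = np.argmax(inter / union, axis=1)  # max IoU == min 1 - IoU
        anchors = np.array([wh[nearest == i].mean(axis=0)
                            if np.any(nearest == i) else anchors[i]
                            for i in range(k)])
    return anchors

# Synthetic (w, h) pairs standing in for the training labels.
wh = np.abs(np.random.default_rng(1).normal([0.30, 0.10], 0.05, size=(500, 2)))
print(kmeans_anchors(wh, k=5))
```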

2.2 The Main New Feature in YOLOv3

There are several newer versions of YOLO. A significant change in v3 is that there are more upscalings: not only an \(S \cdot S\) grid is used, but also a \(2S \cdot 2S\) grid and a \(3S \cdot 3S\) grid. This is very useful because, as mentioned in the Introduction, YOLOv3 is designed to improve the detection of small objects, which a higher-resolution activation map makes possible (Redmon & Farhadi, 2018).

2.3 Effective New Features in YOLOv4

The architecture of the v4 version is similar to that of v3, but it brings a lot of small ideas and innovations. These new ideas can basically be divided into two groups. The first group is the Bag of Freebies. Methods in this group do not result in additional computational costs, i.e., they do not increase runtime. For example, data augmentation is used in v4 to create new and varied training data. Another Bag of Freebies method is to modify the loss function. This is necessary because YOLO architectures generate far more bounding boxes than there are real objects to detect, so false negative detections should be punished to a much greater extent than false positive detections. There are many other features in this group as well (Bochkovskiy et al., 2020).

The second group is the Bag of Specials. With the methods belonging to this group, a significant improvement can be achieved at the price of a small increase in computational cost. This includes, for example, the use of the Mish activation and spatial pyramid pooling, but there are other features as well (Bochkovskiy et al., 2020). Mish (Misra, 2019), which improves the performance of the neural network architecture, is defined using the softplus function, a smooth approximation of the ReLU activation function:

$$\begin{aligned} f(x) = x \cdot \tanh (\hbox {softplus}(x)). \end{aligned}$$
(1)

It is a self-regularized, non-monotonic activation function and works better than Leaky ReLU, the activation function of YOLOv3. In YOLOv4, features are pooled and fixed-length outputs are generated by spatial pyramid pooling (He et al., 2015), a layer added on top of the last convolutional layer. As for the backbone, YOLOv4 can use the cross-stage partial (CSP) network, which enhances the learning capability of the architecture (Wang et al., 2020a).
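
A direct transcription of Eq. (1) might look as follows in Python; the numerically stable softplus form is standard, and the Leaky ReLU slope of 0.1 used for comparison is an assumption for illustration.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: ln(1 + e^x) = max(x, 0) + ln(1 + e^-|x|).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def mish(x):
    """Mish activation from Eq. (1): f(x) = x * tanh(softplus(x))."""
    return x * np.tanh(softplus(x))

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU, the YOLOv3 activation, for comparison."""
    return np.where(x >= 0, x, alpha * x)

x = np.linspace(-4, 4, 9)
print(np.round(mish(x), 3))        # smooth, with a non-monotonic dip near -1
print(np.round(leaky_relu(x), 3))  # piecewise linear
```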

2.4 The YOLOv5 Architecture by Ultralytics

The name of this architecture has caused a lot of controversy among those involved in object detection. The reason is, first of all, that YOLOv5 was released on May 27, 2020, by then-unknown authors, just over a month after YOLOv4 was released on April 23, 2020. Furthermore, it contained only minor changes compared to YOLOv4. Moreover, as it turned out later, it is developed by a startup called Ultralytics, founded by Glenn Jocher, who contributed the mosaic data augmentation to YOLOv4. That is why YOLOv5 does not have an original publication, only the Ultralytics GitHub repository (Ultralytics, 2020b). In this work, the YOLOv5 name is used for the algorithm developed by Jocher et al.

A number of papers have already compared v4 and v5. Ghimire et al. (2022) highlight that YOLOv5 is much easier to integrate into robotic systems because it is implemented entirely in PyTorch. Li et al. (2021) add that although YOLOv5 uses CSPDarknet as a backbone, PANet as a neck, and the head of YOLOv3 as a head, its activation function is a sigmoid function instead of Mish. Wang et al. (2022) conclude that the usage of the GIoU loss function effectively solves the problem of nonoverlapping bounding boxes. These are the main new features of YOLOv5; apart from these, it is largely similar to YOLOv4.
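
As an illustration of the property Wang et al. (2022) refer to, a minimal GIoU computation for axis-aligned boxes could look like the sketch below; the corner-format box representation and the example coordinates are assumptions for demonstration, not the YOLOv5 implementation.

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned (x1, y1, x2, y2) boxes.

    Unlike plain IoU, GIoU stays informative (it goes negative) even
    when the boxes do not overlap at all.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    # Smallest box enclosing both inputs.
    enclose = ((max(ax2, bx2) - min(ax1, bx1))
               * (max(ay2, by2) - min(ay1, by1)))
    return inter / union - (enclose - union) / enclose

print(giou((0, 0, 2, 2), (1, 0, 3, 2)))  # overlapping boxes: 0.333...
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: negative (-0.778)
# The GIoU loss is then 1 - giou(prediction, ground_truth).
```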

2.5 Hyperparameter Selection

The set of hyperparameters highlighted in Table 1 was chosen after extensive experimentation showing that these values consistently provided optimal results across the v3, v4, and v5 architectures. This uniformity in hyperparameter selection enables a fair comparison between the different versions of the model, ensuring that the observed performance differences are attributed to the architecture and not to the choice of hyperparameters.

Table 1 Hyperparameters

A suitable initial learning rate is crucial for efficient convergence, while the final learning rate helps to fine-tune the model. A larger learning rate may lead to unstable training, while a smaller one may result in slower convergence. The choice presented in Table 1 strikes a balance between these two extremes.

The momentum and weight decay values were chosen to balance stabilizing the training process and preventing overfitting. Higher momentum values could cause oscillations during training, while lower values may result in slow convergence. Larger weight decay values might lead to underfitting, while smaller values may result in overfitting.

The warmup hyperparameters help in gradually increasing the learning rate and momentum during the initial phase of training. This strategy avoids large updates that could negatively impact model convergence. The chosen values provide a suitable balance between effective warmup and overall training time.

Loss function gains weigh the importance of the different loss components. The chosen values of 5, 1, and 0.5 for the respective loss components reflect a specific balance between accurate bounding box predictions, correct class identification, and precise objectness scores, with the highest weight assigned to accurate bounding box predictions and the lowest weight assigned to objectness scores. Deviating from these values may lead to suboptimal performance in one or more of these aspects. More information about them can be found in Sect. 3.1. Positive weights balance the contributions of positive and negative examples in the loss function; it was found that equal weighting between positive and negative examples yields the best results. Threshold parameters control the assignment of ground-truth boxes to anchors during training, and the chosen values ensure that suitable anchors are assigned to each ground-truth box, leading to better predictions.

Data augmentation parameters were carefully chosen to provide sufficient data variability without generating unrealistic training samples. Extreme values in these parameters might result in poor generalization to real-world data.
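
Since Table 1 is not reproduced here, the following hypothetical Python dictionary only illustrates what such a hyperparameter set looks like, using key names in the style of an Ultralytics hyp file. Apart from the loss gains of 5, 1, and 0.5 named above, every value is a common default used as a placeholder, not necessarily the value used in this work.

```python
# Hypothetical hyperparameter set in the style of an Ultralytics hyp.yaml,
# written as a Python dict. Only the loss gains (5, 1, 0.5) come from the
# text; the remaining values are common defaults used as placeholders.
hyperparameters = {
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.1,              # final learning rate fraction for the scheduler
    "momentum": 0.937,       # SGD momentum
    "weight_decay": 0.0005,  # L2 regularization
    "warmup_epochs": 3.0,    # epochs of learning-rate/momentum warmup
    "warmup_momentum": 0.8,  # momentum at the start of warmup
    "box": 5.0,              # localization (box) loss gain, from the text
    "cls": 1.0,              # classification loss gain, from the text
    "obj": 0.5,              # objectness (confidence) loss gain, from the text
    "cls_pw": 1.0,           # positive weight of the classification BCE loss
    "obj_pw": 1.0,           # positive weight of the objectness BCE loss
    "anchor_t": 4.0,         # anchor-matching threshold
    "hsv_h": 0.015,          # HSV hue augmentation range
    "hsv_s": 0.7,            # HSV saturation augmentation range
    "hsv_v": 0.4,            # HSV value augmentation range
    "translate": 0.1,        # translation augmentation fraction
    "scale": 0.5,            # scaling augmentation range
    "fliplr": 0.5,           # horizontal flip probability
}
```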

In conclusion, the selected hyperparameters were found to provide the best results across the v3, v4, and v5 architectures through extensive experimentation. Changing these values may lead to suboptimal performance in various aspects of the model, such as convergence speed, stability during training, or the quality of the generated predictions.

2.6 Data Management

The dataset used is available in a GitHub repository provided by Tao et al. (2020). It is divided into two parts. The Normal\(\_\)Insulators part contains 600 Portable Network Graphics (PNG) pictures captured by UAVs. The PNG pictures of defective insulators are in the other part, called the Defective\(\_\)Insulators part; there are 248 of them. Since there are relatively few defective insulators, a data augmentation method is applied, and all 248 of these images are synthesized by the following method (Tao et al., 2020).

1. The first step is the segmentation of some of the defective insulators in the original pictures by TVSeg (Unger et al., 2008). This is how the first mask images are obtained.

2. The second step is the augmentation of the original images and their masks, which results in a large number of original-mask image pairs.

3. The following stage is the training of the U-Net (Ronneberger et al., 2015) with these pairs.

4. Then, the trained U-Net can segment the rest of the defective insulators.

5. Finally, the segmented parts can be attached to different backgrounds; a minimal sketch of this compositing step is given after the list.
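
The sketch below illustrates the compositing in step 5, assuming binary U-Net masks and same-size numpy arrays; the function name and the dummy shapes are illustrative assumptions, not the actual pipeline of Tao et al. (2020).

```python
import numpy as np

def composite(insulator_rgb, mask, background_rgb, top_left=(0, 0)):
    """Paste a segmented insulator onto a new background (step 5).

    `mask` is the segmentation output for `insulator_rgb`: 1 on insulator
    pixels, 0 elsewhere. Both are H x W (x 3) numpy arrays; the background
    must be at least as large as the segmented crop.
    """
    out = background_rgb.copy()
    h, w = mask.shape
    r, c = top_left
    region = out[r:r + h, c:c + w]
    # Keep background where the mask is 0, insulator pixels where it is 1.
    out[r:r + h, c:c + w] = np.where(mask[..., None] == 1,
                                     insulator_rgb, region)
    return out

insulator = np.full((40, 80, 3), 200, dtype=np.uint8)  # dummy insulator crop
mask = np.zeros((40, 80), dtype=np.uint8)
mask[10:30, 5:75] = 1
background = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy background
synthetic = composite(insulator, mask, background, top_left=(100, 200))
print(synthetic.shape, synthetic.max())  # (480, 640, 3) 200
```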

Both the Normal\(\_\)Insulators and Defective\(\_\)Insulators images are stored in two images subdirectories, and their annotations in two labels subdirectories. Only the insulator annotations are used in this work.

It is important to clarify that the dataset used in this paper is not only representative of real-world scenarios and contains a sufficient number of samples of both normal and defective insulators, but was also produced by a sound data augmentation method that synthesizes defective insulator images similar to real ones.

This realistic synthesis is due to the comprehensive data augmentation pipeline used by Tao et al. (2020). The pipeline includes techniques such as affine transformation, insulator segmentation and background fusion, Gaussian blur, and brightness transformation, which generate a diverse set of images of both normal and defective insulators, closely resembling real-world scenarios. Affine transformation accounts for different angles, distances, and positions of insulators, while insulator segmentation and background fusion diversify the dataset by fusing insulators with different backgrounds. Gaussian blur and brightness transformation account for variability in image clarity and brightness, simulating real-world imaging conditions.

Furthermore, the use of the U-Net convolutional neural network for insulator segmentation contributes to this realism by improving segmentation accuracy and completeness. The U-Net is trained on manually segmented images, resulting in segmentation results that closely resemble real insulators. Overall, the comprehensive data augmentation pipeline and the use of U-Net ensure that the synthesized images in the dataset closely resemble real defective insulators, making the dataset representative of real-world scenarios.

The original annotation format of the images is the Pascal visual object classes (VOC) format, so it has to be converted to the input format of the YOLO architecture (Ng, 2022). After these modifications, the training of the YOLO architectures can be performed.
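
The conversion itself is mechanical: YOLO expects one `class x_center y_center width height` line per object, with coordinates normalized by the image size, while VOC stores absolute corner coordinates in XML. A minimal sketch, assuming the standard VOC tag names and a hypothetical class mapping, is:

```python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, class_ids):
    """Convert one Pascal VOC annotation file to YOLO txt lines."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # Normalize center coordinates and sizes by the image dimensions.
        x_c = (xmin + xmax) / 2.0 / img_w
        y_c = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_ids[name]} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines

# Hypothetical usage; the file name and the class name "insulator" are assumed.
# print("\n".join(voc_to_yolo("000001.xml", {"insulator": 0})))
```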

3 Evaluation and Discussion of Results

This section presents the results of the architectures mentioned in Sect. 1. In the YOLO loss logic, three components are considered with different weights: the classification loss with weight 1, the localization loss with weight 5, and the confidence loss with weight 0.5. These are the default values in the YOLO architecture and can of course be modified if required. Accuracy can be measured with mAP. The following subsection describes these metrics; after that, results are presented.

3.1 Evaluation Metrics

This section describes the calculation of the classification loss, the localization loss, the confidence loss and the mAP.

3.1.1 Classification Loss

In the case of a detected object, the classification loss (Redmon et al., 2016) (\(L_\textrm{classification}\)) can be calculated as the sum of the squared errors of the class conditional probabilities over classes and cells:

$$\begin{aligned} {L}_{\text{ classification }} = \sum _{i=0}^{S^{2}} I_{i}^{\text{ object }} \sum _{\text{ class } \in \text{ classes }}\left( p_{i}(\text{ class})-{\hat{p}}_{i}(\text{ class})\right) ^{2} \end{aligned}$$
(2)

where

  • \(S^{2}\) is the number of grid cells;

  • \(I_{i}^{\text{ object }}=0\) if an object does not appear in the ith cell, otherwise 1;

  • \({\hat{p}}_{i}(\text{ class})\) is the predicted conditional class probability in the ith cell;

  • and \(p_{i}(\text{ class})\) is the conditional class probability in the ith cell.

3.1.2 Localization Loss

The localization loss (Redmon et al., 2016) (\(L_\textrm{localization}\)) is composed of the error of the boundary box size prediction and the error of the boundary box location prediction:

$$\begin{aligned} {L}_{\text{ localization }} = \sum _{i=0}^{S^{2}} \sum _{j=0}^{B} I_{i,j}^{\text{ object }} \left[ \left( x_{i}-\hat{x}_{i}\right) ^{2}+\left( y_{i}-\hat{y}_{i}\right) ^{2}+\left( \sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right) ^{2}+\left( \sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right) ^{2}\right] \end{aligned}$$
(3)

where

  • \(S^{2}\) is the number of grid cells;

  • \(I_{i,j}^{\text{ object }}=0\) if the jth boundary box in the ith cell is not responsible for the object detection, otherwise 1;

  • \({x}_{{i}}\) is the horizontal coordinate of the center of the boundary box;

  • \(\hat{{x}}_{{i}}\) is the prediction of the horizontal coordinate of the center of the boundary box;

  • \({y}_{{i}}\) is the vertical coordinate of the center of the boundary box;

  • \(\hat{{y}}_{{i}}\) is the prediction of the vertical coordinate of the center of the boundary box;

  • \({w}_{{i}}\) is the width of the boundary box;

  • \(\hat{{w}}_{{i}}\) is the predicted width of the boundary box;

  • \({h}_{{i}}\) is the height of the boundary box;

  • and \(\hat{{h}}_{{i}}\) is the predicted height of the boundary box.

The logic behind taking the square root of the width and height is to weight the relative errors of the smaller boundary boxes similarly to the relative errors of the large boundary boxes.

3.1.3 Confidence Loss

The confidence loss (Redmon et al., 2016) (\(L_\textrm{confidence}\)), which measures the “objectness” of the detected object in the boundary box, is calculated as follows:

$$\begin{aligned} {L}_{\text{ confidence }}=\sum _{i=0}^{S^{2}} \sum _{j=0}^{B} I_{i,j}^{\text{ object }}\left( C_{i,j}-{\hat{C}}_{i,j}\right) ^{2} \end{aligned}$$
(4)

where

  • \(S^{2}\) is the number of grid cells;

  • \(I_{i,j}^{\text{ object }}=0\) if the jth boundary box in the ith cell is not responsible for the object detection, otherwise 1;

  • \(C_{i,j}\) is the confidence score of the jth boundary box in the ith cell;

  • and \({\hat{C}}_{i,j}\) is the prediction of the confidence score of the jth boundary box in the ith cell.

3.1.4 Mean Average Precision

Accuracy is measured in this paper with two metrics that use \(\hbox {mAP}\) (Padilla et al., 2020). The first one is \(\hbox {mAP}_{50}\), which calculates the \(\hbox {mAP}\) at a 50% intersection over union (IoU) threshold in the following way:

$$\begin{aligned} \textrm{mAP}_{50}=\sum _{\text{ class } \in \text{ classes }} \int _{0}^{1} {p}_{50}^{\text{ class }}\left( {r}_{50}^{\text{ class }}\right) \textrm{d}{r}_{50}^{\text{ class }} \end{aligned}$$
(5)

where,

  • \({p}_{50}^{\text{ class } }\) is the precision of the class at 50% IoU;

  • and \({r}_{50}^{\text{ class } }\) is the recall of the class at 50% IoU.

Fig. 3 Results of the YOLOv3 architectures

The other one is \(\hbox {mAP}_{50:95}\), which averages the mAP values at IoU threshold levels from 50 to 95% with a step size of 5%:

$$\begin{aligned} \textrm{mAP}_{50:95}=\frac{1}{10} \sum _{\text{ perc }} \sum _{\text{ class }} \int _{0}^{1} {p}_{\text{ perc }}^{\text{ class }}\left( {r}_{\text{ perc }}^{\text{ class }}\right) \textrm{d}{r}_{\text{ perc }}^{\text{ class }} \end{aligned}$$
(6)

where,

  • \( {p}_{\text{ perc } }^{\text{ class } }\) is the precision of the class at perc % IoU;

  • and \({r}_{\text{ perc }}^{\text{ class }}\) is the recall of the class at perc % IoU.
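
A minimal sketch of how a single per-class AP value behind these metrics can be computed from ranked detections is given below; the rectangular approximation of the integral and the toy detection list are assumptions for illustration.

```python
import numpy as np

def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the precision-recall curve for one class at one IoU level.

    `is_true_positive[i]` says whether detection i (matched at, e.g., 50%
    IoU) is a true positive; detections are ranked by confidence score.
    The integral in Eqs. (5)-(6) is approximated by a rectangular sum.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / n_ground_truth
    # Rectangular approximation of the integral of p(r) dr.
    return float(np.sum(precision * np.diff(recall, prepend=0.0)))

# mAP50 then aggregates the per-class AP values at 50% IoU as in Eq. (5),
# and mAP50:95 averages over the ten thresholds 50%, 55%, ..., 95%.
scores = [0.9, 0.8, 0.7, 0.6]
tps = [True, True, False, True]
print(average_precision(scores, tps, n_ground_truth=4))  # 0.6875
```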

It is a widely used metric for object detection tasks and it is a good metric for this work for several reasons.

It provides a balanced evaluation of both precision and recall, which is crucial in insulator detection. The goal is to find as many defective insulators as possible (high recall), while minimizing false detections (high precision).

Furthermore, it calculates average precision (AP) by averaging precision values at different recall levels, making it threshold-independent. This allows for a more comprehensive assessment of the model’s performance across various confidence levels, which is useful in real-world applications where the optimal threshold may be unknown or vary depending on the context.

And, since mAP is widely used in object detection, it enables easy comparison of the YOLO-based insulator detection model’s performance with other models or methods in the field, which helps researchers and practitioners make informed decisions when selecting the best approach for their specific applications.

3.2 Results of Different Architectures

The results are highlighted in this subsection. All of them were obtained using an NVIDIA Quadro RTX 3000 GPU. They are expected to be reproducible and generalizable to other systems, since NVIDIA GPUs, including the Quadro RTX 3000, share a consistent compute architecture, and popular deep learning frameworks like TensorFlow and PyTorch are designed to provide consistent results across different hardware configurations.

By considering both training and validation losses for classification, localization, and confidence, it is possible to assess the model’s performance during training and its ability to generalize to unseen data. This helps identify potential issues, such as overfitting or underfitting, which could impact the model’s real-world performance.

3.2.1 Results of the YOLOv3 Architectures

Three subtypes of YOLOv3 architectures are evaluated in this paper in terms of insulator detection. These are the YOLOv3, the YOLOv3-tiny and the YOLOv3-spp (Ultralytics, 2020a).

Fig. 4 Results of the YOLOv4 architectures

The YOLOv3-tiny has much fewer layers than YOLOv3 and the difference between them in the number of parameters is even greater. This is because the simplified YOLOv3-tiny architecture, that requires less computing performance, can even run on a smartphone (Nakahara et al., 2020). The weakness of this architecture is usually the lower accuracy, as it is composed of 48 layers instead of 261 and consists of 8,669,002 parameters instead of 61,502,815.

In the case of YOLOv3-spp, the SPP stands for spatial pyramid pooling. SPP is basically one of the Bag of Specials of YOLOv4 but can also be used for YOLOv3. It results in 269 layers and 62,551,903 parameters. The results of these architectures are shown in Fig. 3.

It is noticeable that the two bigger architectures (YOLOv3 and YOLOv3-spp) outperform the smaller one. The classification loss at the end of the 50 epochs is more or less the same in the three cases, but remarkable differences can be observed in the localization and confidence losses, of which the difference in confidence loss is particularly striking. This means that even the tiny architecture can recognize whether a detected long rod insulator is defective or non-defective; however, the exact position of the insulator is predicted less well by the simpler and much more uncertain solution. It is a general phenomenon that there is a greater difference between validation values than between training values. There is no significant difference between the traditional normal-size method and the one using SPP, but the latter is somewhat more effective in terms of losses.

The latter difference is not observed for precision, recall, and the mAPs, but the tiny method performs worse in these metrics as well. The \({\hbox {mAP}}_{50}\) values achieved are 98.8% and 98.9%, while the highest \({\hbox {mAP}}_{50:95}\) values are 90.5% and 91.0% for YOLOv3 and YOLOv3-spp, respectively. For the tiny version, these metrics are 95.1% and 64.3%. However, in terms of speed, the tiny version performs better: the achievable FPS values are 56.8, 56.2, and 222.2, respectively.

3.2.2 Results of the YOLOv4 Architectures

Fig. 5 Results of the YOLOv5 architectures

Fig. 6 Results of the YOLOv5 architectures with extra output layers

This paper summarizes the running results of three different YOLOv4 architectures. These are the YOLOv4, the YOLOv4-tiny and the YOLOv4-csp (Ultralytics, 2020c).

The difference between v4 and v4-tiny is similar to that between v3 and v3-tiny. The original YOLOv4 consists of 488 layers and 63,943,071 parameters. In contrast, YOLOv4-tiny consists of only 99 layers and 5,876,426 parameters. As a result, like YOLOv3-tiny, this architecture has a high FPS rate. The YOLOv4-csp version has a few more layers because of the special backbone, but despite its 514 layers, it has only 52,501,407 parameters, fewer than the original v4 architecture. The results of these architectures are shown in Fig. 4.

It is easy to see that the least accurate detections belong to the tiny version in this case as well. However, the difference is not as significant as for the v3 versions. The classification loss and the localization loss during training are close to the values of the larger architectures, but in validation the classification loss differs, and the difference in confidence loss is significant in both training and validation.

The evaluations are very similar for precision, recall, and the mAPs. The best precision, recall, \({\hbox {mAP}}_{50}\), and \({\hbox {mAP}}_{50:95}\) values for YOLOv4, YOLOv4-csp, and YOLOv4-tiny are 67.6–63.0–58.1%, 98.8–99.1–99.1%, 97.8–97.4–97.1%, and 79.1–80.0–79.1%, respectively. This is very beneficial for YOLOv4-tiny, as it means that significant speed improvements can be achieved with little accuracy degradation: the FPS values for the three architectures are 51.5, 76.3, and 270.2.

3.2.3 Results of the YOLOv5 Architectures

The tested YOLOv5 algorithms are YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, YOLOv5n6, YOLOv5s6, YOLOv5m6, and YOLOv5l6 (Ultralytics, 2020b). The letters n, s, m, l, and x stand for the sizes: nano, small, medium, large, and extra-large. The number 6 indicates preparation for higher resolution: architectures marked with this number include extra output layers for better results.

These architectures justify the “the bigger, the better” principle. Training classification loss, validation classification loss, training localization loss, validation localization loss, training confidence loss, validation confidence loss, precision, recall, and the mAPs are all the more favorable the more complex the architecture used. The best loss values for the v5 architectures are generally two times smaller than the best loss values for the other architectures, and the decrease in losses is more monotonic than in the other cases.

The last four metrics vary from 96.9% to 99.2%, from 98.2% to 99.4%, from 98.1% to 99.0%, and from 83.0% to 92.3%. In these cases, size is the most significant influencing factor as well. However, as before, greater robustness means longer runtimes. Versions n, s, m, l, x, n6, s6, m6, and l6 have 213, 224, 316, 400, 484, 280, 280, 378, and 476 layers, respectively, and their parameter counts are 1,761,871, 7,056,607, 21,472,671, 47,370,047, 88,397,343, 3,089,188, 12,312,052, 35,254,692, and 76,126,356. These solutions have FPS values of 212.8, 151.5, 84, 48.8, 27.1, 175.4, 137, 70.9, and 44.4. It is worth emphasizing that for these methods, the correlation between the average image processing time (AIPT = 1/FPS) and the number of layers is 0.917, and the correlation between the parameter count and the AIPT is 0.966.
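
These correlations can be checked directly from the numbers quoted above; the following snippet simply restates them in numpy, and the printed values should match the reported 0.917 and 0.966.

```python
import numpy as np

# Figures for the nine YOLOv5 variants quoted above (n, s, m, l, x,
# n6, s6, m6, l6): layer counts, parameter counts, and FPS.
layers = np.array([213, 224, 316, 400, 484, 280, 280, 378, 476])
params = np.array([1761871, 7056607, 21472671, 47370047, 88397343,
                   3089188, 12312052, 35254692, 76126356])
fps = np.array([212.8, 151.5, 84.0, 48.8, 27.1, 175.4, 137.0, 70.9, 44.4])

aipt = 1.0 / fps  # average image processing time
print(np.corrcoef(layers, aipt)[0, 1])  # ~0.917
print(np.corrcoef(params, aipt)[0, 1])  # ~0.966
```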

3.2.4 Same Size Architectures

Finally, it is worth comparing the subtypes of the different generations (v3, v4, v5) studied so far within the same size ranges. The v3-tiny, v4-tiny, v5n, v5s, v5n6, and v5s6 versions are similar; these architectures are small enough to reach 100 FPS.

Fig. 7 Comparison of smaller architectures

Fig. 8 Comparison of normal size architectures

Their comparison is shown in Fig. 7. The same comparison is shown in Fig. 8 for the normal-sized original v3 and v4 and the similar v5 architectures.

It is obvious that the v5 architectures outperform v3 and v4, although it should be noted that they have slightly lower FPS rates. The v5n, v5s, v5n6, and v5s6 architectures produced very similar results, so it may be advisable to choose the fastest of them. The situation is similar for the normal size, where v5m may be the best choice. An interesting phenomenon is that the v3 implementation of Ultralytics (Ultralytics, 2020a) is capable of better performance than the v4 implementation of Wang et al. (2020a); however, the v4 architectures presented here can typically achieve higher FPS rates.

4 Summary

A benchmarking of YOLOv3, YOLOv4, and YOLOv5 was presented in this paper in terms of the detection of defective and normal long rod insulators. Fifteen subtypes of YOLO architectures were evaluated on a freely available dataset (Tao et al., 2020) based on the implementations (Ultralytics, 2020a, b, c), and it has been shown that, in general, subtypes of the YOLOv5 architecture are best suited for long rod insulator object detection.

Table 2 Best results of the different architectures

In comparing the different YOLO generations and the small, normal-sized, and big architectures for the insulator detection task, their performance is evaluated based on several indicators. These include the classification loss, which measures the accuracy of predicting the correct insulator class; the localization loss, which assesses the accuracy of bounding box predictions; and the confidence loss, which indicates the model’s ability to correctly assign confidence scores to insulator predictions. Additionally, precision, recall, and \({\hbox {mAP}}\) are used to provide a more comprehensive picture of model performance. Precision measures the proportion of true positive predictions out of all positive predictions, while recall calculates the proportion of true positives out of all actual positive instances. Typically, normal-sized and big architectures, such as the YOLOv3 and YOLOv4 series, exhibit better accuracy, demonstrated by their lower classification, localization, and confidence losses, as well as higher precision, recall, and \(\hbox {mAP}\) values. The increased accuracy can be attributed to the larger capacity and more complex structure of these models, which allows them to learn more robust features and generalize better to various insulator detection scenarios. However, small architectures like YOLOv3-tiny and YOLOv4-tiny have their own advantages: while they may not achieve the same level of accuracy as their larger counterparts, they boast significantly faster processing speeds.

The extensive architecture comparison distinguishes this paper from other studies in the field, which have typically focused on a single YOLO version or a specific improvement. The best results obtained in this work showcase a range of \(\hbox {mAP}_{50}\) values from 98.1% to 99% across the various YOLO architectures. In comparison, other studies have reported \(\hbox {mAP}_{50}\) values ranging from around 93.4% to 95.5% for their specific YOLO-based methods. This highlights the strong performance achieved in this paper, as well as the importance of considering different YOLO architectures to identify the most suitable option for insulator detection tasks. However, it is important to note that the better performance observed in this work may be influenced by the different datasets used across studies. Variations in the composition, quality, and difficulty of the datasets can affect the evaluation results, making direct comparisons between studies challenging. Nonetheless, this paper’s comprehensive examination of YOLO architectures provides key insights for the field.

The results of this study indicate that YOLOv5 architectures are better suited for long rod insulator object detection tasks, which can be beneficial for industries that rely on insulator health monitoring and maintenance. By leveraging the strengths of these models, organizations can enhance their monitoring systems, potentially reducing the risk of insulator failures and improving overall grid reliability. It is also worth noting that the processing speed of smaller architectures makes them suitable for applications with limited computational resources or those requiring real-time insulator detection. It is important to consider the trade-off between accuracy and processing speed when selecting an architecture for a specific application, taking into account the requirements and constraints of the particular use case.

The primary strength of this study lies in its comprehensive evaluation of 15 different YOLO architectures, providing valuable insights into their performance for insulator detection tasks. This is even more significant in light of the fact that a joint comparison of YOLOv3, YOLOv4, and YOLOv5 has not yet been published in this area. However, the study is limited by its focus on a single dataset, although the comprehensive data augmentation pipeline ensures that it represents the diversity of insulator detection scenarios in real-world applications.

In summary, this research offers a thorough comparison of YOLOv3, YOLOv4, and YOLOv5 models for the detection of defective and normal long rod insulators, with YOLOv5 architectures showing superior performance. By examining 15 distinct YOLO subtypes, the study emphasizes the necessity of choosing the appropriate architecture based on the desired balance between precision and computational efficiency. Future investigations will aim to expand this methodology to include transfer learning and assess its relevance to additional situations. This could lead to the creation of more precise and effective insulator health monitoring systems, enhancing power grid stability and mitigating the likelihood of insulator failure. The findings of this research serve as a valuable resource for academics and industry professionals involved in insulator detection and surveillance projects. A complete evaluation of all architectures is shown in Table 2.