1 Introduction

In the process of manufacturing, metal products are vulnerable to defects such as surface oxidation, fractures, depressions, and scratches. These flaws have detrimental effects on the product’s quality and durability, leading to potential financial loss. The need to ensure manufacturing quality has made it imperative to detect flaws on metal surfaces. As a result, there has been a surge of interest among researchers in developing rapid and precise inspection technologies.

Researchers have employed vision-based defect detection methods since the 1980s. A typical traditional detection pipeline includes image preprocessing, Region of Interest (ROI) detection, image segmentation, feature extraction, and defect classification [1]. Traditional algorithms categorize defects using manually extracted features, such as the Histogram of Oriented Gradients (HOG) [2], Local Binary Patterns (LBP) [3], the Fourier transform [4], and Gabor filters [5], combined with classifiers such as the Support Vector Machine (SVM) [6] and Random Forest [7]. However, these algorithms have limitations: hand-crafted features cannot adequately express information under complex circumstances, detection performance depends heavily on human-designed feature extractors, and problems such as heavy computation and the varied locations and sizes of defects remain difficult to solve.

Since 2012, Convolutional Neural Networks have become increasingly important in computer vision (CV) tasks [8]. For defect classification and location regression, two-stage deep-learning algorithms represented by Faster R-CNN [9] generate candidate boxes with a Region Proposal Network (RPN) to predict regional images, and features of the merged Regions of Interest (ROI) are finally extracted. He et al. [10] proposed a strip surface defect detection network based on an improved Faster R-CNN, in which multiple hierarchical features were combined into one feature in the ResNet backbone, making good use of global semantic information and reducing feature loss. Xie et al. [11] proposed an improved Faster R-CNN combined with the feature pyramid network (FPN) and ROI-Align algorithms, which reduced the information loss of small objects and the errors caused by quantization, obtaining 98.5% mean Average Precision (mAP) on the PCB dataset. In comparison to two-stage algorithms, one-stage detection algorithms such as You Only Look Once (YOLO) [12] and SSD (Single Shot Detector) [13] integrate classification and positioning through regression, directly computing classification results and position coordinates for multiple targets. Thanks to their lower complexity and faster detection, one-stage detectors have received much attention for real-time detection needs. For instance, Li et al. [14] proposed a fully convolutional YOLO detection network that improved the prediction of the locations and sizes of surface defects on steel strips, supporting real-time surface defect detection. Xie et al. [15] utilized six feature maps to generate predicted classification results and integrated the Fully Convolutional Squeeze-and-Excitation (FCSE) block into SSD to enhance detection accuracy on the TILDA dataset.

While Convolutional Neural Networks (CNNs) are effective at extracting features for defect detection tasks, their performance can degrade on smaller targets as network depth increases. As the depth of a CNN grows, the receptive field of the convolutional filters also increases, making it challenging to capture intricate details and extract discriminative features from smaller defects or targets. Larger receptive fields tend to capture more contextual information, which benefits the understanding of larger objects but may cause smaller objects or defects to be overlooked or misrepresented. In addition, CNNs face difficulties in accurately detecting and classifying surface defects due to background interference and the varied shapes of defects.

With the emergence of the Vision Transformer (ViT) [16], transformer-based models have become a hot topic in CV. ViT divides the original image into several patches, feeds the patch embeddings into the transformer encoder module, and finally classifies the image with a fully connected layer. DETR [17] is an innovative network that integrates CNN and Transformer architectures, treating object detection as an image sequence prediction task. Based on the self-attention mechanism, the transformer encoder extracts feature information and incorporates position information; the encoded information is then sent to the decoder to obtain global information during feature extraction and fusion. Compared to CNNs, which are limited by the receptive fields of convolution kernels, the transformer performs well at obtaining global context by collecting and learning relative information from adjacent regions.

Based on the above analysis, this paper introduces a novel approach for defect detection using an optimized version of YOLOv5. The improved model is designed to achieve accurate defect detection on steel surfaces while simultaneously ensuring a high detection speed. The main contributions of the study are as follows:

  1. BottleneckCSP architectures and standard convolution layers are replaced with reparameterized CSP (Cross Stage Partial) structures (Rep-CSP) and reparameterized depthwise separable convolutions (Rep-DSC) to lighten the backbone and enhance its feature-extraction ability.

  2. Combining CNN and transformer, the contextual transformer (CoT) module is utilized to enhance feature expressions from different layers and transfer effective information to the neck part.

  3. A simplified generalized FPN, CoT-GFPN, is designed to fuse features from different scales and enhance the network's generalization ability.

  4. Four detection heads are set to predict defects of different sizes, and the k-means clustering algorithm is utilized to generate prior bounding boxes. Moreover, model convergence and detection precision are further improved by using Focal-EIOU as the loss function.

The remainder of this paper is organized as follows: Section 2 presents related work. Section 3 describes the structure of CRGF-YOLO and the proposed methods. Section 4 presents experiments on the NEU-DET dataset and verifies the effectiveness of the improvements. Section 5 discusses limitations and future research. Finally, Section 6 summarizes the conclusions.

2 Related Work

2.1 YOLOv5

As a one-stage detector, YOLO formulates object detection as a regression problem: a convolutional neural network predicts the class probabilities of detected objects and regresses the coordinates of their bounding boxes. YOLOv5 is the fifth version of the YOLO series and has received extensive attention for its high accuracy and real-time detection capability. The network has three parts: backbone, neck, and head. The backbone is built on the CSPDarknet53 structure, including Focus, BottleneckCSP, and Spatial Pyramid Pooling Fusion (SPPF) modules to extract features, and stacks several residual architectures to avoid the vanishing gradient problem. The FPN facilitates the downward transmission of high-level semantic features. In addition, the Path Aggregation Network (PANet) incorporates an extra bottom-up pathway to aggregate feature maps, conveying both high-level semantic information from deeper layers and low-level locational information from shallower layers into the neck part. YOLOv5 utilizes the k-means clustering algorithm along with a genetic algorithm to determine the sizes and aspect ratios of anchor boxes, which serve as prior bounding boxes for object detection. The k-means algorithm groups the ground truth bounding box dimensions into k clusters, with each cluster represented by a single anchor box dimension, and the genetic algorithm further optimizes these dimensions to fit the specific dataset, as sketched in the code below. The loss function is a critical metric for assessing model performance, as it measures the deviation between predicted and actual values; a well-designed loss function is essential for accurate predictions.
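To make the anchor-generation step concrete, the following minimal sketch clusters ground-truth (width, height) pairs with plain Euclidean k-means. The function name and random data are illustrative; YOLOv5's actual implementation additionally clusters on an IoU-based distance and refines the result with a genetic algorithm.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box (width, height) pairs into k anchor shapes.

    A simplified sketch: YOLOv5 itself uses an IoU-based distance and then
    refines the anchors with a genetic algorithm; plain Euclidean k-means
    is used here for clarity.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]  # initial anchors
    for _ in range(iters):
        # Assign each box to its nearest anchor (Euclidean distance).
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each anchor to the mean (w, h) of its assigned boxes.
        for i in range(k):
            if (labels == i).any():
                centers[i] = wh[labels == i].mean(0)
    return centers[np.argsort(centers.prod(1))]  # sort anchors by area

# Illustrative usage with random box dimensions (in pixels).
boxes_wh = np.abs(np.random.randn(500, 2)) * 60 + 20
print(kmeans_anchors(boxes_wh, k=9))
```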

2.2 Self-attention Mechanism

The self-attention mechanism has been extensively utilized in natural language processing (NLP); it evaluates the similarity of input elements and learns the interdependence between them by means of selective weighted calculations. In CV tasks, the self-attention mechanism is utilized to model the relationships of feature vectors across various spatial regions, enabling the extraction of global connections within images. By employing the global multi-head self-attention mechanism, Wang et al. [18] developed a non-local operation in which a Gaussian transformation is introduced to assess the global similarity of information and generate corresponding weights, improving network performance. When processing images, the self-attention mechanism expands the picture pixels into sequences, and computation and memory usage increase sharply when the sequence length is large. To solve this problem, researchers have improved transformer models with local attention mechanisms that concentrate on a portion of the input sequence to decrease computation and memory requirements.

With the development of deep learning, the combination of CNN and Transformer has emerged as a prominent trend. Bello et al. [19] enhanced network performance by incorporating self-attention mechanisms into CNNs. BoT (Bottleneck Transformers) [20] improves upon the ViT model by utilizing a bottleneck structure and replacing the 3 × 3 convolutional layers in ResNet with the self-attention mechanism of transformers, resulting in a more efficient multi-branch model design. To use contextual information efficiently, Contextual Transformer Networks (CoT) [21] combine the self-attention mechanism with the idea of non-local blocks, enabling adaptive modeling of contextual information from different regions and enhancing the feature expressions of the model. In general, the fusion of CNNs and Transformers can strengthen the convolutional operation of CNNs: by integrating the positional encoding of feature information, the expression of semantic information is enhanced. This innovative combination provides an effective solution for the study of deep networks.

2.3 Multi-Scale Fusion Network

During convolution operations, the features extracted from different layers have varying scales, with high-level layers containing more semantic information and low-level layers containing more geometric information. To enhance target detection ability, multi-scale features from different layers can be fused. The FPN exhibits a robust capability of balancing feature maps from diverse layers through its up-sampling and down-sampling operations. Based on FPN, PANet adds route enhancement and adaptive feature pooling to shorten information paths and ease information propagation. Moreover, improved FPN models such as PRB-FPN [22], NAS-FPN [23], and DFPN [24] have been developed to improve multi-scale target detection. EfficientDet [25] proposed a more efficient multi-scale fusion method, BiFPN, which integrates bidirectional cross-layer connections and fast normalized fusion; by introducing weights, feature information at different scales is well balanced. However, previous networks focus only on feature fusion and lack internal information connections. GFPN [26] developed a novel pathway fusion approach that incorporates skip-layer connections and cross-scale connections. The skip connections have a shorter distance between feature layers, allowing for more efficient information flow. Two connections, dense-link and log2n-link, are introduced to mitigate gradient vanishing in the network and provide more effective transmission from previous nodes to subsequent ones. In addition, a new cross-scale fusion named queen-fusion is proposed to enhance feature interactions within the same level and between neighboring levels. The GFPN model is able to extract rich, high-quality multi-scale features from various levels of feature maps, resulting in superior performance compared to other enhanced FPN models.

2.4 Structural Reparameterization

Multi-branch network architectures such as ResNet and Inception significantly improve performance at the training stage. However, their complex structures increase inference time and memory consumption, which is unfriendly to real-time detection. Structural reparameterization is a technique that decouples the training-time and inference-time networks, yielding a network with both high performance and low complexity: multi-branch structures are constructed in the training phase, and their parameters are then converted into an equivalent inference-time plain architecture with another set of parameters. RepVGG, proposed by Ding et al. [27], realizes this idea. Unlike Inception models with their elaborately designed multi-branch architectures, RepVGG utilizes a straightforward topology and extensively employs 3 × 3 convolution operations, resulting in a simple, fast, and flexible model. At the training stage, RepVGG is a multi-branch model that constructs the information flow of shortcut branches, and the weights of the identity branch and the 1 × 1 branch can be folded into the 3 × 3 convolution layer. At inference time, the reparameterized network outperforms plain models while maintaining a lower-complexity structure than multi-branch architectures.

3 Methods

3.1 CRGF-YOLO Structure

For industrial surface defect detection, there are two challenges: (1) sample sizes vary widely in complex circumstances, with no obvious unified division of defect boundaries; (2) due to uneven textures of samples under weak imaging conditions, the intra-class difference of defect images changes sharply. Although the convolution operation has limitations in obtaining contextual information of targets, the transformer's self-attention mechanism can improve contextual semantic information and extract more spatial information, thereby enhancing target detection ability. In addition, GFPN has an efficient multi-scale feature fusion ability that combines feature information from both high-level and low-level layers, which performs well in defect detection.

Based on the above, this paper proposes the CRGF-YOLO (CoT-Rep-GFPN) network, whose structure is shown in Fig. 1.

Fig. 1 The structure of CRGF-YOLO

In Fig. 1, there are three parts: (1) The backbone part is a top-down information flow made of reparameterized depthwise separable convolution modules (Rep-DSC) and reparameterized BottleneckCSP structures (Rep-CSP). The Spatial Pyramid Pooling Fusion (SPPF) module from YOLOv5 pools the input feature map and combines the results of different scales into a final feature vector, allowing richer feature information to be extracted at different scales. In addition, the CoT module with its self-attention mechanism is utilized to enhance semantic information expression and extract effective contextual information from different feature layers. (2) The neck part consists of the simplified GFPN network, which fuses information from same-level and adjacent layers to upgrade feature expressions. Up-sampling modules and Rep-DSC modules adjust channel numbers through up-sampling and down-sampling operations. (3) The prediction part has four prediction heads, which output feature maps of different sizes. "Conv" modules change the number of channels to output detection information. The modules proposed above are depicted in Fig. 2.

Fig. 2 Modules in CRGF-YOLO

3.2 Rep-DSC and Rep-CSP

When delving into convolutional operations, it is important to consider the potential issues of gradient vanishing and gradient exploding that hinder effective training. Multi-branch structures converge well and create nonlinear, diversified connections, which can significantly improve performance. However, a multi-branch structure cannot effectively achieve parallel acceleration and greatly increases memory consumption. For the purpose of building a network with high performance and low complexity, structural reparameterization is introduced: during the training phase, a multi-branch structure is utilized to enhance the capacity of the model, while at the inference phase the multi-branch architecture is consolidated into a single-path structure. This allows for the extraction of more feature information without increasing the computational load required for inference.

In Fig. 2, the 3 × 3 depthwise convolution with stride two and the 1 × 1 pointwise convolution are reparameterized in the Rep-DSC module, and the SiLU (Sigmoid Linear Unit) activation function is used after the Conv-BN (Batch Normalization) modules. The Rep-DSC module offers the advantage of the small parameter count of depthwise separable convolution [28], while the multi-branch fusion architecture diversifies the network; a sketch of the plain depthwise separable block follows below. The Rep-CSP module can be split into two branches: one has the benefits of a multi-branch structure at training time, as well as a stacking strategy of bottlenecks with residual architecture to enhance performance; the other branch is concatenated with it. By utilizing structural reparameterization, the inference-phase network obtains richer features than the BottleneckCSP module.
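As a reference point for the Rep-DSC design, the sketch below shows the plain (inference-time) depthwise separable block implied by the text: a 3 × 3 depthwise convolution with stride two followed by a 1 × 1 pointwise convolution, each with BN and SiLU. The module name and channel counts are illustrative, and the training-time reparameterized branches are omitted.

```python
import torch
import torch.nn as nn

class DSCBlock(nn.Module):
    """Depthwise separable convolution block, a sketch of the plain
    (already-reparameterized) form of Rep-DSC described in the text:
    3x3 depthwise conv with stride 2, then 1x1 pointwise conv, each
    followed by BN and SiLU. The training-time parallel branches that
    get fused at inference are omitted here."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1,
                      groups=c_in, bias=False),   # one filter per channel
            nn.BatchNorm2d(c_in), nn.SiLU())
        self.pointwise = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, bias=False),  # mixes channels
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 80, 80)
print(DSCBlock(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```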

Take the structural reparameterization of the CBS module as an example. At the training stage, the standard 3 × 3 convolution is decoupled into a multi-branch architecture composed of a 3 × 3 branch, a 1 × 1 branch, and an identity branch. Combining bias-free convolution with BN helps improve the convergence speed of the network and control overfitting. The kernel of each branch is then converted into a 3 × 3 size, and finally the branches are fused. The process of structural reparameterization is demonstrated in Fig. 3.

Fig. 3 The process of structural reparameterization

The convolution operation is formulated as follows (\(w_{k}\) denotes the weight coefficient):

$$Conv(\boldsymbol{X}) = \boldsymbol{X}^{\prime} = w_{k} * \boldsymbol{X}.$$
(1)

BN includes five parameters: the mean (\(\mu\)), variance (\(\sigma\)), weight (\(w_{b}\)), bias (\(b\)), and a small constant (\(\varepsilon\)) that prevents the denominator from reaching zero:

$$BN(\boldsymbol{X}^{\prime}) = w_{b} \times \frac{\boldsymbol{X}^{\prime} - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + b.$$
(2)

The combination of Conv and BN is:

$$BN(Conv(\boldsymbol{X})) = \frac{w_{b} \times w_{k}}{\sqrt{\sigma^{2} + \varepsilon}} * \boldsymbol{X} + b - w_{b} \times \frac{\mu}{\sqrt{\sigma^{2} + \varepsilon}}.$$
(3)

The Conv-BN module not only accelerates training, but also enables additivity across the multi-branch structure. Each convolution kernel is then adjusted to a 3 × 3 size; in Fig. 3, the white squares represent zero elements in the kernels. A 1 × 1 kernel is padded with a ring of zeros to enlarge it to 3 × 3. Because the identity branch contains no convolution, an equivalent 3 × 3 kernel is constructed so that the input passes through unchanged.

The structure consists of the 3 × 3 branch, the 1 × 1 branch, and the identity branch, corresponding to the transfer functions \(f(x)\), \(g(x)\) and \(x\). After adjusting the branch dimensions, the final output of the multi-branch structure is:

$$F(x) = f(x) + g(x) + x.$$
(4)
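The branch-fusion arithmetic of Eqs. (1)–(4) can be verified numerically. The sketch below folds BN into a bias-free convolution per Eq. (3), pads a 1 × 1 kernel to 3 × 3, and checks that the fused single kernel reproduces the two-branch output. The identity branch is omitted for brevity, and all names and channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold BN into a bias-free conv per Eq. (3):
    w' = w_b * w_k / sqrt(var + eps),  b' = b - w_b * mu / sqrt(var + eps)."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                       # w_b / sqrt(var + eps)
    w = conv.weight * scale.reshape(-1, 1, 1, 1)  # scale each output channel
    b = bn.bias - bn.running_mean * scale
    return w, b

def pad_1x1_to_3x3(w1x1):
    """Surround a 1x1 kernel with zeros so it becomes an equivalent 3x3."""
    return F.pad(w1x1, [1, 1, 1, 1])

# Sketch of merging the 3x3 and padded 1x1 branches into one kernel.
conv3, bn3 = nn.Conv2d(16, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16)
conv1, bn1 = nn.Conv2d(16, 16, 1, bias=False), nn.BatchNorm2d(16)
for bn in (bn3, bn1):
    bn.eval()  # inference mode: BN uses its running statistics
w3, b3 = fuse_conv_bn(conv3, bn3)
w1, b1 = fuse_conv_bn(conv1, bn1)
w, b = w3 + pad_1x1_to_3x3(w1), b3 + b1  # additivity of the branches

x = torch.randn(1, 16, 32, 32)
ref = bn3(conv3(x)) + bn1(conv1(x))        # multi-branch (training) form
out = F.conv2d(x, w, b, padding=1)         # fused single 3x3 (inference)
print(torch.allclose(ref, out, atol=1e-5)) # True
```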

3.3 CoT Module

CNN models effectively combine visual features and semantic information: low-level features such as edges, textures, and corners provide feature information, while high-level semantic information is interpreted in terms of these low-level features. Transformers, in turn, are better suited for processing high-level semantic information. The CoT module enhances feature extraction by combining the convolution operation with the transformer's self-attention mechanism, integrating contextual information between adjacent keys through self-attention learning.

In CV tasks, the transformer structure comprises three main components: image patch embedding, the encoder, and a Multi-Layer Perceptron (MLP). The image patch embedding extracts image features and transforms each image patch into a tensor that the encoder can process for target detection. The encoder encodes the input tensor and mines global information using Multi-Head Self-Attention (MSA), while the MLP maps features to higher dimensions. MSA obtains the weighted attention distribution between features from the query matrix Q and the key matrix K, and the output is obtained by weighting the value matrix V with the softmax-normalized attention scores, as sketched below.
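As a minimal illustration of the Q/K/V computation just described, the sketch below implements single-head scaled dot-product attention on a sequence of patch features; the shapes and projection matrices are illustrative.

```python
import torch

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V.

    x:  (N, d) sequence of N token/patch features
    Wq, Wk, Wv: (d, d) projection matrices (illustrative, single head)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / K.shape[-1] ** 0.5   # pairwise similarity of tokens
    A = scores.softmax(dim=-1)              # attention distribution
    return A @ V                            # weighted sum of values

d = 64
x = torch.randn(100, d)                     # e.g. 100 image patches
W = [torch.randn(d, d) / d ** 0.5 for _ in range(3)]
print(self_attention(x, *W).shape)          # torch.Size([100, 64])
```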

As shown in Fig. 4, the input feature map is X with dimensions H × W × C (H: height, W: width, C: channels). The CoT module builds on the MSA mechanism: through a 3 × 3 convolution, all adjacent keys (K) within the kernel are contextually encoded to obtain the static contextual information K1. K1 is then concatenated with Q, and the attention matrix A is obtained through two consecutive 1 × 1 convolutions.

Fig. 4 The structure of the CoT module

The matrix A is expressed as:

$$A = [K_{1}, Q] \cdot W_{\alpha} \cdot W_{\beta}.$$
(5)

A is used to aggregate V (the self-attention mechanism), and the dynamic contextual information K2 is obtained after concatenating the aggregated feature maps.

$$K_{2} = V \circledast A.$$
(6)

Finally, the extraction of feature information is enhanced by fusing the static contextual information K1 and the dynamic contextual information K2. In this paper, we find that placing the CoT module with its self-attention mechanism after the Rep-CSP module enhances feature expression and transmits effective information at different levels. To control the number of channels and create a deeper nonlinear level, a 1 × 1 CBS module is embedded before the CoT module. In Sect. 4, comparative experiments are conducted to evaluate the effectiveness of the CoT module.
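The CoT computation of Fig. 4 and Eqs. (5)–(6) can be sketched roughly as follows. Kernel sizes follow the text, but the grouped local attention of the original CoT is simplified here to a channel-softmax element-wise aggregation, so this is an approximation rather than the exact module.

```python
import torch
import torch.nn as nn

class CoTSketch(nn.Module):
    """Simplified Contextual Transformer block (Eqs. (5)-(6)).

    K1 = 3x3 conv over keys (static context); A = two 1x1 convs over
    [K1, Q]; K2 = V aggregated with A (dynamic context, simplified here
    to an element-wise product instead of CoT's grouped local attention);
    output = K1 + K2."""
    def __init__(self, c):
        super().__init__()
        self.key = nn.Conv2d(c, c, 3, padding=1, groups=4, bias=False)  # K1
        self.value = nn.Conv2d(c, c, 1, bias=False)                     # V
        self.attn = nn.Sequential(              # W_alpha, W_beta of Eq. (5)
            nn.Conv2d(2 * c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 1))

    def forward(self, x):
        k1 = self.key(x)                          # static context K1
        v = self.value(x)
        a = self.attn(torch.cat([k1, x], dim=1))  # A from [K1, Q], with Q = x
        k2 = v * a.softmax(dim=1)                 # dynamic context K2 (simplified)
        return k1 + k2                            # fuse static + dynamic

x = torch.randn(1, 128, 40, 40)
print(CoTSketch(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```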

3.4 CoT-GFPN

The FPN is designed to effectively merge information from multiple feature layers in the backbone network, which has been proved to be effective with limited parameters. A conventional FPN model typically combines multi-scale features in a top-down manner. However, this method has limitations in terms of one-way information flow. To address this issue, GFPN has been developed to serve as a new “neck” that integrates global contextual information into the feature pyramid, allowing for a more holistic understanding of information. This helps in capturing long-range dependencies and improving the overall performance. However, GFPN significantly increases the model size, which makes it unfriendly for real-time detection applications.

The CoT-GFPN is displayed in Fig. 5. In this network, four prediction heads for detecting different target sizes are first employed, and feature maps of different scales are extracted by the CoT modules. There are two information flows: one path is the skip-layer connection between same-level layers, and the other is the queen-fusion that integrates high-level semantic information with low-level geometric information. The skip-layer connection allows direct transfer of information between feature maps from different layers that share the same dimensions, enhancing the transmission of intricate details present in the low-level feature maps while integrating them into the high-level feature maps. Queen-fusion adopts an adaptive weight strategy during fusion by assigning different weights to feature maps from the same level and neighboring levels, allowing the model to selectively emphasize more informative features while suppressing less useful ones, which results in more effective multi-scale feature fusion (a sketch of such weighted fusion follows Fig. 5). Moreover, this connection is performed sequentially, starting from low resolution and gradually progressing to high resolution. This approach effectively utilizes feature information at various scales, enhancing the network's capability to perceive multi-scale targets.

Fig. 5 The design of the CoT-GFPN network. The CoT modules extract effective information and transfer it to the neck part. The simplified GFPN in the neck part has skip-layer connections between same-level layers and queen-fusion connections between different-level layers
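The adaptive weighting in queen-fusion can be sketched as a learnable normalized weighted sum over same-shaped inputs, in the spirit of fast normalized fusion; the exact weighting scheme used in GFPN may differ, so the snippet below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Adaptive-weight fusion of n same-shaped feature maps, a sketch of
    the queen-fusion weighting described above (fast-normalized-fusion
    style; the exact GFPN scheme may differ)."""
    def __init__(self, n, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n))  # one learnable weight per input
        self.eps = eps

    def forward(self, feats):
        w = self.w.relu()                     # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize to sum ~ 1
        return sum(wi * f for wi, f in zip(w, feats))

# e.g. fusing a same-level skip connection with up-/down-sampled neighbors
feats = [torch.randn(1, 256, 40, 40) for _ in range(3)]
print(WeightedFusion(3)(feats).shape)  # torch.Size([1, 256, 40, 40])
```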

3.5 Prediction Network

Four prediction heads with sizes of 10 × 10 × 1024, 20 × 20 × 512, 40 × 40 × 384, and 80 × 80 × 256 are set to output detection information. After 1 × 1 convolution layers with stride one and a sigmoid activation function, the number of channels is adjusted to C = 3 × (6 + 5). For each grid cell, three bounding boxes are designed, each outputting three groups of parameters: (1) the box parameters with four values, namely the center coordinates (x, y) and the width and height (w, h); (2) a confidence value between 0 and 1; (3) six conditional class probabilities, each between 0 and 1. The final outputs of the convolution layers are three-dimensional tensors of 10 × 10 × 33, 20 × 20 × 33, 40 × 40 × 33, and 80 × 80 × 33, as illustrated in the sketch below. Each branch has feature maps with different receptive field sizes, allowing the detection of large, medium, small, and extra-small targets.
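The sketch below illustrates how one head's 33 output channels could decompose per grid cell, assuming the layout 3 anchors × (4 box + 1 confidence + 6 classes); the channel ordering is an assumption for illustration.

```python
import torch

# One head's raw output: (batch, 33, H, W) with 33 = 3 anchors x (4 + 1 + 6).
pred = torch.randn(1, 33, 80, 80)
b, _, h, w = pred.shape
pred = pred.view(b, 3, 11, h, w)    # split out the anchor dimension

box  = pred[:, :, 0:4]              # x, y, w, h (raw regressions)
conf = pred[:, :, 4:5].sigmoid()    # objectness in (0, 1)
cls  = pred[:, :, 5:].sigmoid()     # 6 class probabilities in (0, 1)
print(box.shape, conf.shape, cls.shape)
# torch.Size([1, 3, 4, 80, 80]) torch.Size([1, 3, 1, 80, 80]) torch.Size([1, 3, 6, 80, 80])
```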

In YOLOv5s, the CIOU (Complete Intersection over Union) loss function is used for the regression of the prediction box. This loss function considers the IOU, the aspect ratio, and the distance between the center points of the real and predicted boxes. However, CIOU only considers the relative aspect ratio between the predicted and ground truth bounding boxes rather than their actual width and height differences, which can lead to ambiguity or error in specific scenarios. A good approach to this problem is the Focal-EIOU (Focal and Efficient Intersection over Union) loss function. The EIOU loss [29] is expressed as follows:

$$L_{EIOU} = 1 - IOU + \frac{\rho^{2}(B_{ct}, B_{ct}^{gt})}{c_{w}^{2} + c_{h}^{2}} + \frac{\rho^{2}(w, w^{gt})}{c_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{c_{h}^{2}},$$
(7)

where \(B_{ct}^{gt}\), \(w^{gt}\) and \(h^{gt}\) are the center coordinates, width, and height of the real bounding box; \(B_{ct}\), \(w\) and \(h\) are the center coordinates, width, and height of the predicted bounding box; and \(c_{w}\), \(c_{h}\) are the width and height of the minimum rectangle enclosing both boxes. \(\rho(\cdot,\cdot)\) denotes the Euclidean distance.

In addition, IOU loss can be expressed as follows:

$$IOU = \frac{|A \cap B|}{{|A \cup B|}},$$
(8)

where A represents the predicted box's area and B represents the ground truth box's area; the formula measures the ratio between the overlapping area and the union area. The Focal-EIOU loss \(L_{Focal\text{-}EIOU}\) can be expressed as:

$$L_{Focal\text{-}EIOU} = IOU^{\gamma} L_{EIOU},$$
(9)

where \(\gamma\) is a hyper-parameter that controls the curvature of the loss curve; it is set to 2 in this paper.
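The loss of Eqs. (7)–(9) can be sketched directly from the definitions above, for axis-aligned boxes given in (x1, y1, x2, y2) form; variable names follow the equations, and implementation details such as the epsilon guards are illustrative.

```python
import torch

def focal_eiou(pred, gt, gamma=2.0, eps=1e-7):
    """Focal-EIOU loss of Eqs. (7)-(9) for boxes in (x1, y1, x2, y2) form."""
    # Intersection and union -> IOU (Eq. (8)).
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Widths, heights, centers; smallest enclosing box gives (c_w, c_h).
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    cg = (gt[:, :2] + gt[:, 2:]) / 2
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # EIOU loss, Eq. (7): IOU term plus center, width, and height penalties.
    eiou = (1 - iou
            + ((cp - cg) ** 2).sum(dim=1) / (cw ** 2 + ch ** 2 + eps)
            + (wp - wg) ** 2 / (cw ** 2 + eps)
            + (hp - hg) ** 2 / (ch ** 2 + eps))
    return (iou.clamp(min=eps) ** gamma * eiou).mean()  # Eq. (9)

pred = torch.tensor([[10., 10., 50., 60.]])
gt = torch.tensor([[12., 15., 48., 62.]])
print(focal_eiou(pred, gt))
```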

YOLOv5 uses the k-means clustering algorithm to generate prior bounding boxes of different sizes, which contributes to predicting targets at different scales. Because four detection heads are employed in this paper, 12 prior bounding boxes are generated to improve the detection accuracy of steel surface defects. Table 1 lists the prior bounding boxes assigned to each detection head.

Table 1 Prior bounding boxes for each detection head

4 Experiments

This section introduces the dataset and evaluation metrics, followed by comparative experiments with different models and detectors on the NEU-DET dataset. The experimental comparison shows that the performance of CRGF-YOLO is reasonable and that it achieves real-time detection in industrial scenarios.

4.1 Datasets and Experimental Environment

The NEU-DET dataset [30] used in this paper is a steel strip surface defect database offered by Northeastern University, which includes six defect types: inclusion, pitted surface, crazing, rolled-in scale, patches, and scratches. It contains 1800 gray-scale pictures of 200 × 200 pixels, with 300 labeled samples per defect type. Samples of the six types are shown in Fig. 6. The dataset is divided into training, validation, and test sets in a ratio of 8:1:1.

Fig. 6 Presentation of six types of defects in the NEU-DET dataset

The experiments are performed on a configuration with an NVIDIA GeForce RTX 3060 Ti GPU, using the PyTorch 1.9.0 deep-learning framework. By utilizing the adaptive image scaling function, the input image size is adjusted to 640 × 640 pixels. This strengthens the detection ability at various scales, particularly for small targets. Moreover, enlarging the image decreases the model's sensitivity to resolution and aspect ratio, improving its robustness and generalization ability.

Moreover, the SGD optimizer uses a momentum of 0.9 and a weight decay of 0.0005, and the learning rate is increased from 0.0033 to 0.01. The models mentioned below are trained for 300 epochs with a batch size of 8. To enhance robustness, data augmentation methods such as Mosaic, image flipping, and HSV conversion are applied at the training stage; the optimizer setup is sketched below.
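For reference, the optimizer settings above translate into roughly the following PyTorch setup; `model` is a placeholder, and the linear warmup shape and its length are assumptions, since the text only gives the start and end learning rates.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the CRGF-YOLO network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0033,
                            momentum=0.9, weight_decay=0.0005)

warmup_epochs, total_epochs = 3, 300  # warmup length is an assumption
for epoch in range(total_epochs):
    if epoch < warmup_epochs:  # linearly ramp the lr from 0.0033 to 0.01
        lr = 0.0033 + (0.01 - 0.0033) * epoch / warmup_epochs
        for g in optimizer.param_groups:
            g["lr"] = lr
    # ... one epoch of training with batch size 8 ...
```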

4.2 Evaluation Metrics

To demonstrate the performance of the CRGF-YOLO algorithm in an objective manner, the following metrics are utilized:

a. Precision (P), Recall (R), and mean Average Precision (mAP) are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN},$$
$$\text{mAP} = \frac{1}{N}\sum\nolimits_{i=1}^{N} AP_{i},$$

where TP denotes true positives, FP false positives, FN false negatives, \(AP_{i}\) the average precision for class i, and N the number of classes.

P measures the accuracy of the model's positive predictions, while R measures the model's ability to detect positive instances; mAP denotes the average precision across categories (a code sketch of these metrics follows at the end of this subsection).

b. Params (M), Inference Time, and GFLOPs

Params quantifies the model size, the computational volume of the model is represented by FLOPs (Floating Point Operations, 1 GFLOPs = \(10^{9}\) FLOPs), and Inference Time reflects the detection speed.
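A sketch of the metric bookkeeping in this subsection is given below, using single-threshold counts and illustrative per-class AP values for the six NEU-DET classes; a full mAP computation would integrate precision over the precision-recall curve for each class.

```python
def precision_recall(tp, fp, fn):
    """Precision and Recall from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_ap(ap_per_class):
    """mAP: the average of the per-class Average Precision values."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative counts and per-class AP values (not results from this paper).
p, r = precision_recall(tp=90, fp=12, fn=18)
print(round(p, 3), round(r, 3))                       # 0.882 0.833
print(mean_ap([0.71, 0.85, 0.47, 0.83, 0.94, 0.92]))  # ~0.787
```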

4.3 Ablation Study

To assess and analyze the contribution of each module, ablation experiments are conducted to evaluate the following components: simplified GFPN, an additional detection head for detecting small targets, CoT module with the self-attention mechanism, structural reparameterization, and the improved loss function Focal-EIOU. Related improved models are displayed in Table 2, and evaluation metrics relevant to the corresponding cases are illustrated in Table 3.

Table 2 Cases of ablation study
Table 3 Evaluation metrics about corresponding cases evaluated on the NEU-DET dataset

In this section, various experiments were conducted to assess the efficiency of the presented one-stage detector, focusing on the main evaluation metrics. As shown in Table 3, the evaluation metrics improve when another detection head is added and prior bounding boxes are generated with the k-means clustering algorithm: Model 1 improves R and mAP by 3.3% and 2.9%, respectively. Compared to the baseline model, all evaluation metrics improve to different degrees in Model 2. The improved neck part incorporates additional feature information from different levels using queen-fusion and skip-layer connections; although this fusion increases the computational load and structural complexity, it yields stronger generalization and greater robustness. The multi-scale CoT-GFPN network in Model 4 increases mAP by 5.0%, with varying degrees of enhancement in defect detection at different scales. It is noteworthy that the parameter count rises sharply to 17.6 M.

To lighten Model 4, DSC plays an important role in reducing parameters and computational cost. Through the employment of DSC, the model size is reduced by 24%. However, the separate filters in DSC can transmit inadequate information across channels: P and mAP decrease by 2.1% and 0.3% compared with Model 4.

To reduce the number of parameters and improve the generalization ability of the model, structural reparameterization is applied to DSC and BottleneckCSP. Compared with Model 5, Model 6 shows increases of 1.0% and 2.1% in mAP and R. To accelerate convergence and reduce the imbalance among sample categories, the Focal-EIOU loss function is implemented, increasing the mAP value by 1.9% to 82.2%. As a result, CRGF-YOLO improves mAP and P by 7.7% and 4.1% compared to YOLOv5s.

4.4 Experiments on Attention Mechanisms with Grad-CAM

Understanding the network's capacity to extract efficient features requires examining the regions that drive a class prediction. Grad-CAM [31] is introduced to visualize the importance of detection regions in convolutional layers and present them as heat maps. Grad-CAM computes significance weights for each channel of the feature maps based on the gradients flowing into them, creating heat maps that represent the model's regions of interest. This improves the visibility and interpretability of deep neural networks and helps users understand models better.
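A from-scratch sketch of the Grad-CAM computation described above is shown below, using a torchvision classifier as a stand-in for the detector; applying it to a detection head additionally requires choosing an appropriate scalar score to backpropagate.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, class_idx):
    """Minimal Grad-CAM: weight each channel of a target layer's feature
    map by the spatial mean of its gradient, then ReLU the weighted sum."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        score = model(x)[0, class_idx]   # assumes a (batch, classes) output
        model.zero_grad()
        score.backward()
    finally:
        h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)     # channel weights
    cam = F.relu((w * feats["a"]).sum(dim=1))         # (B, h, w) heat map
    cam = F.interpolate(cam[None], size=x.shape[2:],  # upsample to input size
                        mode="bilinear", align_corners=False)[0]
    return cam / (cam.max() + 1e-8)                   # normalize to [0, 1]

# Illustrative usage on a torchvision classifier (a stand-in for the detector).
from torchvision.models import resnet18
m = resnet18(); m.eval()
heat = grad_cam(m, m.layer4, torch.randn(1, 3, 224, 224), class_idx=0)
print(heat.shape)  # torch.Size([1, 224, 224])
```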

Experiments are performed on the NEU-DET dataset, and Grad-CAM is applied to the networks below. To verify that the CoT module with self-attention has better effects on feature extraction, the CoT module is replaced in turn with SE [32], CBAM (Convolutional Block Attention Module) [33], SimAM (Simple Attention Module) [34], and ECA (Efficient Channel Attention) [35]; the visualization results of the CoT-integrated network (Network + CoT) are then compared with the baseline (Model 6 without CoT modules), the SE-integrated network (Network + SE), the CBAM-integrated network (Network + CBAM), the SimAM-integrated network (Network + SimAM), and the ECA-integrated network (Network + ECA). For every network, six images with different defect types are selected for comparison, and Grad-CAM displays the focus areas and confidences. Figure 7 shows the visualization results of the network integrated with different attention mechanisms, and the relevant evaluation metrics are given in Table 4.

Fig. 7 Grad-CAM visualization results of networks with different attention mechanisms

Table 4 Evaluation metrics about Grad-CAM visualization results on networks with different attention mechanisms

In Fig. 7, Grad-CAM shows the attended areas for various defect types, and the visualization results improve to different degrees when attention mechanisms are added. Compared with the baseline network, the CoT-integrated network performs best at extracting effective information and covers the largest attended areas in the target regions. It can be seen that complex background textures negatively impact the extraction of efficient features, which is particularly evident in the pictures of crazing, pitted surface, and rolled-in scale. The CoT module enhances feature expression by establishing contextual relationships between information to effectively locate and enlarge the attended area, improving the network's performance and reducing interference from background factors.

Combined with the data in Table 4, it can be seen that the mAP values of the different networks increase with attention mechanisms. Although the number of parameters increases by 40%, the CoT-integrated network outperforms the others, improving P, R, and mAP by 2.4%, 4.5%, and 3.8% compared with the baseline network. This proves that the CoT module capitalizes on contextual information to strengthen the capacity of visual representation and transfers useful information to the network.

4.5 Comparison of Different Models

Table 5 shows the results of different detectors evaluated on the NEU-DET dataset. To measure the effectiveness of defect detection, both detection speed and accuracy must be considered; metrics such as mAP, Inference Time, and Params are important indicators of these capabilities. Under the experimental environment and parameter settings in this paper, six representative existing methods are selected for comparison: Faster R-CNN, YOLOv3_spp [36], YOLOv5s, YOLOv7s [37], YOLOX_s [38], and YOLOv8s [39].

Table 5 The results of different networks evaluated on the NEU-DET dataset

As shown in Table 5, Faster R-CNN achieves satisfactory mAP values for Patches and Scratches, but its model size is the largest and its Inference Time the longest, making it unsuitable for lightweight deployment. In comparison, the YOLO series has shorter Inference Times and smaller model sizes, making it a more efficient option. Among the compared algorithms, YOLOv5s has the smallest model size and the fastest detection speed, but its mAP value is not satisfactory. On the other hand, our FPS is 46% lower than that of YOLOv5s because of the increased model size. Influenced by complex textures, Crazing has obscure boundaries, and its mAP is significantly lower than that of the other types. Relying on the CoT module with the self-attention mechanism and multi-scale feature fusion across layers, the ability of feature expression is enhanced; as a result, the performance on Crazing is increased by 17.2% compared to YOLOv5s. In addition, the mAPs of the other types are also improved to different degrees, which demonstrates that CRGF-YOLO has a strong ability to detect defects at different scales, and its FPS is higher than that of Faster R-CNN. Some visualization results of CRGF-YOLO on the NEU-DET dataset are shown in Fig. 8.

Fig. 8 Comparison of visualization results from different models on the NEU-DET dataset

5 Limitations and Future Research

The improved model based on YOLOv5s outperforms other algorithms. However, interference from complex background noise can affect the detection of some defects. It is crucial to explore methods that minimize the impact of background noise on defect targets and enhance inspection outcomes. In addition, for categories that exhibit minimal intra-class differences, research on edge detection can significantly enhance the classification accuracy of similar categories. Furthermore, the deep-learning framework encompasses numerous hyperparameters, necessitating further experiments to identify the optimal detection parameters and avoid falling into local optima.

To enhance defect detection performance, future work can explore the use of Generative Adversarial Networks (GANs) and their variants to expand the dataset. In addition, improving image preprocessing by suppressing background noise and enhancing local feature information can improve the accuracy and robustness of the model. YOLO may not be the optimal choice as the base model: other CNNs, or even custom-built networks, can be considered, as well as further improved methods such as dilated convolution and other transformer variants, which help the network enlarge the receptive field for feature extraction. Although different networks may perform well in validation on datasets, they often require on-site testing and fine-tuning in various industrial scenarios to meet requirements.

6 Conclusions

As research on CNNs for steel defect detection progresses, researchers have started training and evaluating models on defect datasets collected in real time to improve detection capabilities. This study specifically focuses on improving YOLOv5s, realizing strong outcomes on the NEU-DET dataset.

In this paper, an enhanced YOLO model called CRGF-YOLO is designed for detecting steel surface defects. Based on YOLOv5s, four detection heads are employed to detect targets at different scales, improving the detection of small targets. The CoT module utilizes the self-attention mechanism to improve feature expression by connecting contextual information from different feature maps, resulting in more efficient information transfer to the neck part. Meanwhile, the simplified GFPN facilitates multi-scale feature fusion and enables effective information exchange between different feature maps. Grad-CAM visualizations show that the CoT-GFPN network outperforms the others and covers the largest attended areas in the detection regions. Furthermore, Rep-DSC and Rep-CSP implement structural reparameterization to create nonlinear, diversified connections and improve the generalization ability of the network while maintaining high detection accuracy and a compact model. Finally, the Focal-EIOU loss function is utilized to accelerate model convergence and alleviate the imbalance among sample categories. Based on the experimental results and evaluations, CRGF-YOLO outperforms the other detectors at the training and validation stage with an mAP of 82.2%, indicating that the improved model achieves high detection accuracy while meeting real-time requirements.

This study makes significant contributions by offering valuable insights and practical guidance for enhancing defect detection technology in the steel industry. By overcoming the drawbacks of conventional methods and harnessing innovative techniques, this research sets the stage for more robust and effective defect inspection systems. Ultimately, it aids in strengthening product quality and minimizing economic losses within steel manufacturing.