Introduction

Fig. 1

Examples of steel surface defects. a inclusion. b scratches. c crazing. d patches. e pitted_surface. f rolled-in_scale

Steel is widely used in industrial manufacturing, so detecting steel surface defects is crucial [1]. Steel exhibits a wide variety of surface defects, such as “inclusion”, “scratches”, “crazing”, “patches”, “pitted_surface” and “rolled-in_scale”. These defects have complex features, as shown in Fig. 1. With manual inspection, it is difficult to distinguish the type of a defect and to locate it accurately [2]. Defect detection requires both identifying the defect category and localizing the defect, which makes it a very challenging task. With the emergence and development of machine learning, machine learning-based industrial defect detection methods have become useful tools [3]. In this regard, Krummenacher et al. [4] designed novel wavelet features for time series data and trained a support vector machine classifier to detect wheel defects. Chan and Pang [5] used a simulated fabric model to understand the relationship between the fabric structure in the image space and in the frequency space. In [6], a method based on acoustic emission was proposed to detect defects in carbon steel welded joints. However, these traditional machine learning-based defect detection methods have low detection accuracy and cannot meet the needs of industry.

With the development of deep learning technology, deep learning-based object detection methods have advanced rapidly [7, 8]. Object detection models based on deep learning are mainly divided into two types: two-stage models and one-stage models. Representative two-stage object detection models include R-CNN [9], Fast R-CNN [10], and Faster R-CNN [11]. Well-known one-stage object detection models include SSD [12], the YOLO series [13,14,15,16], and RetinaNet [17]. These methods are widely used in industrial defect detection. Lin et al. [18] applied a convolutional neural network (CNN) to LED chip defect inspection. Wei et al. [19] used Faster R-CNN to detect fastener defects. In [20], a novel defect detection network (DefectNet) was proposed to solve the problem of defect detection. However, these deep learning-based industrial defect detection methods do not specifically optimize the model’s ability to extract and fuse features, which is crucial for defect detection. How to improve the feature extraction and feature fusion capabilities of the model is the problem this paper sets out to solve.

Fig. 2

The structure diagram of the YOLOv5s model. The conv2d module represents a two-dimensional convolutional layer

Fig. 3

The main components of the YOLOv5s model

In this paper, a steel surface defect detection model is introduced to solve the problems existing in the above methods. YOLOv5s [21] is selected as the baseline model because of its small number of parameters and high inference speed. The features of steel surface defects are complex, which makes them difficult for the model to extract. To improve the feature extraction ability of the model, this paper introduces the Multiscale Feature Extraction (MSFE) module. The MSFE module extracts multiscale features through three branches with convolutional kernels of different sizes and then fuses them. In the later layers of the model, the features extracted by the shallow layers tend to disappear. To solve this problem, the features generated in the backbone network are fed into the neck network for feature fusion. This design is called the Efficient Feature Fusion (EFF) module. The EFF module fuses the features generated in the backbone network with the features generated in the neck network more efficiently. Further, some of the normalization layers and activation functions in the bottleneck module are removed to reduce the negative impact of having too many normalization layers and activation functions in the model. Finally, the backbone network is deepened to further improve the ability of the model to extract features.

The main contributions of this paper are as follows.

  1. A deep learning model for steel surface defect detection is proposed. The effectiveness of the proposed modules and model is validated by extensive ablation studies and comparison experiments on a challenging steel surface defect detection dataset (NEU-DET).

  2. An efficient feature extraction module (MSFE) is introduced to improve the ability of the model to extract features. The MSFE module uses three branches with convolutional kernels of different sizes for multiscale feature extraction.

  3. An effective feature fusion scheme called EFF is introduced, which efficiently fuses the features generated in the backbone network with the features in the neck network.

  4. A new bottleneck module is designed, which has fewer normalization layers and activation functions. In addition, the backbone network of the model is deepened to further enhance the model’s ability to extract features.

The rest of this paper is organized as follows. “Related works” section discusses related works on the YOLOv5s model and industrial defect detection methods. The proposed method and modules are presented in “The proposed method” section. Datasets used in the experiments, performance evaluation metrics, and extensive experimental results are provided in “Experiments and analyses” section. Finally, “Conclusion” section concludes the paper.

Related works

YOLOv5s

YOLOv5 [21] has five model structures of different sizes, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s maintains a relatively lightweight model size while having great detection accuracy. Therefore, YOLOv5s is chosen as the baseline model in this paper. The structure of YOLOv5s is shown in Fig. 2. The most basic module in YOLOv5s is called Conv as shown in Fig. 3a, which is composed of a convolution layer, normalization layer, and SiLU activation function. Figure 3b depicts the bottleneck module, which is made up of a \(1\times 1\) Conv and a \(3\times 3\) Conv connected in series.

Note that the bottleneck module in the backbone network has a residual connection between its input and output, whereas the bottleneck module in the neck network does not. The C3 module is shown in Fig. 3c. It has two branches. The first branch is composed of a \(1\times 1\) Conv module and X bottleneck modules, and the second branch has only a \(1\times 1\) Conv module. The two branches are followed by a concat module, which concatenates their outputs along the channel dimension. At the end of the C3 module is a \(1\times 1\) Conv module, which controls the number of output channels. The final SPPF module of the backbone network is used for multi-scale feature fusion; its structure is shown in Fig. 3d. The YOLOv5s backbone network consists of Conv modules, C3 modules, and an SPPF module. The neck network is the PANet [22] structure, and the head network is the YOLO head [15]. The backbone network is responsible for feature extraction and produces three feature maps with different sizes and channel counts. The neck network fuses this feature information, and the head network outputs the prediction results.

Table 1 The average size of each category in NEU-DET dataset

Deep learning in industrial defect detection

With the rapid development of deep learning, more and more deep learning methods are being used to detect industrial defects [23]. Yu et al. [24] proposed a lightweight and efficient defect detection network (LEDD-Net) for PCB defect detection. They designed a novel backbone network and a neck network that can effectively fuse multiscale features, together with an adaptive localization loss function to compute the localization loss. Cheng and Yu [25] used an optimized RetinaNet to detect steel surface defects. First, they proposed a differential evolution search algorithm to optimize the anchor configuration. Then, they proposed a novel channel attention module to make the model learn more important channel information. They also integrated the adaptively spatial feature fusion (ASFF) [26] module into the model to make full use of shallow and deep feature information. Su et al. [27] proposed a CNN-based detector for photovoltaic cell defect detection. They designed a multihead cosine nonlocal attention module to make the model suppress unfavorable features and learn useful information, and proposed a BAFPN to enhance the feature fusion ability of the model.

Ying et al. [28] modified the YOLOv5s model to detect wire braided hose defects. The K-means++ clustering algorithm was adopted to obtain more suitable anchor boxes, and the Focal loss function was adopted to balance negative and positive samples. They used the efficient channel attention (ECA) mechanism to improve the detection performance. Chen et al. [29] proposed a method based on deep convolutional neural networks to detect fastener defects. Zeng et al. [30] proposed a novel feature fusion method for small object detection, namely the ABFPN, and embedded it as the feature fusion method in an IPDD framework used to detect PCB surface defects. Li et al. [31] presented a tunnel surface image preprocessing approach and proposed a multi-layer feature fusion network based on Faster RCNN to detect tunnel surface defects. Cui et al. [32] put forward a fast network for surface defect detection, called SDDNet. They introduced a feature retaining block (FRB) and a skip densely connected module (SDCM) to overcome the issues of large texture variation and small defect size. Chen et al. [33] designed a visual defect detection method based on a multi-spectral deep convolutional neural network (CNN) to detect solar cell surface defects.

Fig. 4

The structure diagram of the proposed MSFE module

The proposed method

Multiscale feature extraction module

There are many categories of steel surface defects, and the size of the defects varies widely across categories. The average size of each category in the NEU-DET [34] dataset is shown in Table 1. To enhance the feature extraction ability of the model, this work designs a Multiscale Feature Extraction (MSFE) module, as shown in Fig. 4. The MSFE module has three branches. The first branch is composed of a \(1\times 1\) Conv module and X bottleneck modules, in which the second Conv of each bottleneck is a \(3\times 3\) Conv module. The second branch also includes a \(1\times 1\) Conv module and X bottleneck modules, but the second Conv of each bottleneck is a \(5\times 5\) Conv module. The third branch has only a \(1\times 1\) Conv module. A Concat module after the three branches concatenates their outputs along the channel dimension. The final module of the MSFE module is a \(1\times 1\) Conv module, which fuses the outputs of the three branches. Finally, the convolutional layer of the second Conv module in the bottleneck module is replaced by a depthwise convolutional [35] layer to save parameters, as shown in Fig. 4b. In summary, the proposed MSFE module extracts multiscale features through three branches with different convolutional kernel sizes and fuses them with a Concat module and a \(1\times 1\) Conv module. Compared with the C3 module of the baseline model, the proposed MSFE module can extract features more effectively. This work therefore replaces the C3 module of the baseline model with the MSFE module.
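To make the parameter saving from depthwise convolution concrete, the following minimal sketch counts the learnable parameters of a single C3 module and a single MSFE module. It is an illustration under stated assumptions, not the authors' implementation: each Conv module is assumed to be a bias-free convolution followed by a BatchNorm layer with two parameters per channel, every branch is assumed to use half of the output channels as its width, and the function names (`conv_params`, `bottleneck_params`, `c3_params`, `msfe_params`) are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Parameters of one Conv module: bias-free convolution weights
    plus BatchNorm scale and shift (assumed: 2 parameters per channel)."""
    return (c_in // groups) * c_out * k * k + 2 * c_out


def bottleneck_params(c, k, depthwise):
    """A 1x1 Conv followed by a k x k Conv.
    If depthwise is True, the second Conv uses one group per channel."""
    groups = c if depthwise else 1
    return conv_params(c, c, 1) + conv_params(c, c, k, groups)


def c3_params(c_in, c_out, n):
    """C3: two 1x1 branch Convs, n standard 3x3 bottlenecks on the first
    branch, and a 1x1 output Conv after concatenation."""
    hidden = c_out // 2  # assumed branch width
    branches = 2 * conv_params(c_in, hidden, 1)
    blocks = n * bottleneck_params(hidden, 3, depthwise=False)
    return branches + blocks + conv_params(2 * hidden, c_out, 1)


def msfe_params(c_in, c_out, n):
    """MSFE as described in the text: three 1x1 branch Convs, n depthwise
    3x3 bottlenecks on branch one, n depthwise 5x5 bottlenecks on branch
    two, and a 1x1 fusion Conv after concatenation."""
    hidden = c_out // 2  # assumed branch width (not stated in the paper)
    branches = 3 * conv_params(c_in, hidden, 1)
    blocks = n * (bottleneck_params(hidden, 3, depthwise=True)
                  + bottleneck_params(hidden, 5, depthwise=True))
    return branches + blocks + conv_params(3 * hidden, c_out, 1)


if __name__ == "__main__":
    for channels in (64, 128, 256):
        print(channels, c3_params(channels, channels, 1),
              msfe_params(channels, channels, 1))
```

Under these assumptions, the extra branch and the larger \(5\times 5\) kernel cost fewer parameters than the single standard \(3\times 3\) convolution they replace, because a depthwise kernel scales with the number of channels rather than with its square.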

Efficient feature fusion module

Feature fusion is a key research area in object detection. The shallow feature maps generated by the backbone network carry stronger location information, while the deep feature maps carry stronger semantic information. Strong location information facilitates the prediction of the location of defects, while strong semantic information facilitates the prediction of the category of defects. Effectively fusing the two can therefore improve the detection performance of the model. The FPN (Feature Pyramid Network) [36] is a well-known feature fusion method. Its top-to-bottom process passes strong semantic information from top to bottom and fuses it with the location information through lateral connections. Compared with the FPN, the PANet [22] has an additional bottom-to-top process that passes strong location information from bottom to top and fuses it with the semantic information through lateral connections. Therefore, PANet has a better fusion effect than FPN. However, as the model deepens, the features extracted by the backbone network become blurred in the bottom-to-top process of PANet. To solve this problem, the features generated in the backbone network are fed into the bottom-to-top process of PANet for feature fusion. This operation is called the EFF (Efficient Feature Fusion) module. The EFF module is illustrated in Fig. 5 and is computed as:

$$\begin{aligned} \text {EFF}(F_{1},F_{2},F_{3})=\text {Conv1}(\text {Concat}(F_{1},F_{2}))+\text {Conv2}(F_{3}), \end{aligned}$$
(1)

where \(F_{1}\) is the feature from the layer preceding the EFF module, \(F_{2}\) is the feature from the top-to-bottom process of PANet, and \(F_{3}\) is the feature from the backbone network. Conv1 consists of a \(1\times 1\) convolutional layer, a normalization layer, and a SiLU activation function. Conv2 consists of a \(3\times 3\) convolutional layer, a normalization layer, and a SiLU activation function.
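Because Eq. (1) ends in an element-wise addition, the two operands must have identical shapes. The following shape-only sketch illustrates that constraint. It assumes stride-1 convolutions with padding of half the kernel size and uses hypothetical feature map sizes; the helper names (`conv_shape`, `concat_shape`, `eff_shape`) are illustrative and do not come from the paper.

```python
def conv_shape(shape, c_out, k, stride=1):
    """Output (C, H, W) of a k x k convolution with padding k // 2 (assumed)."""
    _, h, w = shape
    pad = k // 2
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    return (c_out, h_out, w_out)


def concat_shape(a, b):
    """Channel-wise concatenation; spatial sizes must agree."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial sizes differ: {a[1:]} vs {b[1:]}")
    return (a[0] + b[0], a[1], a[2])


def eff_shape(f1, f2, f3, c_out):
    """Shape of Conv1(Concat(F1, F2)) + Conv2(F3) from Eq. (1)."""
    left = conv_shape(concat_shape(f1, f2), c_out, k=1)   # Conv1: 1x1
    right = conv_shape(f3, c_out, k=3)                    # Conv2: 3x3
    if left != right:
        raise ValueError(f"cannot add {left} and {right}")
    return left


if __name__ == "__main__":
    # Hypothetical sizes for one pyramid level of a 640 x 640 input.
    print(eff_shape((128, 40, 40), (128, 40, 40), (256, 40, 40), c_out=256))
```

In this reading, Conv1 reduces the concatenated channels and Conv2 projects the backbone feature to the same channel count, so the sum is well defined without any resizing.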

Fig. 5

The structure diagram of the EFF module, where Conv module includes a convolutional layer, a normalization layer, and a SiLU activation function

New bottleneck

ConvNeXt [37] uses fewer normalization layers and activation functions in its blocks, and this change improves its performance. We hypothesize that too many activation functions and normalization layers make the model overfit the data, which adversely affects training and prediction. Thus, this paper designs a new bottleneck module with fewer normalization layers and activation functions. As shown in Fig. 6a, the original bottleneck module of the MSFE module has two normalization layers and two activation functions. The new bottleneck proposed in this work, shown in Fig. 6b, has a single normalization layer and a single activation function. The new bottleneck is integrated into the MSFE module.

Fig. 6

Bottleneck modules. a The original bottleneck of the MSFE module. b The proposed bottleneck. conv2d is the convolutional layer, BatchNorm is the normalization layer, SiLU is the activation function, and dwconv2d is the depthwise convolutional layer

Fig. 7

The structure diagram of the proposed method

Deepen the backbone network

The defects on the steel surface have complex textures, and their features are difficult to extract. To further enhance the feature extraction ability of the model, the backbone network of the model is deepened by changing the stage ratio. The original stage ratio of the backbone network is 1:2:3:1 as shown in Fig. 2, and the new stage ratio is 2:2:4:2 after deepening the backbone network as shown in Fig. 7.

Model structure

After the above four improvements, the steel surface defect detection model is obtained and shown in Fig. 7. The proposed model is composed of three components: the backbone network, the neck network, and the head network. The backbone network is responsible for extracting features; the neck network is used to fuse features; and the head network outputs the detection results.

Experiments and analyses

Dataset

In the experiments, the public NEU-DET [34] dataset is used to verify the effectiveness and performance of the proposed method. The NEU-DET dataset contains 1800 images of steel surface defects in six categories, namely “crazing”, “patches”, “inclusion”, “pitted_surface”, “rolled-in_scale” and “scratches”. Some samples of the dataset are shown in Fig. 8. This paper splits the dataset into a training set and a test set at a ratio of approximately 8:2, which yields 1448 training images and 352 test images.

Fig. 8

Samples of the NEU-DET dataset, where a to f show the crazing, patches, inclusion, pitted_surface, rolled-in_scale and scratches categories of steel surface defects, respectively

Implementation details

All experiments in this work are conducted on an RTX 3090 GPU with Python 3.8, PyTorch 1.9.0, and CUDA 11.1. The batch size is set to 16 to avoid memory overflow. The resolution of the input images is set to \(640\times 640\). All models are trained for 300 epochs. The IoU threshold of the NMS operation is set to 0.6.
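To illustrate what the NMS IoU threshold controls, the following is a minimal sketch of greedy non-maximum suppression for boxes of a single class. It is a plain-Python illustration, not the routine used in the experiments; the `(x1, y1, x2, y2)` box format and the function names `iou` and `nms` are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.6):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already kept box does not exceed the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # indices of the boxes that survive
```

With a threshold of 0.6, two predictions are merged only when they overlap heavily, so a higher threshold keeps more neighbouring boxes and a lower one suppresses more of them.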

Table 2 Comparisons with other detection models on the NEU-DET dataset
Table 3 Ablation studies on the NEU-DET dataset

Evaluation metrics

TP denotes a true positive, i.e., a detection that corresponds to a real defect. FP denotes a false positive, i.e., a detection that does not correspond to a real defect. FN denotes a false negative, i.e., a real defect that the model misses. Precision measures how accurate the model is when it predicts a positive result: it is the number of true positives divided by the number of all predicted positives. Recall measures how well the model identifies all positive samples: it is the number of true positives divided by the number of all actual positives. A PR curve plots recall on the horizontal axis and precision on the vertical axis. AP is the area enclosed by the PR curve of a defect category and the coordinate axes. The precision, recall, AP, and mAP are computed as:

$$\begin{aligned} P = \frac{\text {TP}}{\text {TP}+\text {FP}}, \end{aligned}$$
(2)
$$\begin{aligned} R = \frac{\text {TP}}{\text {TP}+\text {FN}}, \end{aligned}$$
(3)
$$\begin{aligned} \text {AP} = \int _{0}^{1}P(R)\,\text {d}R, \end{aligned}$$
(4)
$$\begin{aligned} \text {mAP} = \frac{1}{N}\sum _{i=1}^{N} \text {AP}_{i}, \end{aligned}$$
(5)

where P is precision, R is recall, \(\text {AP}_{i}\) is the AP of the i-th defect category, and N is the number of defect categories. The mAP is thus the average of the APs over all categories. This paper uses mAP@0.5 (mAP at an IoU threshold of 0.5) and mAP@0.5:0.95 (mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05) as the evaluation metrics to verify the effectiveness of the proposed method.
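The metrics defined above can be sketched in a few lines of Python. The paper does not state how the integral in the AP definition is approximated, so this example assumes the common all-point interpolation, in which precision is first replaced by its monotonically decreasing envelope; the function names and the toy detection list are hypothetical.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground truths."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(is_tp, num_gt):
    """Area under the PR curve for one category.

    is_tp  -- one flag per detection, sorted by descending confidence,
              True if the detection matches a ground-truth defect
    num_gt -- number of ground-truth defects of this category
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [0.0]
    for hit in is_tp:
        tp += hit
        fp += not hit
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    recalls.append(1.0)
    precisions.append(0.0)
    # Assumed interpolation: make precision non-increasing in recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))


def mean_average_precision(aps):
    """mAP: mean of the per-category AP values."""
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Toy example: four detections, four ground-truth defects.
    print(average_precision([True, True, False, True], num_gt=4))
```

Whether a detection counts as a true positive depends on the IoU threshold used for matching, which is what distinguishes mAP@0.5 from mAP@0.5:0.95.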

In deep learning, FLOPs stands for “floating point operations” and is used to measure the computational complexity of a model. For a convolutional layer, FLOPs is computed as:

$$\begin{aligned} \text {FLOPs} = (2 \times C_{\textrm{in}} \times K^2-1) \times W_{\textrm{out}} \times H_{\textrm{out}} \times C_{\textrm{out}}, \end{aligned}$$
(6)

where \(C_{\textrm{in}}\) is the number of input channels, K is the kernel size, \(W_{\textrm{out}}\) and \(H_{\textrm{out}}\) are the width and the height of the output feature map, respectively, and \(C_{\textrm{out}}\) is the number of output channels. This paper uses FLOPs as the evaluation metric to compare the computational complexity of the models.
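As a worked instance of Eq. (6), the short sketch below evaluates the formula for one convolutional layer. Each output element needs \(C_{\textrm{in}} \times K^2\) multiplications and \(C_{\textrm{in}} \times K^2 - 1\) additions, which gives the factor \(2 \times C_{\textrm{in}} \times K^2 - 1\). The function name `conv_flops` and the layer sizes in the example are hypothetical.

```python
def conv_flops(c_in, k, w_out, h_out, c_out):
    """FLOPs of one bias-free convolutional layer according to Eq. (6)."""
    per_output_element = 2 * c_in * k * k - 1  # multiplications + additions
    return per_output_element * w_out * h_out * c_out


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 80 x 80 output map.
    flops = conv_flops(c_in=64, k=3, w_out=80, h_out=80, c_out=128)
    print(f"{flops / 1e9:.3f} GFLOPs")
```

The FLOPs of a whole model is obtained by summing this quantity over its convolutional layers, so both the channel widths and the resolution of the feature maps drive the total.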

Comparisons with other detection models

Several comparison experiments are conducted on the public NEU-DET dataset, and the results are reported in Table 2. The proposed method is compared with several state-of-the-art models, including YOLOv3, ScaledYOLOv4-csp [38], YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv7 [39]. The proposed method achieves the best performance in both mAP@0.5 and mAP@0.5:0.95. As Table 2 shows, the proposed model has only 2.5 M more parameters than the smallest model, and its FLOPs is only slightly larger than that of the smallest model. The proposed model obtains the highest scores, achieving 73.08% in mAP@0.5 and 37.57% in mAP@0.5:0.95. Compared with the well-known YOLOv3 model, the proposed method is higher by 2.69% in mAP@0.5 and 1.56% in mAP@0.5:0.95, while its number of parameters is only 0.15 times that of YOLOv3. Our model is higher by 3.71% in mAP@0.5 and 4% in mAP@0.5:0.95 than the ScaledYOLOv4-csp model, while its number of parameters is only 0.18 times that of ScaledYOLOv4-csp. The number of parameters increases from YOLOv5s to YOLOv5m to YOLOv5l, and the mAP@0.5 and mAP@0.5:0.95 of the three YOLOv5 models increase in the same order. Our method outperforms the YOLOv5l model (by 0.85% in mAP@0.5 and 0.83% in mAP@0.5:0.95) and has fewer parameters. Compared with the SOTA object detection model YOLOv7, the proposed model still obtains higher mAP@0.5 and mAP@0.5:0.95. Therefore, the proposed method achieves the best results while keeping a small number of parameters.

Fig. 9

Comparison of the proposed method and the baseline YOLOv5s. a mAP curve. b Loss curve

Ablation studies

In this section, several ablation studies on the NEU-DET dataset are conducted to verify the effectiveness of the modules proposed in this work. Table 3 displays the experimental results, which show the changes of the number of parameters, mAP@0.5 and mAP@0.5:0.95 during the ablation studies. The YOLOv5s model is called Model A, which is the baseline model. First, to verify the validity of the MSFE module, the C3 module is replaced by the MSFE module. This model is called Model B. Second, the EFF module is integrated into Model B to prove the efficiency of the EFF module, and the resulting model is called Model C. Third, the new Bottleneck is introduced into Model C to validate its effect. The obtained model is named Model D. Finally, the backbone network of Model D is deepened to verify the effectiveness of a deeper backbone network, and the obtained model is called Model E.

Performance of the MSFE module

Table 3 shows that after the C3 module of Model A is replaced by the MSFE module, mAP@0.5 increases by 0.81% and mAP@0.5:0.95 increases by 1%. At the same time, the number of parameters of the model decreases by about 1 M. Therefore, the MSFE module proposed in this paper is an efficient feature extraction module that improves the detection accuracy of the model while reducing the number of parameters. The reason is that the MSFE module uses three branches with convolutional kernels of different sizes for multiscale feature extraction, which effectively enhances the model’s ability to extract features.

Performance of the EFF module

Table 3 shows that after the EFF module is introduced into Model B, mAP@0.5 increases by 0.61% and mAP@0.5:0.95 increases by 1.12%. This improvement indicates that the EFF module is effective. The reason is that the EFF module efficiently fuses the features generated in the backbone network with the features in the neck network.

Impact of the new bottleneck

Table 3 shows that the number of parameters of Model C remains almost unchanged after it adopts the bottleneck module with fewer normalization layers and activation functions. The mAP@0.5 increases by 0.23% and the mAP@0.5:0.95 increases by 0.05%. The gain is small, but it comes at almost no cost in parameters, so adding the new bottleneck to the model is worthwhile. We attribute the gain to the new bottleneck reducing the negative impact of having too many normalization layers and activation functions in the model.

Impact of deepening the backbone network

Table 3 shows that after the backbone network of Model D is deepened, mAP@0.5 increases by 0.56%, mAP@0.5:0.95 increases by 0.38%, and the number of parameters increases by 0.18 M. Deepening the backbone network therefore further improves the feature extraction ability of the model at a small cost in parameters.

Comprehensive performance of the proposed model

Figure 9a compares the mAP curves of the proposed method and the YOLOv5s model, and Fig. 9b compares their loss curves during training. The proposed method reaches a higher mAP and a lower loss than the YOLOv5s model. Compared with the YOLOv5s model, our final model increases mAP@0.5 by 2.21% and mAP@0.5:0.95 by 2.55%.
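As a consistency check, the per-step gains quoted in the ablation subsections can be summed and compared with the overall improvement over YOLOv5s. The sketch below uses only the increments stated in the text; the dictionary and function names are hypothetical.

```python
# Gains in percentage points, as quoted in the ablation subsections.
MAP50_GAINS = {
    "MSFE module": 0.81,
    "EFF module": 0.61,
    "new bottleneck": 0.23,
    "deeper backbone": 0.56,
}
MAP50_95_GAINS = {
    "MSFE module": 1.00,
    "EFF module": 1.12,
    "new bottleneck": 0.05,
    "deeper backbone": 0.38,
}


def total_gain(gains):
    """Sum of the step-by-step gains, rounded to two decimals."""
    return round(sum(gains.values()), 2)


if __name__ == "__main__":
    print("mAP@0.5 total gain:", total_gain(MAP50_GAINS))
    print("mAP@0.5:0.95 total gain:", total_gain(MAP50_95_GAINS))
```

The sums agree with the overall improvements of 2.21% and 2.55% reported for the final model, since each ablation step is measured relative to the preceding model.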

Figure 10 shows some samples of the detection results obtained using the proposed method and the YOLOv5s model. In Fig. 10a, the YOLOv5s model detects only three of the four labels and misses the top right one, whereas the proposed method detects all four labels correctly. In Fig. 10b, there are three labels. The YOLOv5s model detects only two targets, while the proposed method finds all three precisely. Thus, the proposed method produces fewer missed detections than the YOLOv5s model. In Fig. 10c, the YOLOv5s model detects the background as the “patches” category, while our method gives the correct result. Therefore, our method has a lower false detection rate than the YOLOv5s model. In Fig. 10d, the YOLOv5s model cannot precisely detect the labeled defect, while our model detects it correctly. Thus, the proposed method also predicts defects more accurately than the YOLOv5s model.

Fig. 10

Comparison of detection results for our method and YOLOv5s. The left are labels, the middle are the detection results of the YOLOv5s model, and the right are the detection results of our method

Conclusion

In this paper, a steel surface defect detection method based on deep learning is proposed. Steel surface defects are complex and multiscale, which makes it difficult for a model to extract their features. To solve this problem, a multiscale feature extraction (MSFE) module is designed. The MSFE module uses three branches with convolutional kernels of different sizes for multiscale feature extraction. An efficient feature fusion (EFF) module is also proposed to overcome the problem of disappearing shallow features. The EFF module adds the feature maps of the backbone network to the neck network to improve the feature fusion ability of the model. Furthermore, a new bottleneck with a single normalization layer and a single activation function is introduced, and the backbone network is deepened to further enhance the feature extraction ability of the model. Extensive ablation experiments on the public NEU-DET dataset verify the effectiveness of the modules proposed in this work, and comparison experiments with several SOTA object detection models demonstrate the effectiveness of the proposed model. The experimental results show that the proposed method obtains the best mAP@0.5 and mAP@0.5:0.95 among the compared models. In the future, we will continue to optimize the model structure and improve the detection performance of the model for steel surface defect detection.