Introduction

Textiles are an integral part of people's daily lives. As a basic industry, the textile industry occupies an important position in world industry, so ensuring the quality of cloth production is vital. In actual production, the surface of cloth suffers from many defects caused by material quality, mechanical factors, dye type, yarn size, and human factors. Common cloth defect samples are shown in Fig. 1. These defects not only degrade product quality but also cause huge waste [1, 2]. Fabric surface defect detection has therefore become one of the key processes in industrial production, yet many factories still rely on manual inspection, which is costly, inefficient, and subjective: prolonged inspection fatigues workers' eyes, lengthens inspection time, reduces accuracy, and easily leads to false and missed detections. Therefore, a real-time fabric defect detection system needs to be developed to improve detection efficiency, reduce labor costs, and increase business benefits.

Fig. 1 Five common types of fabric defects: (a) Broken hole; (b) Flower board; (c) Pulp spot; (d) Three threads; (e) Stain

As neural networks have been proposed, more and more problems can be solved by various neural networks and their variants, such as nonlinear problems [3,4,5], object detection [6], object segmentation [7], dynamical system control [8,9,10,11], prediction of complex systems [12, 13], and natural language processing [14]. The continuous development of neural networks also provides an important tool in the field of fabric defect detection. At present, object detection models fall into two main categories. (1) Two-stage. The best-known example is the Region-CNN (R-CNN) [15] model, which generates a large number of candidate boxes by selective search [16] and then classifies and regresses them with a CNN. Compared with traditional methods, R-CNN significantly improves detection accuracy but brings a huge amount of computation. Although faster successors such as Fast R-CNN [17] and Faster R-CNN [18] have been proposed, two-stage methods still cannot achieve real-time detection. (2) One-stage. The best-known example is the you only look once (YOLO) family, which has no separate proposal generation phase. YOLOv5 is a very popular one-stage object detector that balances speed and accuracy: compared with the two-stage Faster R-CNN, it performs classification and regression only once, which greatly reduces computational complexity and improves detection efficiency, and it strengthens feature extraction and feature fusion through a deeper backbone and Neck. However, directly applying YOLOv5 to fabric defect detection is challenging, because many defects have colors similar to the background texture (Fig. 1e), defect sizes differ greatly (Fig. 1a, b), aspect ratios vary widely (Fig. 1d), and there are many small objects (Fig. 1a), making effective detection difficult.

We propose the PEI-YOLOv5 defect detection network to achieve fast and accurate real-time detection, considering the need for defect detection models to be deployed on embedded systems and to run in real time in actual production. The network has the following features: (a) the backbone is modified by replacing its last C3 module with the PDC3 module proposed in this paper, which effectively reduces the model FLOPs while improving feature extraction; (b) the feature fusion part is improved with the EB-2 and EB-3 modules proposed in this paper, which strengthen the fusion of defect spatial and channel information, especially for defects similar to the background and defects with extreme aspect ratios; (c) the loss function is improved: IN loss integrates the normalized Wasserstein distance and CIOU, significantly improving the detection of small defects and the convergence speed of the model.

The contributions of this paper are as follows:

  (1) A faster convolution, PDConv, is proposed, which effectively extracts spatial features and still improves accuracy while reducing the number of parameters to increase detection speed.

  (2) The EB module is proposed to enhance attention to defect channel and spatial information, making full use of multi-scale feature information.

  (3) IN loss is proposed to improve the detection of small targets and effectively accelerate model convergence.

  (4) We evaluate the proposed PEI-YOLOv5 on a portion of the GuangDong Tianchi dataset and on the NEU surface defect database; the experimental results exceed those of current state-of-the-art object detection algorithms.

  (5) PEI-YOLOv5 was deployed on an NVIDIA Jetson TX2 and reached a detection speed of 31 FPS, exceeding the 30 FPS required for real-time detection.

Related work

Fabric defect detection

Currently, cloth defect detection algorithms are divided into two main categories: traditional algorithms and learning-based algorithms. Among traditional approaches, the authors of [19, 20] combined saliency image features with an improved anisotropic filter to extract fabric defect features, considering both the local gradient magnitude and a modified saliency map based on the original anisotropic diffusion model, which removes the background information of the fabric while preserving defect edges. Shi et al. [21] proposed a fabric defect detection method based on gradient low-rank decomposition and a structured graph algorithm, which first divides the defective image into non-defective and defective regions, prevents the merging of defective and non-defective regions by setting an adaptive threshold during the merging process, and finally uses the prior information of the segmentation results to guide the matrix decomposition. In addition, fabric defect detection has been achieved with statistical methods [22, 23], discrete cosine transforms [24, 25], and model-based approaches [26, 27].

With the development of computer science, learning-based approaches to fabric defect detection are becoming more and more popular; CNN-based methods in particular have attracted many researchers and achieved satisfactory results [28]. Wang et al. [29] proposed an adaptively fused attention module for the problem of detecting tiny fabric defects, which enhances the spatial and channel feature maps and the attentional information flow between them, enabling the detector to better capture inconspicuous small targets and achieving excellent performance on the MS-COCO and CioF datasets. However, this method reached only 12.19 FPS on an RTX 3090, which is not sufficient for real-time detection. Liu et al. [30] proposed a fabric defect detection framework based on generative adversarial networks, customizing a deep semantic segmentation network for detecting different defect types and training a multistage GAN to synthesize plausible defects in new defect-free samples. The method can continuously update the fabric defect dataset and tune the semantic segmentation network, with good performance on multiple datasets. Chen et al. [31] addressed fabric texture interference by fusing Gabor kernels into Faster R-CNN and training the model in two stages to effectively identify defects across various backgrounds and locations. Although the above methods have made breakthroughs in detection accuracy, they neglect detection speed. In actual factory production, the fabric moves quickly on the conveyor, which requires a detector fast enough to detect fabric defects accurately in real time.

Attention mechanism

In the complex and diverse scenes of real life, humans tend to focus on the part of the information they need and ignore other, unimportant information; this is how humans quickly obtain valuable information from large amounts of data with their limited processing power. Attention mechanisms were introduced into computer vision to mimic this aspect of the human visual system [32]. Attention mechanisms currently play a large role in computer vision and can greatly improve a model's ability to capture long-range information at low cost. Hu et al. [33] proposed channel attention and used it to build SENet, which is mainly composed of squeeze-and-excitation (SE) blocks; these can effectively acquire global information, model the relationships between channels, and adaptively recalibrate channel features with low computational complexity. However, the overly simple average pooling of the SE block, which only considers channel information and ignores location information, makes it difficult to obtain complex global information. To address this, Gao et al. [34] used a global second-order pooling (GSoP) block, which enhances access to global information to some extent but also adds considerable extra computation. Spatial attention adaptively selects the spatial regions that require attention, and RAM [35], STN [36], and GENet [37] have achieved good results with spatial attention mechanisms. The convolutional block attention module (CBAM) proposed by Woo et al. [38] combines channel attention with spatial attention and introduces global pooling to obtain global information, which better informs the network of the content and locations that need attention while remaining computationally efficient. CBAM tries to exploit location information by reducing the channel dimension of the input tensor and applying large-kernel convolution, but convolution can only capture local relationships and cannot model the long-range dependencies needed in vision tasks [39]. Therefore, Hou et al. [40] proposed coordinate attention, which fuses location information into channel attention so that the network can attend to larger important regions with little computation. The coordinate attention structure is very simple, can be flexibly inserted into classical networks with small computational cost, and has performed well on tasks such as classification [41, 42], object detection [43, 44], and semantic segmentation [45].

Multi-scale feature fusion

Convolutional neural networks have a layer-by-layer structure, and the deeper a feature map is, the larger its receptive field, so feature maps at different depths in the network constitute a natural multi-scale representation. However, this structure alone detects small targets poorly: the underlying reason is the significant semantic gap between feature maps of different depths, which have different representation capabilities. To address this drawback, Lin et al. [46] proposed feature pyramid networks (FPN), which fuse the fine-grained spatial information of shallow feature maps with the semantic information of deep feature maps to improve multi-scale target detection. However, the top-down, layer-by-layer fusion strategy of FPN is not optimal, so many improvements have been built on top of it. Liu et al. [47] proposed PANet, which adds a bottom-up feature fusion path starting from the lowest FPN layer to ensure that the prediction for each target fully exploits the information of all feature layers. Tan et al. [48] proposed BiFPN, based on PANet, to construct EfficientDet; BiFPN removes nodes with only one input, adds a connection between input and output feature maps of the same scale to fuse richer features, and stacks the pyramid structure multiple times to enhance feature expression. Figure 2 shows the structures of FPN, PANet, and BiFPN.

Fig. 2 Three kinds of feature pyramid structure diagram

Methods

In this section, the overall structure of the proposed fabric defect detection model PEI-YOLOv5 is first introduced, followed by detailed descriptions of PDConv, EB module and IN loss, respectively.

The network structure of PEI-YOLOv5

The YOLO series is a typical representative of one-stage CNN algorithms in image object detection. YOLOv5 adopts CSPDarknet53 as the backbone network and, through the SPPF module, fuses local and global features so that feature inputs of arbitrary size are converted into feature outputs of fixed size. Borrowing from PANet [50], a bottom-up path is added to FPN to better integrate semantic information at different scales. YOLOv5 offers a simple and powerful application programming interface (API), is easy to deploy and developer friendly, and is lightweight, fast, and accurate. However, YOLOv5 uses a large number of convolutions with a stride of 2 for downsampling, which inevitably loses semantic information and makes the network less effective at extracting and fusing defect features for fabric defect types such as Broken hole, Stain, Three threads, Flower board, and Pulp spot, whose sizes vary widely; this calls for a model that can effectively combine shallow and deep semantic information and features.

Among fabric defects there are very small targets, targets with extreme aspect ratios, and targets with large shape differences, which place high demands on the detector. Therefore, to achieve high accuracy and speed for fabric defect detection, this study proposes the PEI-YOLOv5 detection network. Figure 3 shows its network structure. The Backbone is mainly responsible for feature extraction; although it has strong feature extraction ability, its large number of parameters reduces detection efficiency. Combining PDConv with the last C3 module of the Backbone, i.e., the PDC3 module, not only effectively reduces the Backbone parameters but also improves detection accuracy. The Neck uses an FPN + PAN structure and mainly fuses the contextual semantic features extracted by the Backbone. In this study, the EB module is proposed, on the basis of BiFPN [51], to replace the original FPN + PAN structure. EB-2 has two inputs and uses the CARAFE operator for up-sampling to achieve information interaction at different scales and adapt to different scenes and data, while also reducing the dimensionality of the feature map and thus the computational effort. The up-sampled features are then concatenated with branches of the Backbone, and finally the CA module is used to enhance the acquisition of channel and spatial information. The main difference of EB-3 from EB-2 is that it has three inputs and does not require an up-sampling operation. The fused feature maps from the Neck are fed into the Head, which predicts the final result. The Head expands the number of channels of each multi-scale feature map with a 1 × 1 convolution, and the expanded number of channels is \(N \times (C + 5)\), where N is the number of anchors per detection layer, C is the number of categories to be identified, and 5 corresponds to the bounding-box information (x, y, w, h) plus the confidence P of the predicted target; for example, with N = 3 anchors and C = 5 defect classes, each detection layer outputs 3 × (5 + 5) = 30 channels. The localization loss of YOLOv5 is based on IOU, and we find that IOU is very sensitive to bounding-box offsets for small targets, which limits the detection of small defects such as Broken hole. Therefore, we propose IN loss, which integrates IOU and NWD, to enhance the recognition of small targets and also improve the network convergence speed.

Fig. 3 PEI-YOLOv5 overall network structure diagram

Design of PDConv

To design neural networks with low computational complexity and reduce the demands on computing hardware, many researchers have proposed effective methods [52]. But these "fast" neural networks are not actually fast enough: their reductions in floating point operations (FLOPs) do not translate into an equivalent reduction in latency and in some cases bring no improvement, or even worse latency [57]. In our study, we visualized the feature maps in the YOLOv5 backbone, as shown in Fig. 4, and found a high degree of similarity between different channels of the input feature map, i.e., high redundancy. Therefore, we propose PDConv. Figure 5a shows how PDConv works: it applies depthwise convolution for spatial feature extraction to only a part of the input channels and keeps the remaining channels unchanged. Usually, the first or last consecutive \(c_{d}\) channels are used for the computation as a representative of the whole input feature map. Assuming that the input feature map is \(c \times h \times w\), the convolution kernel size is \(k \times k\), and the numbers of input and output channels are kept the same, the FLOPs of PDConv are only:

$$w \times h \times k^{2} \times c_{d}$$
(1)
Fig. 4 The larger image in the upper left corner is the original input defect image; the other images are visualizations of different channels of the input feature map

Fig. 5 PDConv and PD block structure diagram

And the FLOPs for a normal Conv are:

$$w \times h \times k^{2} \times c^{2}$$
(2)

In this paper, we take \(c_{d} = \frac{c}{4}\).
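As a concrete illustration (our own worked example): with \(c = 256\) and \(c_{d} = c/4 = 64\), the ratio of Eq. (1) to Eq. (2) is \(c_{d}/c^{2} = 1/(4c) \approx 0.1\%\), so the partial depthwise convolution itself accounts for a negligible share of the FLOPs of a standard convolution; most of the remaining cost of the PD block comes from the \(1 \times 1\) convolutions described below.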

Note that the channels other than the \(c_{d}\) convolved channels cannot simply be removed, because doing so would degrade PDConv into a plain depthwise convolution; we keep the remaining channels because the subsequent convolutions ensure that feature information flows through all channels.

We construct the PD block based on PDConv. It consists of two paths: (1) a PDConv layer followed by two normal \(1 \times 1\) Convs, with normalization and activation layers inserted only between the two \(1 \times 1\) Convs, aiming to keep feature diversity while reducing latency; (2) a shortcut that adds the original input to the output of the first path, making full use of the extracted features and preserving the integrity of the information. The module structure is shown in Fig. 5b.
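To make the design concrete, the following is a minimal PyTorch sketch of PDConv and the PD block as described above. The module names (`PDConv`, `PDBlock`), the split ratio argument, and the width of the \(1 \times 1\) expansion are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PDConv(nn.Module):
    """Partial depthwise convolution: apply a k x k depthwise conv to the first
    c_d = channels // ratio channels and leave the remaining channels untouched."""
    def __init__(self, channels: int, kernel_size: int = 3, ratio: int = 4):
        super().__init__()
        self.dim_conv = channels // ratio          # c_d channels that are convolved
        self.dim_pass = channels - self.dim_conv   # channels passed through unchanged
        self.partial_conv = nn.Conv2d(
            self.dim_conv, self.dim_conv, kernel_size,
            padding=kernel_size // 2, groups=self.dim_conv, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_pass], dim=1)
        return torch.cat((self.partial_conv(x1), x2), dim=1)

class PDBlock(nn.Module):
    """PD block: PDConv followed by two 1 x 1 convolutions (BN + SiLU only between
    them) on one path, plus a shortcut that adds the original input."""
    def __init__(self, channels: int, hidden_ratio: float = 2.0):
        super().__init__()
        hidden = int(channels * hidden_ratio)      # width of the 1 x 1 expansion (assumed)
        self.pdconv = PDConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pw(self.pdconv(x))         # shortcut preserves the original information

if __name__ == "__main__":
    y = PDBlock(128)(torch.randn(1, 128, 40, 40))
    print(y.shape)  # torch.Size([1, 128, 40, 40])
```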

Finally, the PD block is combined with the C3 module of YOLOv5 to form the PDC3 module, whose structure is shown in Fig. 6. We replace the last C3 module in the backbone of YOLOv5 with a PDC3 module, reducing the number of parameters by about 44.1%. The comparison result is shown in Table 1.

Fig. 6 PDC3 module structure diagram

Table 1 Comparison of parameters between C3 and PDC3

EB module

High-level feature maps contain a deeper abstraction of the target and sufficient global information; their larger receptive field and stronger contextual semantic representation make target localization more accurate. Low-level feature maps, in contrast, have higher spatial resolution and capture finer details such as edges, contours, and textures, which helps to determine the target class accurately. To better fuse multi-scale features, we propose the EB module. Compared with EB-2, EB-3 adds one more input to fuse multi-scale features more effectively and requires no up-sampling operation; after the concat operation, the same attention mechanism is used to enhance the focus on defect information. The specific structure is shown in Fig. 7.

Fig. 7 EB structure diagram

Content-Aware ReAssembly of Features

The upsampling operator used in YOLOv5 is nearest neighbor interpolation, which guides the upsampling process by the spatial distance between pixels, and since only subpixel neighborhoods are considered, some semantic information is missing. Therefore, we use the Content-Aware ReAssembly of Features (CARAFE) [49] operator for upsampling operations in the EB-2 module. The CARAFE operator can achieve information interaction at different scales, adapt to different scenes and data, and at the same time can reduce the dimensionality of the feature map, thus reducing the computational effort.

The CARAFE operator consists of two components: the kernel prediction module and the feature reassembly module. For an input feature map of shape \(C \times H \times W\), the former first compresses the channels to \(C_{m}\) to reduce computation, then generates reassembly kernels through an encoder with \(k_{encoder}^{2} \times C_{m} \times C_{up}\) parameters, and finally normalizes each reassembly kernel; the latter reassembles the features within the local region covered by each kernel in a content-aware manner. The CARAFE operator thus achieves multi-scale information interaction, enhancing accuracy and robustness, while also reducing the dimensionality of the feature map, decreasing computation, speeding up network training, and reducing inference time.
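For reference, below is a compact PyTorch sketch of the CARAFE idea (kernel prediction followed by content-aware reassembly). The hyperparameters `c_mid`, `k_encoder`, and `k_up`, the module name, and the use of nearest-neighbour replication to align source neighbourhoods with output positions are illustrative assumptions; the paper uses the original CARAFE operator [49] rather than this simplified re-implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Simplified content-aware upsampler: predict a reassembly kernel for every
    output position, then reassemble the input features with it."""
    def __init__(self, channels: int, scale: int = 2, c_mid: int = 64,
                 k_encoder: int = 3, k_up: int = 5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)          # channel compressor
        self.encoder = nn.Conv2d(c_mid, scale ** 2 * k_up ** 2,
                                 k_encoder, padding=k_encoder // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # kernel prediction: encode content, expand to the upsampled resolution,
        # and normalize each k_up x k_up reassembly kernel
        kernels = self.encoder(self.compress(x))                # (n, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, self.scale)          # (n, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)
        # feature reassembly: gather each k_up x k_up neighbourhood of x ...
        neigh = F.unfold(x, self.k_up, padding=self.k_up // 2)  # (n, c*k^2, h*w)
        neigh = neigh.view(n, c * self.k_up ** 2, h, w)
        neigh = F.interpolate(neigh, scale_factor=self.scale, mode="nearest")
        neigh = neigh.view(n, c, self.k_up ** 2, self.scale * h, self.scale * w)
        # ... and take the content-aware weighted sum over the neighbourhood
        return (neigh * kernels.unsqueeze(1)).sum(dim=2)
```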

Coordinate attention and skip layer connection

The attention mechanism enhances the model's ability to focus on key information and ignore irrelevant information, which has brought good performance gains to deep neural networks. In fabric defect detection, Stain defects are similar in color to the background and difficult even for human eyes to distinguish; an attention mechanism that emphasizes the features of defect regions can attenuate the interference of the background and improve the network's ability to detect such defects. Coordinate attention enables the network to capture information over a larger region while avoiding large computational overhead, enhancing the acquisition of channel, spatial, and global information. In our EB module, the number of channels increases after the concat operation on different feature maps, and these channels carry semantic information from different levels of the input; adding a CA module after each concat therefore makes effective use of channel and spatial information to strengthen the attention to defect locations.

Figure 8 shows the network structure of the CA module. To obtain attention along the width and height of the feature map and encode precise positions, the input feature map is average-pooled along the x and y directions, yielding the direction-aware feature maps \(z_{c}^{h} (h)\) and \(z_{c}^{w} (w)\), respectively.

$$z_{c}^{h} (h) = \frac{1}{W}\sum\limits_{0 \le i < W} {x_{c} (h,i)}$$
(3)
$$z_{c}^{w} (w) = \frac{1}{H}\sum\limits_{0 \le j < H} {x_{c} (j,w)}$$
(4)
Fig. 8 Coordinate attention module structure diagram

After concatenating the two feature maps, the number of channels is reduced with a convolution module, the spatial information is encoded by BN and the non-linear activation function \(\delta\), and the feature map \(f\) is obtained.

$$f = \delta (F_{1} ([z^{h} ,z^{w} ]))$$
(5)

Then \(f\) is split into two independent feature maps \(f^{h}\) and \(f^{w}\) whose dimensions are consistent with the input feature map. Two further \(1 \times 1\) convolutions and sigmoid functions are used for feature transformation to obtain \(g^{h}\) and \(g^{w}\), respectively.

$$g^{h} = \sigma (F_{h} (f^{h} ))$$
(6)
$$g^{w} = \sigma (F_{w} (f^{w} ))$$
(7)

Finally, the results of each part are combined to obtain the output of the CA module.

$$y_{c} (i,j) = x_{c} (i,j) \times g_{c}^{h} (i) \times g_{c}^{w} (j)$$
(8)
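The following is a minimal PyTorch sketch of the coordinate attention block following Eqs. (3)–(8). The reduction ratio and the hard-swish activation are assumptions taken from the original coordinate attention design, not values reported in this paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: pool along H and W separately, encode the two
    direction-aware maps jointly, then re-weight the input (Eqs. (3)-(8))."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish(inplace=True)                  # non-linear activation (delta)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # Eq. (3): (n, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)      # Eq. (4): (n, c, w, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # Eq. (5)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                  # Eq. (6): (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.transpose(2, 3)))  # Eq. (7): (n, c, 1, w)
        return x * g_h * g_w                                   # Eq. (8)
```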

Since the maximum pooling of SPPF loses much semantic information and we are more concerned with defect location information, a new skip layer connection (red dashed line) is added before F4 in Fig. 7. This enhances feature fusion while retaining the deep location semantic information.
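Putting the pieces together, the sketch below illustrates how a two-input EB-2 block could be wired, reusing the `CARAFE` and `CoordinateAttention` sketches above. The channel widths and the \(1 \times 1\) fusion convolution (a stand-in for the Neck's C3-style fusion in YOLOv5) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
# assumes CARAFE and CoordinateAttention from the previous sketches are importable

class EB2(nn.Module):
    """Sketch of the two-input EB-2 block: upsample the deeper feature map with
    CARAFE, concatenate it with the backbone branch, fuse, and apply coordinate
    attention to emphasise defect channels and locations."""
    def __init__(self, c_deep: int, c_skip: int, c_out: int):
        super().__init__()
        self.up = CARAFE(c_deep)                         # content-aware upsampling
        self.fuse = nn.Sequential(
            nn.Conv2d(c_deep + c_skip, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(inplace=True),
        )
        self.attn = CoordinateAttention(c_out)

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.up(deep), skip], dim=1)      # multi-scale concatenation
        return self.attn(self.fuse(x))

# EB-3 (not shown) takes a third, same-resolution input and therefore needs no upsampling.
```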

IN loss function

The loss function measures the distance between the predicted output and the expected output of the model: the closer the two are, the smaller the loss value. The loss function of YOLOv5 is composed of position loss, confidence loss, and classification loss. In this paper, the IOU-based position loss is improved.

Figure 9 shows the visualization result after normalizing all the defective ground truth in the dataset. In the lower left corner of the figure, we can see that there are obvious dark areas, which indicates that there are more small targets in the dataset. The IOU is very sensitive to the offset of the small target bbox, as shown in Fig. 10. Specifically, for a small target with pixel size of \(5 \times 5\), a few pixels of position deviation will lead to a significant drop in IOU (from 0.39 to 0.03). And for larger size targets of \(40 \times 40\) pixels, the variation of IOU is small (from 0.91 to 0.75) for the same position deviation. Therefore, we propose the use of Normalized Wasserstein Distance (NWD) [58] in combination with IOU to improve the loss function.

Fig. 9 Visualization of the results after normalizing all ground truth in the dataset

Fig. 10 Sensitivity analysis of CIOU and IN for detecting tiny objects. Each grid cell represents one pixel; box A denotes the ground truth bounding box; boxes B and C denote predicted bounding boxes with 1-pixel and 4-pixel diagonal deviation, respectively

Since a conventional bbox is represented by a rectangle, its IOU is mainly concerned with the overlap between boxes, which is not well suited to small targets; for small targets, localization, i.e., the position of the target center, matters more. In a 2D Gaussian distribution the weight is highest at the center and decreases gradually toward the sides, so fitting the bbox with a 2D Gaussian distribution better matches what small targets require of a bbox. The fitted two-dimensional Gaussian distribution \(N(\mu ,\sum )\) has:

$$\mu = \left[ {\begin{array}{*{20}c} {c_{x} } \\ {c_{y} } \\ \end{array} } \right],\quad \sum = \left[ {\begin{array}{*{20}c} {\frac{{w^{2} }}{4}} & 0 \\ 0 & {\frac{{h^{2} }}{4}} \\ \end{array} } \right]$$
(9)

where \(c_{x}\), \(c_{y}\) are the center coordinates of the bbox and \(w\), \(h\) are its width and height. After the two-dimensional Gaussian modeling of the bbox, the predicted box and the ground truth box are modeled as distributions \(A = N(\mu_{1} ,\sum_{1} )\) and \(B = N(\mu_{2} ,\sum_{2} )\), respectively. The second-order Wasserstein distance between A and B is defined as:

$$W_{2}^{2} (A,B) = ||\mu_{1} - \mu_{2} ||_{2}^{2} + ||\sum_{1}^{1/2} - \sum_{2}^{1/2} ||_{F}^{2}$$
(10)

where \(\|\cdot\|_{F}\) denotes the Frobenius norm.

However,\(W_{2}^{2} (A,B)\) is a distance metric and cannot be used as a similarity metric. Therefore, it is normalized to obtain NWD, and the expression is shown in Eq. (11).

$$NWD(A,B) = \exp \left( { - \frac{{\sqrt {W_{2}^{2} (A,B)} }}{c}} \right)$$
(11)

c is a constant, determined by the average size of the target in the dataset.

The calculation formula of CIOU is as follows:

$$CIOU = IOU - \frac{{\rho^{2} (b,b_{gt} )}}{{c_{d}^{2} }} - \alpha v$$
(12)
$$v = \frac{4}{{\pi^{2} }}\left( {\arctan \frac{{w^{gt} }}{{h^{gt} }} - \arctan \frac{w}{h}} \right)^{2}$$
(13)
$$\alpha = \frac{v}{1 - IOU + v}$$
(14)
$$IOU = \frac{{S_{1} }}{{S_{2} }}$$
(15)

\(\alpha\) is the weight function; \(v\) measures the similarity of the aspect ratios; \(w\) and \(w^{gt}\) are the widths of the bbox and the ground truth, respectively; \(h\) and \(h^{gt}\) are the heights of the bbox and the ground truth, respectively; \(b\) and \(b_{gt}\) are the center points of the bbox and the ground truth, respectively; \(\rho\) is the Euclidean distance between the two center points; \(c_{d}\) is the diagonal length of the smallest enclosing box that contains both the bbox and the ground truth; \(S_{1}\) and \(S_{2}\) are the intersection and union areas of the ground truth and the bounding box, respectively.

Why not simply replace CIOU with NWD? There are two main reasons: (1) NWD converges slowly and requires more training epochs, which increases training time; (2) NWD was proposed for small target detection, while our dataset also contains many normal-sized and larger targets, so combining the two gives better results. The improved loss function is calculated as:

$$IN = (1 - r_{i} )(1 - NWD(A,B)) + r_{i} (1 - CIOU)$$
(16)

where \(r_{i}\) is a proportional coefficient that weights the CIOU term; \(r_{i} = 0.6\) is used in this paper.
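A minimal PyTorch sketch of the IN loss of Eq. (16), built from the CIOU term of Eqs. (12)–(15) and the NWD term of Eqs. (9)–(11), is given below. Boxes are assumed to be in (cx, cy, w, h) format, the function names are ours, and the constant `c = 12.8` in `nwd` is only an illustrative default; in practice it is set from the average target size of the dataset, as stated above.

```python
import math
import torch

def ciou(box1: torch.Tensor, box2: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIOU of axis-aligned boxes given as (cx, cy, w, h), Eqs. (12)-(15)."""
    b1x1, b1y1 = box1[..., 0] - box1[..., 2] / 2, box1[..., 1] - box1[..., 3] / 2
    b1x2, b1y2 = box1[..., 0] + box1[..., 2] / 2, box1[..., 1] + box1[..., 3] / 2
    b2x1, b2y1 = box2[..., 0] - box2[..., 2] / 2, box2[..., 1] - box2[..., 3] / 2
    b2x2, b2y2 = box2[..., 0] + box2[..., 2] / 2, box2[..., 1] + box2[..., 3] / 2
    inter = (torch.min(b1x2, b2x2) - torch.max(b1x1, b2x1)).clamp(0) * \
            (torch.min(b1y2, b2y2) - torch.max(b1y1, b2y1)).clamp(0)
    union = box1[..., 2] * box1[..., 3] + box2[..., 2] * box2[..., 3] - inter + eps
    iou = inter / union                                            # Eq. (15)
    rho2 = (box1[..., 0] - box2[..., 0]) ** 2 + (box1[..., 1] - box2[..., 1]) ** 2
    cw = torch.max(b1x2, b2x2) - torch.min(b1x1, b2x1)             # enclosing box width
    ch = torch.max(b1y2, b2y2) - torch.min(b1y1, b2y1)             # enclosing box height
    c2 = cw ** 2 + ch ** 2 + eps                                   # squared diagonal c_d^2
    v = (4 / math.pi ** 2) * (torch.atan(box2[..., 2] / box2[..., 3]) -
                              torch.atan(box1[..., 2] / box1[..., 3])) ** 2  # Eq. (13)
    alpha = v / (1 - iou + v + eps)                                # Eq. (14)
    return iou - rho2 / c2 - alpha * v                             # Eq. (12)

def nwd(box1: torch.Tensor, box2: torch.Tensor, c: float = 12.8) -> torch.Tensor:
    """Normalized Wasserstein distance between the 2D Gaussians fitted to two
    boxes (cx, cy, w, h), Eqs. (9)-(11); c depends on the dataset."""
    dist2 = (box1[..., 0] - box2[..., 0]) ** 2 + (box1[..., 1] - box2[..., 1]) ** 2 + \
            ((box1[..., 2] - box2[..., 2]) ** 2 + (box1[..., 3] - box2[..., 3]) ** 2) / 4
    return torch.exp(-torch.sqrt(dist2) / c)

def in_loss(pred: torch.Tensor, target: torch.Tensor, r_i: float = 0.6) -> torch.Tensor:
    """IN loss, Eq. (16): a weighted combination of the NWD and CIOU terms."""
    return (1 - r_i) * (1 - nwd(pred, target)) + r_i * (1 - ciou(pred, target))
```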

As the comparison in Fig. 10 shows, IN is considerably less sensitive to small-target offsets than CIOU: for the same deviation, IN drops from 0.591 to 0.3022, a difference of 0.2888, whereas CIOU drops by 0.363. IN loss therefore effectively alleviates the sensitivity of CIOU to small-target offsets and enhances the detection of small defects. We provide pseudo-code for PEI-YOLOv5 in Algorithm 1.

Algorithm 1 The pseudo-code of the PEI-YOLOv5 model

Experiments

Experimental environment and dataset

The experimental platform in this study consists of two parts: (1) a Windows host, on which PEI-YOLOv5 is trained; (2) an NVIDIA Jetson TX2, to which the trained network is deployed for practical testing of fabric defect detection. The environment and training parameters are listed in Tables 2 and 3, respectively. Figure 11 shows the fabric defect detection device we used; it mainly consists of a CCD industrial camera, an HD lens, the NVIDIA Jetson TX2, the fabric to be tested, and a standard light source. We selected five common types of defects from the defective images, namely Broken hole, Stain, Three threads, Flower board, and Pulp spot. In total, 1637 images were selected; each image has a resolution of \(2446 \times 1000\) and is annotated with the specific location and type of defect.

Table 2 Training hardware and software
Table 3 Training parameters
Fig. 11 Fabric defect detection device

Data augmentation

Since the numbers of the various defect types in the original dataset were very unevenly distributed, with a very large number of Three threads and few of the other defects, we expanded the images other than those containing Three threads by random flipping and scaling, obtaining 3253 images and improving the class imbalance to some extent; a sketch of this expansion step follows below. Table 4 shows the number of each defect category before and after dataset augmentation, together with the totals. Although the number of Flower board defects is relatively small compared with the other four categories, testing showed that the AP of this category already reached 95.85%, so no further expansion was needed. The expanded dataset is divided into training, validation, and test sets in the ratio 8:1:1; the numbers of images after the split are given in Table 5. As shown in Table 6, we also report the number of each defect in the training, validation, and test sets.
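As an illustration of the offline expansion step, the sketch below applies random horizontal/vertical flipping and random scaling to an image and its bounding boxes. The flip probabilities and the 0.8–1.2 scale range are assumptions, since the exact augmentation parameters are not specified in the paper.

```python
import random
from PIL import Image

def augment(image: Image.Image, boxes):
    """Randomly flip and scale an image and its (xmin, ymin, xmax, ymax) boxes."""
    w, h = image.size
    if random.random() < 0.5:                      # horizontal flip
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
        boxes = [(w - x2, y1, w - x1, y2) for (x1, y1, x2, y2) in boxes]
    if random.random() < 0.5:                      # vertical flip
        image = image.transpose(Image.FLIP_TOP_BOTTOM)
        boxes = [(x1, h - y2, x2, h - y1) for (x1, y1, x2, y2) in boxes]
    s = random.uniform(0.8, 1.2)                   # random scale (assumed range)
    image = image.resize((int(w * s), int(h * s)))
    boxes = [tuple(v * s for v in b) for b in boxes]
    return image, boxes
```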

Table 4 Comparison of the number of various defects before and after dataset expansion
Table 5 Number of training, validation, and test set statistics
Table 6 The number of defects in the training, validation, and test sets

Performance metrics

To verify the effectiveness of PEI-YOLOv5, the model is evaluated using precision, recall, F1, AP, mAP, FPS and FLOPs.

$$Precision = \frac{TP}{{TP + FP}}$$
(17)
$$recall = \frac{TP}{{TP + FN}}$$
(18)
$$F1 = 2 \times \frac{Precision \times Recall}{{Precision + Recall}}$$
(19)
$$AP = \int_{0}^{1} {Precision(Recall)d(Recall)}$$
(20)
$$mAP = \frac{1}{J}\sum\limits_{i = 1}^{J} {AP(i)}$$
(21)

TP is the number of positive samples predicted by the model as positive, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative. Precision is the proportion of correct predictions among the samples predicted as positive. Recall is the proportion of positive samples that are correctly identified among all positive samples. AP is the average precision of one defect category over different recall rates, and mAP is the mean AP over all defect categories, where J is the number of categories. FPS is the number of images the model can process per second; the larger its value, the faster the detection. FLOPs stands for floating point operations; its value is proportional to the complexity of the model.
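For clarity, a short sketch of how Eqs. (17)–(21) can be computed from detection results is given below. The all-point interpolation used for AP here is the common VOC/COCO-style convention and is an assumption about the exact evaluation protocol; the function names are ours.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Eqs. (17)-(19) from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Eq. (20): area under the precision-recall curve, with recalls sorted in
    increasing order (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # precision envelope, non-increasing
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))  # AP = integral of P(R) dR

# mAP (Eq. (21)) is simply the mean of the per-class AP values.
```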

Fig. 12 Comparison chart of three kinds of loss

Loss convergence speed comparison

In "IN Loss function", we mentioned that changing IOU loss to NWD loss slows down the network convergence, while IN loss reduces the loss value and improves the network convergence speed. Figure 12 shows the comparison of loss curves for IOU loss (blue), NWD Loss only (orange) and IN loss (green) using YOLOv5. When using only NWD loss compared to Base, the two loss values are basically the same until about 20 epochs, and after epochs greater than 20 the NWD loss value is significantly larger than Base. The initial loss value of IN loss is significantly lower compared to the other two methods, and the loss value is much lower than Base and NWD throughout the training process, which makes the prediction value of the network closer to the true value and has better prediction ability.

Fabric defect recognition by attention mechanism

In the EB module, the defect images pass through the CA module, which strengthens the model's focus on defect locations. Figure 13 visualizes fabric defect images under the attention mechanism: the left side shows the original images with ground truth labels, and the right side shows the corresponding attention maps, where redder regions indicate stronger model attention. Guided by the attention mechanism, the EB module not only learns salient features from defective samples but also concentrates on the defective parts of the image while ignoring irrelevant regions, performing feature fusion more effectively; in particular, it enhances the detection of Stain defects, whose color is similar to the background, and of Three threads defects with extreme aspect ratios.

Fig. 13 Visualization of fabric defects based on attention mechanism

Comparison with other advanced methods

To further verify the effectiveness of PEI-YOLOv5, we compared it with Faster R-CNN, SSD, YOLOv3, YOLOv5n, YOLOX-tiny, YOLOv7-tiny, and YOLOv8n. Table 7 reports the AP values of the five fabric defect types, mAP@0.5, and GFLOPs for each method. As the table shows, the proposed PEI-YOLOv5 outperforms the other seven methods in both accuracy and computation. The classic two-stage method Faster R-CNN achieved only 43.3% mAP, because its original anchor sizes differ greatly from the ground truth of this dataset, especially for the very slender Three threads category, whose ground truth has an extreme aspect ratio; on it, Faster R-CNN reaches an AP of only 3.95%. SSD and YOLOv3, both one-stage detectors, obtained only 72.95% and 75.09% mAP with high GFLOPs. YOLOX-tiny and YOLOv7-tiny have similar GFLOPs and obtained 80.34% and 82.36% mAP, respectively, with fewer parameters and better accuracy than SSD and YOLOv3. YOLOv8n, the most recent of the compared methods, achieved 82.78% mAP with further reduced GFLOPs and the highest AP of 91.24% on the Broken hole category. Our method improves mAP by 3.61% over YOLOv5n, to 87.89%, with an increase of only 0.2 GFLOPs, and the improvement on the Stain category reaches 7.99%. PEI-YOLOv5 thus achieves the best overall performance in the comparison and the highest accuracy on all four categories other than Broken hole.

Table 7 Comparison of related methods on Tianchi fabric defect detection dataset

We select one image from each defect category and show the detection results of the eight compared models in Fig. 14, where the white number in the lower right corner of each image is the score predicted by the model for that category. When the eight methods detect a defect, the predicted category is correct; however, apart from YOLOv7-tiny, YOLOX-tiny, and PEI-YOLOv5, the other methods miss some defects. SSD and Faster R-CNN both split a single ground-truth box into two predicted boxes for the Flower board and Pulp spot categories. Our PEI-YOLOv5 not only detects the defects and identifies their categories correctly but also does so with high accuracy, demonstrating its good performance.

Fig. 14 Detection results of five categories of defects on different models (a) Three threads; (b) Broken hole; (c) Stain; (d) Flower board; (e) Pulp spot

Ablation experiments

Our PEI-YOLOv5 makes three improvements to YOLOv5, and ablation experiments were performed to verify the effectiveness of each improvement alone, of the pairwise combinations, and of all three combined, as shown in Table 8. Base is the original YOLOv5n with 84.28% mAP and 4.1 GFLOPs. Adding PDConv reduces the parameters by about 7.4% and the GFLOPs by 0.1; it obtains the highest FPS on both the 3060Ti and the Jetson TX2 and increases mAP by 1.05% while improving speed, although recall decreases slightly. Among the single improvements, the EB module brings the largest mAP gain, reaching 86.09%, 1.81% higher than Base, at the cost of a small increase in computation and parameters: the enhanced attention to spatial and channel feature maps and the fusion of information at different scales effectively improve fabric defect detection. We also compared the EB module with the original BiFPN, which, despite adding a certain number of parameters, only matches the mAP of PDConv and still lags behind the EB module, so the original BiFPN alone does not bring satisfactory results. Adding PDConv to the EB module improves mAP by a further 0.82% and reduces the parameters by about 4.9%, indicating that the two complement each other in improving fabric defect detection. IN loss achieves a 1.19% mAP improvement without any additional computational burden, confirming that combining IOU loss and NWD loss is very effective for fabric defect detection. Using all three methods together yields the best mAP, Recall, and F1, and the 31 FPS achieved on the NVIDIA Jetson TX2 exceeds 30 FPS, meeting the requirement for real-time detection. Overall, the GFLOPs of PEI-YOLOv5 increase by only 0.2 while mAP increases by 3.61%, which not only meets the demand for real-time detection but also greatly improves fabric defect detection capability.

Table 8 Results of ablation experiments
Table 9 Comparison of related methods on NEU surface defect database

Experimental results of another dataset

To verify the detection performance of PEI-YOLOv5 on other tasks, we used the NEU surface defect database [59]. This hot-rolled strip surface defect dataset contains 1800 defect images in six defect categories, and we again split the data in the ratio 8:1:1. The performance of YOLOv5 and PEI-YOLOv5 is shown in Fig. 15; our method outperforms YOLOv5 in the prediction accuracy of every category. Table 9 compares the mAP of our method with that of seven other networks. PEI-YOLOv5 shows the best detection results, outperforming the current advanced YOLOv8, which shows that our method performs well not only for fabric defect detection but also on other datasets.

Fig. 15 Comparative experimental results on NEU surface defect database

Conclusion and discussion

A PEI-YOLOv5 defect detection method is proposed for fabric defects with colors similar to the background texture, large differences in defect size and aspect ratio, and many small targets. It significantly improves the mAP of fabric defect detection without significantly increasing computational complexity. First, the backbone is modified with the PD block: PDConv enhances spatial feature extraction and effectively reduces the number of parameters, improving detection speed while increasing mAP by 1.05%. Second, the Neck is modified with the EB module, which fully combines spatial and channel information to realize the interaction of contextual semantic information, makes the network focus more on defect targets, effectively enhances the fusion of multi-scale semantic information, and improves mAP by 1.81%. Third, we improve the loss function and propose IN loss, which re-evaluates the localization loss using the normalized Wasserstein distance together with CIOU, improves the recognition of small defects, significantly accelerates model convergence, and improves mAP by 1.19% without adding any parameters or computation. In experimental comparisons on the Tianchi dataset and the NEU surface defect database, the mAP reached 87.89% and 79.37%, respectively, better than current advanced object detection methods, proving the effectiveness of PEI-YOLOv5. Finally, we deployed PEI-YOLOv5 on an NVIDIA Jetson TX2 and achieved a detection speed of 31 FPS, which meets the requirement of real-time factory inspection and further demonstrates its practicality.

Although PEI-YOLOv5 effectively enhances network performance with only a small increase in computational complexity, the network depth and number of channels are limited by the lightweight design, which weakens the detection of the Broken hole, Stain, and Three threads defect types, and of Three threads in particular. As seen in Fig. 14a, Three threads is a defect with very small width and large height, and such extreme aspect ratios are difficult for the detector to handle. Future research will investigate how to improve detection precision for tiny defects, extreme aspect ratios, and defects whose color approximates the background while keeping the model lightweight.