Introduction

Industrial automation and control remains an active research area. For example, Song et al. [1] studied the finite-time prescribed performance (FTPP) control problem, Tao et al. [2] studied unsupervised cross-domain fault diagnosis methods, Stojanovic [3] studied the fault-tolerant control problem for hydraulic servo actuators in case of actuator failure, Song et al. [4] studied adaptive neural finite-time elastic dynamic surface control (DSC) strategies for nonlinear fractional-order large-scale systems (FOLSS), and Xin et al. [5] optimized the single-stage YOLOv4 model. These studies cover nonlinear systems, fault diagnosis, and surface defect detection, and they show that robotic and industrial control systems in practical industrial applications have attracted extensive interest from scholars.

The use of automated equipment for defect detection in the manufacturing process of industrial products has become a key part of the quality inspection phase. Surface defects in industrial products are complex and varied, and defect shapes, detection scenarios and hardware configurations all differ. High demands are therefore placed on the detection of surface defects in industrial products, especially tiny defects.

In recent years, the powerful feature extraction capability of deep convolutional neural networks (CNNs) has been proven effective in many scenarios, and some excellent object detectors have emerged, such as the one-stage YOLO series [6] and the two-stage R-CNN series [7]. Because two-stage methods are comparatively inefficient, one-stage detection methods have been favoured in industrial scenarios.

The defect detection task for industrial products differs from the detection task in general scenarios. A lot of research and improvements have been made on the existing one-stage detection methods. For example, Zhang et al. [8] built a network for track defect detection that makes full use of contextual information as well as attention mechanisms to optimize detection performance. Su et al. [9] proposed a multi-headed cosine nonlocal attention module and embedded it into an FPN [10] to achieve better results.

Small object detection has important practical significance, for example in surface defect detection in industrial production environments. In recent years, in view of labor costs and the errors of manual visual inspection, various industrial fields have introduced machine vision-based surface defect detection equipment, and the rapid development of machine vision has brought higher accuracy and efficiency to automatic inspection equipment. In practical industrial applications, small defects on product surfaces are not rare. However, existing algorithms are effective mainly for defects that are large or that occupy a large proportion of the image, and their performance on small defects is far from satisfactory. Small object detection has many needs and application scenarios in the industrial field, so this study has high research significance and application value.

However, existing work on tiny defect detection mainly focuses on attention mechanisms and multi-scale feature fusion methods, while research on downsampling methods in the backbone network is insufficient, and tiny defects in images of industrial products are easily lost during downsampling by stepwise convolution. At the same time, shallow features have higher resolution and are more conducive to the detection of tiny defects, while deep features, with their large receptive field, are rich in semantic information. There is a lack of research on pre-filtering conflicts between deep and shallow features and then letting the result interact with shallow features to guide tiny defect detection.

Based on the above observation, this paper proposes a defect detection network (TD-Net) for tiny targets of industrial products to improve the effectiveness of tiny defect detection. The main contributions of this paper are listed as follows:

  1.

    TD-Net is proposed to improve the detection of tiny defects in industrial products and to reduce the missed detections of tiny defects that occur in the quality inspection process of industrial manufacturing. TD-Net uses YOLOv5 [11] as the baseline model.

  2.

    The defect downsampling (DD) module is proposed. Downsampling by max pooling, average pooling or stepwise convolution loses information about tiny defects. The DD module replaces the stepwise convolution that performs downsampling in the backbone network and compensates for the defect information lost during downsampling, so that the network can better detect tiny defects.

  3.

    The Semantic Information Interaction Module (SIIM) is proposed. First, channel attention is applied to the deep semantic features (B5), which have the largest number of channels, and coordinate attention [12] is applied to the shallow features (B3), which have high resolution. Second, cascade fusion is performed between the multi-scale outputs of the backbone (B3, B4, B5). Finally, the optimized fused feature fully interacts with the shallow (B3) feature to improve TD-Net’s ability to detect tiny defects. This alleviates the problem of missed detection of tiny defects.

  4.

    The Scale Information Fusion Module (SIFM) is proposed. First, SIFM pre-filters conflicts between features by summing and concatenating multi-scale information in cascade. Then, SIFM uses an attention mechanism to emphasize useful features and suppress interfering features in industrial product images. The SIFM module replaces the Concat operation in the PANet [13] and improves the defect detection performance with a small increase in the number of parameters.

Related work

Traditional detection method

Chen et al. [14] proposed a strip steel surface defect detection method based on a spectral residual visual attention model. The method uses an improved filtering algorithm to enhance the difference between the defective target and the background and, at the same time, constructs a saliency map of the defective target from the spectral residual information. However, it requires manual feature extraction, cannot be built quickly as an end-to-end system, and lacks generalization ability across a wider range of industrial defect detection problems.

Medina et al. [15] proposed a rotation-invariant Gabor filter, which tries to solve the problem of detecting defects in different directions, but its method has long runtime and response time and is not suitable for large-scale real-time defect detection.

Meanwhile, Liu et al. [16] proposed an improved multiscale block local binary pattern (MB-LBP) method, and Cao et al. [17] introduced feature vectors to describe the defect detection problem.

All of these are traditional defect detection methods, and their ideas underlie many object detectors. However, traditional methods generally struggle with massive data, incur large time overhead, perform poorly, require cumbersome manual feature extraction and lack robustness, so they increasingly fail to satisfy the defect detection needs of large-scale industrial practice. In recent years, a lot of research has therefore been carried out on industrial defect detection based on deep convolutional neural networks.

Deep learning detection method

In recent years, with the rapid development of deep learning in the field of computer vision, more and more research has used deep learning-based methods to detect industrial defects. Modern detectors usually consist of two parts: a backbone for extracting features and a head for predicting categories and bounding boxes. The most representative two-stage detectors are the R-CNN series, including Fast R-CNN, Faster R-CNN [18] and R-FCN [19]. The most representative one-stage object detectors are the YOLO series [20,21,22,23,24]. In recent years, anchor-free detectors such as CenterNet [25] have also gained momentum. However, general-purpose object detection and industrial defect detection involve different scenarios, so general-purpose object detectors cannot achieve optimal performance in industrial scenarios. Therefore, more and more scholars have improved general-purpose object detectors to enhance their detection capability in industrial defect detection scenarios.

Tu et al. [26] proposed an improved YOLOv3-based surface defect detection method for sawn material that uses CIoU Loss instead of IoU Loss, but it lacks algorithm modules designed for the actual industrial defect detection task itself.

Zhou et al. [27] proposed DACNet to detect strip steel surface defects, Dong et al. [28] used pyramid feature fusion to enhance defect detection, Wang et al. [29] proposed a new pyramid feature fusion module, Yu et al. [30] replaced the feature pyramid network (FPN) in the neck with a bi-directional feature fusion network (BFFN), and Zeng et al. [31] made full use of contextual information to detect tiny defects in PCBs. Many studies have thus used features of different semantic depths and contextual information to improve the performance of industrial defect detection models. However, these methods fuse features of different semantic depths without first pre-filtering the conflicting information between them, which limits further improvement of model performance.

The above deep learning-based industrial defect detectors have improved the defect detection capability of general-purpose detectors by various means, such as new fusion methods. However, most existing studies have focused on attention mechanisms and multi-scale feature fusion methods, and research on downsampling methods in backbone networks is insufficient. Meanwhile, shallow features have higher resolution and are more conducive to the detection of tiny defects, while deep features, with their large receptive field, are rich in semantic information. There is a lack of research on pre-filtering conflicts between deep and shallow features and then letting the result interact with shallow features to guide tiny defect detection.

Proposed network

In this section, we introduce the proposed TD-Net in detail. Its network architecture is shown in Fig. 1. TD-Net is a one-stage detection method, and its network structure is divided into three parts: backbone, neck and detection head. The proposed DD in the backbone reduces the information loss caused by stepwise convolution during downsampling, SIFM restructures the fusion of multi-scale features in the neck, and SIIM in the neck fuses the deep semantic features of the backbone to guide the shallow features.

Specifically, for the input image F, the output features of the last three feature extraction stages of the backbone are denoted as

$$\begin{aligned} B^F=\left\{ B3,B4,B5 \right\} \end{aligned}$$
(1)

The output of SIIM and the output of the neck are denoted, respectively, as

$$\begin{aligned} S^F= & {} \left\{ S3,S4,S5 \right\} \end{aligned}$$
(2)
$$\begin{aligned} N^F= & {} \left\{ N3,N4,N5 \right\} \end{aligned}$$
(3)

Fig. 1 Overall architecture of TD-Net, which is divided into three parts: backbone, neck and head. First, the DD and CSP structures are used as the backbone for feature extraction, and the three scales of features B3, B4 and B5 output by the backbone are input to the SIIM for conflict pre-filtering. After that, the three scales of features are input to the PANet improved by SIFM for fusion. Finally, the fused features are used for multi-scale prediction.

Defect downsampling (DD)

For discrimination tasks such as defect classification and defect detection, most current convolutional neural network architectures use downsampling layers to reduce the spatial size of the feature map, for example the widely used max pooling layer, average pooling layer and convolutional layers with step size (stride) larger than 1. However, sliding windows with step size larger than 1 may fail to preserve the fine details that are crucial for defect detection tasks. The effect is particularly significant for tiny defects and can lead to missed detections.

Fig. 2 Implementation of DD architecture

To solve the above problems and enable the network to better detect tiny defects, this paper proposes the defect downsampling method (DD). The architecture of DD is shown in Fig. 2 and consists of two parts. The first is the original convolutional layer with step size greater than 1 for initial downsampling, where Conv23 is a convolution with kernel size 3 and step size 2, BN is the batch normalization, and Activation is the SiLU activation function. The second is the defect pooling layer (DPL), which supplements the convolutional layer with step size greater than 1 with defect information. The architecture of the DPL is shown in Fig. 3.

Fig. 3 Implementation of DPL architecture

The DPL architecture is divided into five processing steps, and the first step is Split(S). As shown in Fig. 4, the signal X is divided into four disjoint sets x0, x1, x2 and x3, which are closely related.

Fig. 4 Implementation of split operation

The second step is Extract (E). The set x0 becomes E(x0) after the operation E, which is defined as

$$\begin{aligned} E(x0)= & {} Conv(k=1,channel=C/r)\rightarrow BN \nonumber \\{} & {} \rightarrow Conv(k=1,channel=C)\rightarrow SiLU \end{aligned}$$
(4)

where SiLU is the SiLU activation function, Conv is the convolution operation, BN is the batch normalization, C is the number of channels for a given image, and r is the scaling rate.

The third step is Minus (M). The three sets x1, x2 and x3 become M(x1), M(x2) and M(x3) after the operation M, which is defined as

$$\begin{aligned} M(x1)= & {} x1-E(x0) \end{aligned}$$
(5)
$$\begin{aligned} M(x2)= & {} x2-E(x0) \end{aligned}$$
(6)
$$\begin{aligned} M(x3)= & {} x3-E(x0) \end{aligned}$$
(7)

The fourth step is Concat (C). Given four sets E(x0), M(x1), M(x2) and M(x3), after the operation C, the four sets are combined into a large set C(x) with four times the number of channels, then the process of C(x) is defined as

$$\begin{aligned} C(x)= & {} Concat\left( E(x0),E\left( M(x1)\right) ,E\left( M(x2)\right) ,E\left( M(x3)\right) ,dim=1\right) \end{aligned}$$
(8)

where dim is the dimension and 1 is the channel dimension.

The fifth step is Fusion (F). After the F operation, the set C(x) becomes X’, the final output of the DPL structure. X’ is defined as

$$\begin{aligned} X^{'}= & {} Conv(k=1,channel=4C/r)\nonumber \\{} & {} \rightarrow BN\rightarrow Attention \!\rightarrow \! Conv(k\!=\!1,channel\!=\!C)\nonumber \\{} & {} \rightarrow SiLU \end{aligned}$$
(9)

Here, Attention is the attention mechanism. Reducing the number of channels after aggregating different features may lose some information, so we introduce the attention mechanism after the first channel-reducing convolution to compensate for the loss and confusion caused by channel reduction. The Attention used here is ECANet [32].
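The Split, Minus and Concat steps can be illustrated with a minimal pure-Python sketch on a single-channel map. This is not the authors' implementation. It assumes that Split takes the four interleaved sub-grids formed by the positions of each 2×2 block (the actual layout is defined by Fig. 4), it replaces the learned operation E by the identity, and all function names are hypothetical.

```python
def split(x):
    """Split a 2D map (list of rows, even height and width) into four
    disjoint sub-maps, one per position of each 2x2 block (assumed layout)."""
    x0 = [row[0::2] for row in x[0::2]]  # top-left of each 2x2 block
    x1 = [row[1::2] for row in x[0::2]]  # top-right
    x2 = [row[0::2] for row in x[1::2]]  # bottom-left
    x3 = [row[1::2] for row in x[1::2]]  # bottom-right
    return x0, x1, x2, x3


def extract(x):
    """Placeholder for the learned E operation of Eq. (4); identity here."""
    return [row[:] for row in x]


def minus(xi, e0):
    """M(xi) = xi - E(x0), element-wise, as in Eqs. (5)-(7)."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(xi, e0)]


def dpl_channels(x):
    """Return the four maps that the Concat step stacks along the channel axis."""
    x0, x1, x2, x3 = split(x)
    e0 = extract(x0)
    return [e0] + [extract(minus(xi, e0)) for xi in (x1, x2, x3)]


# A 4x4 map with a single-pixel "defect" at row 1, column 1.
image = [[0, 0, 0, 0],
         [0, 9, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
channels = dpl_channels(image)
```

In this toy input the defect pixel does not appear in `channels[0]`, which is all that plain subsampling of the even positions would keep, but it is preserved in `channels[3]`. Every map has half the height and width of the input, so the four stacked channels together retain all input positions.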

Scale information fusion module (SIFM)

In this section, we first introduce the proposed SIFM, and then detail how the SIFM changes the PANet so that it becomes the new neck structure of TD-Net.

SIFM

The structure of SIFM is shown in Fig. 5. Representing features and discriminating between tiny defects of industrial products, and between tiny defects and the background, is the key to detection. SIFM sums and concatenates multi-scale information in cascade to pre-filter conflicts between multi-level features, and then uses an attention mechanism to emphasize useful features and suppress interfering features in industrial product images.

As shown in Fig. 5, SIFM takes two feature maps, Y and Z, as input. C, H and W are the channels, height and width of the feature maps, respectively. First, the feature maps Y and Z undergo a summing operation and a concatenation operation to form M with 2C channels. Then, global maximum pooling and global average pooling are applied to M separately, and their results are concatenated to capture the global information in M. Note that this concatenation is performed along the H dimension, and the W dimension is compressed afterwards. Finally, the feature map carrying global information passes through two one-dimensional convolutions and a Sigmoid activation and is multiplied with the original feature map M to obtain the final output F of SIFM.

Fig. 5 Implementation of the SIFM structure

New fusion network

In this section, we present the new neck structure proposed in this paper. As shown in Fig. 6, the differences between our proposed new neck structure and the FPN and PANet structures can be clearly seen.

In this paper, we use YOLOv5 as the baseline and keep the PANet structure as the neck, but we integrate our SIFM into PANet, replacing all the simple Concat operations in the original PANet structure. SIFM pre-filters conflicts and attends to key features, achieving better fusion of multi-scale features.

Fig. 6 Implementation of FPN, PAN and TD-Net neck structure

Semantic information interaction module (SIIM)

The neck of a defect detector fuses features top-down and bottom-up, and each level gains better feature representation capability after interacting with information from other scales. However, this fusion approach has the drawback that the interaction between the non-adjacent layers B3 and B5 is insufficient. In addition, shallow feature maps such as B3 are more likely to capture tiny defects, whereas deeper feature maps such as B4 and B5 inevitably lose information about tiny objects during dimensionality reduction but carry stronger semantic information about defects. Therefore, if we let the deep semantic information of the B4 and B5 layers fully interact with the B3 layer and pre-filter the conflicts between the multi-level features of the neck so that features at all levels are fully integrated, we can improve the tiny defect detection capability of the whole detector.

For the above considerations, we propose the lightweight SIIM, whose structure is shown in Fig. 7.

Fig. 7 Implementation of the SIIM structure

The deep layer B5 has more channels, and the shallow layer B3 has higher resolution. SIIM therefore first applies ECANet to B5 to help it focus on channel information, and applies a coordinate attention mechanism to B3 to help it capture tiny defects. Meanwhile, the features of each layer have different semantic depths, and fusing them directly by methods such as concatenation causes feature misalignment. To address this problem, we sum the deep B5 feature map with the B4 feature map, and also sum the shallow B3 feature map with the B4 feature map, using this process to pre-filter conflicts, and then concatenate the two resulting feature maps. The integrated features are then further optimized using the modified lightweight BottleneckCSP. Finally, the obtained feature maps are further fused with B3 to guide the detection of tiny defects.

Specifically, in the modified BottleneckCSP, we reduce the number of channels of the hidden layer to 1/8 of the original one to make it lightweight.

The coordinate attention mechanism that SIIM introduces for B3 decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions, respectively. The coordinate attention step can be formulated as

$$\begin{aligned} M_{h}= & {} Sigmoid\left( W_{1}\left( h_{a} \right) \right) \end{aligned}$$
(10)
$$\begin{aligned} M_{w}= & {} Sigmoid\left( W_{1}\left( w_{a} \right) \right) \end{aligned}$$
(11)

where Sigmoid is the sigmoid activation function, \(W_{1}\) is the convolution with kernel size 1 and channel number C, \(h_{a}\) is the attention in the height direction, and \(w_{a}\) is the attention in the width direction. \(h_{a}\) and \(w_{a}\) can be formulated as

$$\begin{aligned} h_{a},w_{a}= & {} Concat\left( AvgPool_{h}\left( F \right) , AvgPool_{w}\left( F \right) \right) \nonumber \\{} & {} \rightarrow MLP\rightarrow Split \end{aligned}$$
(12)

where Split is the splitting operation, \(AvgPool_{h}\) is the global average pooling for compression along the height direction, and \(AvgPool_{w}\) is the global average pooling for compression along the width direction, which compress the feature map \(F\in R^{C\times H\times W}\) to the sizes \(C\times 1\times W\) and \(C\times H\times 1\), respectively. The MLP step can be formulated as follows:

$$\begin{aligned} MLP=W_{0}\left( F_{concat} \right) \rightarrow BN\rightarrow ReLU \end{aligned}$$
(13)

where \(W_{0}\) is the \(1\times 1\) convolution with C/r output channels, r is the reduction rate, and BN is the batch normalization. Finally, \(M_{h}\) and \(M_{w}\) are both multiplied with the input feature map \(F\in R^{C\times H\times W}\) to obtain the final generated features.
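The directional pooling and gating described above can be sketched in pure Python on a nested-list feature map of shape C×H×W. This is a simplified illustration only: the learned layers \(W_{0}\), BN, ReLU and \(W_{1}\), and with them the Concat, MLP and Split path of Eqs. (12) and (13), are replaced by the identity, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pool_over_height(f):
    """Average over H: C x H x W -> C x W (the C x 1 x W descriptor)."""
    return [[sum(col) / len(col) for col in zip(*channel)] for channel in f]


def pool_over_width(f):
    """Average over W: C x H x W -> C x H (the C x H x 1 descriptor)."""
    return [[sum(row) / len(row) for row in channel] for channel in f]


def coordinate_gate(f):
    """Scale every pixel by a per-row and a per-column weight.
    Assumption: the learned convolutions are replaced by the identity."""
    col_gate = [[sigmoid(v) for v in ch] for ch in pool_over_height(f)]
    row_gate = [[sigmoid(v) for v in ch] for ch in pool_over_width(f)]
    return [[[f[c][i][j] * row_gate[c][i] * col_gate[c][j]
              for j in range(len(f[c][i]))]
             for i in range(len(f[c]))]
            for c in range(len(f))]


feature = [[[1.0, 1.0, 1.0],
            [1.0, 1.0, 1.0]]]  # C = 1, H = 2, W = 3
gated = coordinate_gate(feature)
```

Because each output pixel is scaled by one weight indexed by its row and one indexed by its column, the gate can emphasize a specific location while using only two one-dimensional descriptors per channel.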

Experimental results and analysis

Experimental setup

We conducted experiments on the NEU-DET [33], GC10-DET [34] and Peking University PCB defect [35] data sets to verify the effectiveness of the proposed method. We implement our network in PyTorch and run the experiments on NVIDIA A100 GPUs. For TD-Net, the learning rate is 0.01, the optimizer is SGD, the learning rate decay strategy is cosine decay, the training image size is \(640\times 640\), and the batch size is 32. All models are trained for 500 epochs, and none of them use pre-trained weights.
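As a rough illustration of the cosine learning rate decay mentioned above, the following sketch computes the learning rate per epoch from the stated initial value of 0.01 over 500 epochs. The final learning rate is not given in the text, so `final_fraction` is a hypothetical parameter, set to 0 here, and the function name is likewise an assumption.

```python
import math


def cosine_lr(epoch, total_epochs=500, base_lr=0.01, final_fraction=0.0):
    """Learning rate at `epoch` under half-cosine decay.

    final_fraction is the assumed ratio of final to initial learning rate;
    it is not specified in the paper.
    """
    progress = epoch / total_epochs
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    scale = final_fraction + (1.0 - final_fraction) * cosine
    return base_lr * scale


# Learning rate at the start, the midpoint and the end of training.
schedule = [cosine_lr(e) for e in (0, 250, 500)]
```

Under these assumptions the rate starts at 0.01, passes through half of that at epoch 250 and reaches the final value at epoch 500, decaying slowly at both ends and fastest in the middle.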

NEU-DET

The NEU-DET data set is a defect classification data set. It covers six types of defects in hot-rolled steel sheets: crazing, inclusion, patches, pitted surface, rolled-in scales, and scratches. The data set has 300 images for each defect type, for a total of 1800 images. In this paper, 1260 images are selected as the training set and 540 as the test set, with ratios of 0.7 and 0.3, respectively. Sample images and annotations for each category in the NEU-DET data set are shown in Fig. 8.

Fig. 8 Sample images and annotations for each type of defect in the NEU-DET data set. (a) scratches defect (b) patches defect (c) crazing defect (d) inclusion defect (e) pitted surface defect (f) rolled-in scales defect

Fig. 9 Sample images and annotations of some categories in the GC10-DET data set. (a) punch defect (b) crease defect (c) oil spot defect

GC10-DET

The GC10-DET data set published by Lv et al. contains ten types of steel surface defects: punching, weld line, crescent gap, water spot, oil spot, silk spot, inclusion, rolled pit, crease and waist folding. In this paper, we use 2294 images and set the ratio of training set to test set to 8:2, so there are 1835 samples for training and 459 samples for testing. Sample images and annotations of some categories in the GC10-DET data set are shown in Fig. 9.

Peking University PCB

The Peking University PCB defect data set has six categories of defects: missing holes, mouse bites, open circuits, short circuits, spurs and spurious copper. It contains a total of 693 images, and we train and test in strict accordance with the original division of the data set into training and test sets. Sample images and annotations of some categories in the PCB defect data set are shown in Fig. 10.

Fig. 10 Sample images and annotations of some categories in the PCB defect data set of Peking University. (a) missing holes defect (b) mouse bite defect (c) spurious copper defect

Evaluation indicators

The performance of the model is evaluated with precision (P), recall (R), the F1 value and mAP@.5. The formulas for precision and recall are as follows:

$$\begin{aligned} P= & {} \frac{TP}{TP+FP} \end{aligned}$$
(14)
$$\begin{aligned} R= & {} \frac{TP}{TP+FN} \end{aligned}$$
(15)

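where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. The formula for calculating the F1 value is as follows: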

$$\begin{aligned} F1=2\frac{P\times R}{P+R} \end{aligned}$$
(16)

where P is the precision and R is the recall. The formula for calculating mAP is as follows:

$$\begin{aligned} mAP=\frac{ {\textstyle \sum _{n=1}^{N}} \int _{0}^{1} p\left( r \right) dr }{N} \end{aligned}$$
(17)

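where N is the number of defect categories and p(r) is the precision as a function of recall for the n-th category. Furthermore, mAP@.5 represents mAP with an IoU threshold of 0.5.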
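Eqs. (14) to (17) can be made concrete with a short sketch. The counts and precision-recall points below are made-up illustrative values, not results from this paper, and the integral in Eq. (17) is approximated with the trapezoidal rule, whereas evaluation toolkits typically use an interpolated precision curve. The function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (14)-(16) from true positive, false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the integral of p(r) over recall.

    Points must be sorted by increasing recall.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(curves):
    """Eq. (17): average the per-category AP over all N categories."""
    return sum(average_precision(r, p) for r, p in curves) / len(curves)


# Illustrative values only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
curves = [
    ([0.0, 0.5, 1.0], [1.0, 1.0, 0.5]),  # category 1: AP = 0.875
    ([0.0, 1.0], [1.0, 1.0]),            # category 2: AP = 1.0
]
map_value = mean_average_precision(curves)
```

For mAP@.5, a prediction counts as a true positive when its IoU with a ground-truth box of the same category is at least 0.5; that matching step is assumed to have been done before the counts and curves are formed.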

Ablation study

Lightweight processing

The goal of this paper is to study lightweight detectors for small defects in industrial products. To ensure that the network remains lightweight after the overall modification, we first modified YOLOv5s before the experiments, and this modification did not cause any loss in its performance. Deleting or replacing parts of the original network structure is an effective way to reduce the number of network parameters. Because the backbone plays an important role in the feature extraction of the whole network, we lighten only the neck. Specifically, we replaced the four convolution operations in the neck of YOLOv5s with Ghost convolution [36]. This reduces the number of network parameters and also helps with potential overfitting. With this modification, we achieved an mAP of 74.5 on NEU-DET. The improvement in each metric is shown in Table 1.
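To see why replacing a standard convolution with a Ghost convolution reduces parameters, the following sketch counts weights for both, ignoring biases and batch normalization. It follows the general Ghost module idea, in which a primary convolution produces a fraction of the output channels and a cheap depthwise convolution generates the rest. The channel counts, kernel sizes and ratio are illustrative assumptions, not the values used in TD-Net.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, cheap_k=5, ratio=2):
    """Weight count of a Ghost convolution (no bias).

    A primary k x k convolution produces c_out / ratio intrinsic maps, and
    depthwise cheap_k x cheap_k filters generate the remaining maps from them.
    cheap_k and ratio are assumed values for illustration.
    """
    intrinsic = c_out // ratio
    primary = conv_params(c_in, intrinsic, k)
    cheap = (ratio - 1) * intrinsic * cheap_k * cheap_k  # one filter per map
    return primary + cheap


# Hypothetical layer: 256 input channels, 256 output channels, 3 x 3 kernel.
standard = conv_params(256, 256, 3)
ghost = ghost_conv_params(256, 256, 3)
saving = 1 - ghost / standard
```

For this hypothetical layer the standard convolution has 589,824 weights and the Ghost version 298,112, a reduction of roughly half, which is the kind of saving that makes room for the added modules.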

Table 1 Effect of lightweight processing on baseline

Comprehensive performance of the network structure

We performed incremental performance tests on top of the baseline for our proposed modules DD, SIFM and SIIM. Table 2 shows the performance improvement as each component is added. Starting from the lightweight baseline, our approach achieves a large accuracy improvement without greatly increasing the number of parameters and at a small cost in inference speed. It is worth noting that, once the real-time requirements are met, the accuracy improvement matters more. We first add DD after the lightweighting process, replacing the original stepwise convolution as the downsampling method of the backbone network, which brings a large improvement in accuracy without much cost in parameters or efficiency. This shows that it is feasible to make full use of the snapshot information to make up for the tiny defect features lost by the stepwise convolution. On top of DD, we add the SIFM module to replace every simple Concat operation in the PANet structure, allowing better fusion and information interaction of defect features at different scales. With SIFM, we improve mAP with essentially no increase in parameters or inference cost. This also shows that a simple Concat operation in the neck does not fuse the backbone features sufficiently across layers, which leads to incorrect identification of defect features of industrial products. To better solve this problem, SIIM achieves an accuracy improvement of 1.1 mAP by augmenting the semantic information of deep features with the semantic representation of shallow features, which reduces the rate of missed detection of tiny defects in industrial products.

Table 2 Comprehensive performance of the network structure
Table 3 Comparison of test results on the NEU-DET data set
Table 4 Comparison of test results on the Peking University PCB data set
Table 5 Comparison of test results on the GC10-DET data set

Fig. 11 mAP visualization

Comparison experiments

We first compare the proposed TD-Net with several SOTA methods on the NEU-DET data set. We then test TD-Net on the GC10-DET data set and the Peking University PCB defect data set to verify the generalization ability of the proposed method. Table 3 compares TD-Net with other SOTA methods on the NEU-DET data set across several metrics; all results were obtained by testing in the same environment. TD-Net achieves the best mAP, 76.8%. Compared with YOLOv3-tiny (53.3% mAP), YOLOv4-tiny (52.8% mAP), YOLOv7-tiny (74.5% mAP) and the baseline YOLOv5s (73.2% mAP), which have comparable parameter counts and real-time performance, TD-Net improves mAP by 23.5, 24.0, 2.3 and 3.6 percentage points, respectively. TD-Net is therefore well suited to the steel defect detection scenario. The comparison results on the Peking University PCB defect data set and the GC10-DET data set are shown in Tables 4 and 5, respectively. On the PCB defect data set, TD-Net achieves the highest mAP, 96.2%. On the GC10-DET data set, TD-Net achieves 71.5% mAP, again the highest among the compared SOTA methods. In addition, we reproduce the work of Zheng et al. [37], who improved YOLOv5x with good results. To keep the parameter counts comparable, we reproduce their method on the smaller YOLOv5s and call it Zheng-s. Overall, the results indicate that TD-Net is well suited to defect detection for industrial products.
Figure 11 shows the mAP of the proposed model and other SOTA models on the three data sets. As Fig. 11 shows, the proposed model obtains the best results in tiny object detection.

Figure 12 shows the detection results on the three data sets, where (a) is the detection result on the PCB defect data set from Peking University, (b) is the detection result on the GC10-DET data set, and (c) is the detection result on the NEU-DET data set. Although the defects are very small, TD-Net still accurately classifies and locates tiny defects in hot-rolled steel and PCBs, which demonstrates its detection capability and generalization ability in defect detection scenarios.

Fig. 12 Visualization of partial detection results on the three data sets. (a) Detection results of our model on the Peking University PCB data set (b) Detection results of our model on the GC10-DET data set (c) Detection results of our model on the NEU-DET data set

Conclusion

In this paper, we address two problems in order to improve defect detection for industrial products: the loss of tiny-defect information during downsampling in the detector backbone, and the pre-filtering of conflicts when fusing deep and shallow semantic features. For this purpose, we propose TD-Net for tiny defect detection. For downsampling in the backbone network, the DD module is proposed to reduce the information loss of tiny defects. For conflict pre-filtering in deep and shallow semantic feature fusion, SIIM and SIFM are proposed. SIIM fuses deep semantic information to guide shallow features for better detection of tiny defects, and SIFM optimizes the structure of the feature fusion network PAN through cascade fusion and attention. The experimental results on several industrial product defect data sets validate the effectiveness of the proposed method. In the future, we will continue our research in the field of industrial surface defect detection, in particular on defect detection in glass bottles, and we will follow up on research related to unsupervised cross-domain methods and general-purpose large AI models for industrial domains.