Introduction

Defect detection is critical in industrial production: it ensures product quality, reduces losses, and improves the competitiveness of enterprises. Most domestic enterprises have replaced manual visual inspection with defect detection systems based on machine vision, thus improving detection efficiency. Traditional defect detection methods1,2 rely on manually labeled images and a supervised learning algorithm3 for training and detection. However, this approach has shortcomings. First, manual labeling is time-consuming and laborious, especially in large-scale industrial scenarios with enormous workloads. Second, because industrial scenarios are complex and varied, traditional supervised learning algorithms achieve good detection results only when sufficient labeled data are available. Unsupervised learning4,5,6 can use unlabeled samples directly, for example through clustering, among which the K nearest neighbor method7 is a classic unsupervised learning method. Still, its performance lags significantly behind that of supervised learning. Therefore, research on semi-supervised semantic segmentation for defect detection has great theoretical significance and practical value. Semi-supervised semantic segmentation enables the model to extract meaningful supervision from unlabeled data, supplementing the training provided by labeled data. This approach improves the accuracy and efficiency of defect detection by leveraging information from both labeled and unlabeled data. Researchers have already made some progress in this field. For example, reconstruction models based on an autoencoder (AE)8 or a generative adversarial network (GAN)9 aim to reconstruct the normal image with minimum error and locate anomalies according to the reconstruction error. However, due to the strong generalization ability of convolutional neural networks (CNNs), abnormal regions may also be reconstructed correctly in the inference stage, which violates the basic assumption of the reconstruction model10.

In recent years, embedding-based methods have shown better defect detection performance than reconstruction-based methods. At present, the standard semi-supervised approaches are mainly self-training and consistency learning. Self-training generates pseudo-labels for unlabeled data based on the knowledge learned from labeled samples. The model is then retrained on the combined data to improve its generalization. Noisy Student11 is a robust self-training scheme, but multiple training rounds introduce confirmation bias. FixMatch12 adopts consistency learning with a weak-to-strong strategy, in which pseudo-labels are generated from weakly augmented samples and used to supervise their strongly augmented counterparts. Subsequent work proposed complex pseudo-label filtering strategies; later methods such as FixMatch12 and UniMatch13 are all hybrid methods based on strong and weak perturbation consistency learning, each making different improvements. Defect detection is a special case of object detection, and the scale problem lies at the heart of every object detection task. DSMSA-Net14 is an encoder-decoder network that integrates attention modules to address the road segmentation task on high-resolution satellite imagery. The encoder extracts multi-scale features from different convolutional layers, while the decoder includes a Scale Attention Unit (SaAU) that leverages feature maps from different residual blocks in the encoder to capture multi-scale information, and a Spatial Attention Unit (SpAU) that enhances the spatial representation of the regions of interest and extracts meaningful contextual information. DSMSA-Net14 provides an important reference for addressing the multi-scale problem in defect detection. However, most current research focuses on image classification and semantic segmentation themselves. Research on semi-supervised semantic segmentation for defect detection in industrial scenarios remains relatively scarce, and the technology is not yet mature. For example, it is challenging to segment tiny defects accurately, and the edge segmentation accuracy for slender defects is insufficient. Existing models also have many parameters, which makes them ill-suited to industrial scenarios. To sum up, this paper makes the following main contributions:

  • The semi-supervised semantic segmentation model adds feature-space perturbation and performs cross pseudo-supervision in parallel with image-level perturbation.

  • A lightweight attention module and shallow feature fusion improve detection speed and segmentation accuracy.

  • A dynamic threshold strategy generates more reliable pseudo-labels.

  • Datasets are constructed for five scenarios: dirty bottom, broken bottle body, foreign body in noodles, folded label, and broken egg case, which provide a research basis for semi-supervised industrial defect detection.

Related work

Semi-supervised learning framework

Defects in industrial scenarios are usually diverse and complex, and may be affected by factors such as illumination change, angle change, size deviation, occlusion, and interference15. Perturbing the feature space makes the model tolerant to the absence of certain features and thus more robust: even when some features are missing or the input data are noisy, the model can still perform well. In addition, randomly dropping a subset of feature channels introduces different feature subset combinations, which helps the model learn richer feature representations, reduces missed and false detections, and thus improves production-line efficiency and product quality. Moreover, to meet the high-efficiency requirements of industrial scenarios, randomly dropping feature channels reduces the model’s parameters and computation and improves the efficiency of training and inference; in deep neural networks, fewer feature channels mean less computation and faster training. Therefore, to comprehensively improve the performance and generalization ability of the model, this paper proposes a new Two-Branch Perturbation Cross Pseudo Supervision16,17,18 model, hereafter referred to as TPcps. One branch perturbs the input image at the image level, and a second branch is added to strongly perturb the image feature space. The prediction results of the two branches are constrained to be consistent, as shown in Fig. 1. Specifically, labeled images are sent directly into the model to obtain prediction results, which are constrained by the cross-entropy loss function (CEloss), where \(p_{i}\) is the ground truth and \(q_{i}\) is the predicted value. The cross-entropy loss function is defined as follows:

$$\begin{aligned} \text{L}_{\text {sup}}=-\frac{1}{\left| D^{l}\right| } \sum _{x \in D^{l}} \frac{1}{W \times H} \sum _{i=0}^{W \times H} p_{i} \log q_{i} \end{aligned}$$
(1)

Unlabeled images enter the model through two branches. The first branch perturbs the unlabeled images at the image level. The perturbations include random geometric data augmentation (rotation, translation, and shearing) and non-geometric data augmentation such as flipping, cropping, and resizing, as well as noise injection and random image erasure. Even small perturbations in the pixel space can generate substantial noise in the feature maps after propagation through multiple layers of the network, causing the model to acquire incorrect information and diminishing the efficiency of model learning. The second branch perturbs the image at the feature level by randomly dropping features along the channel dimension. The two branches generate two prediction results, which serve as each other’s pseudo-labels. They are constrained by the cross pseudo-supervised loss function, so that inputs augmented at different levels tend toward the same prediction. The cross pseudo-supervised loss function is defined as follows:

$$\begin{aligned} \text{L}_{\text {pseudo}}=\frac{1}{\left| D^{u}\right| }\sum _{x\in D^{u}}\frac{1}{W\times H}\sum _{i=0}^{W\times H}\left( \ell _{ce}(q_{i},l^{*}_i)+\ell _{ce}(q^{*}_i,l_{i})\right) \end{aligned}$$
(2)

The final total loss is the sum of supervised loss and unsupervised loss, as shown in Eq. (3).

$$\begin{aligned} \text{L}=\text{L}_{\text {sup}}+\text{L}_{\text {pseudo}} \end{aligned}$$
(3)

where \(D^{l}\) is the labeled dataset of size N, \(D^{u}\) is the unlabeled dataset of size M, and W and H are the width and height of the input image. \(\ell _{ce}\) denotes the cross-entropy loss. At each position i, \(q_{i}\) and \(q^{*}_i\) are the predictions of the first and second branches, respectively. The label vector \(l_{i}\) (\(l^{*}_i\)) is a one-hot vector computed from \(q_{i}\) (\(q^{*}_i\)). The model structure combines self-training and consistency learning so that the model can make better use of the small amount of labeled defect data available in industrial scenarios, improving data utilization and enhancing model performance.
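
The three losses above can be illustrated with a minimal plain-Python sketch. It assumes that each image is a flattened list of per-pixel class-probability vectors and that the ground truth is given as class indices (equivalent to the one-hot \(p_{i}\)); the function names are hypothetical and are not taken from the authors’ implementation.

```python
import math

def cross_entropy(probs, target):
    """Per-pixel cross entropy: -log of the probability given to the target class."""
    return -math.log(max(probs[target], 1e-12))

def hard_label(probs):
    """One-hot pseudo-label l_i, stored as the index of the most confident class."""
    return max(range(len(probs)), key=probs.__getitem__)

def supervised_loss(pred_maps, gt_maps):
    """Eq. (1): cross entropy averaged over pixels, then over labeled images."""
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        total += sum(cross_entropy(q, p) for q, p in zip(pred, gt)) / len(pred)
    return total / len(pred_maps)

def cross_pseudo_loss(branch1_maps, branch2_maps):
    """Eq. (2): each branch is supervised by the other branch's hard pseudo-label."""
    total = 0.0
    for q_map, q_star_map in zip(branch1_maps, branch2_maps):
        image_loss = 0.0
        for q, q_star in zip(q_map, q_star_map):
            l, l_star = hard_label(q), hard_label(q_star)
            image_loss += cross_entropy(q, l_star) + cross_entropy(q_star, l)
        total += image_loss / len(q_map)
    return total / len(branch1_maps)

def total_loss(pred_maps, gt_maps, branch1_maps, branch2_maps):
    """Eq. (3): sum of the supervised and cross pseudo-supervised losses."""
    return supervised_loss(pred_maps, gt_maps) + cross_pseudo_loss(branch1_maps, branch2_maps)
```

When the two branches agree confidently on every pixel, the cross pseudo-supervised term approaches zero; disagreement between the branches is what produces the training signal on unlabeled images.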

Fig. 1 Semi-supervised learning framework.

Segmentation model and network improvement

Model architecture

Owing to the inherent properties of the scenes and data, industrial defect targets are weak and difficult to detect, the boundaries between different defect categories are unclear, and industrial settings require high-speed detection. This paper designs a new network structure, DeepLabv3Plus+Simplified MobileNetv2, which inherits the advantages of DeepLabv3Plus19. Using dilated convolution and multi-scale information fusion, the new network effectively captures the semantic information and detailed features in the image. At the same time, the lightweight MobileNetv2 backbone improves the network’s training and inference speed. The segmentation accuracy for small targets and defects is improved by combining the ECA module12 with the attention module, as shown in Fig. 2.

Fig. 2 Structure diagram of the semantic segmentation model.

The Xception backbone of DeepLabv3Plus has many parameters and long training and prediction times, and cannot meet industrial needs. We therefore replace it with a modified lightweight MobileNetv2 backbone, chosen for its efficiency, lightweight architecture, scalability, accuracy, and transfer learning capabilities. This effectively reduces the number of model parameters and the training time while improving prediction speed, making the model well suited to resource-constrained environments. To improve the resolution of the final output feature map, the stride of the fifth and seventh layer modules is changed from 2 to 1. The output feature map of the improved MobileNetv2 network then has four times the resolution of the original MobileNetv2 output, that is, 1/8 of the input image, and retains more detailed information. After the eighth layer, the number of channels in the original MobileNetv2 increases to 1280, and the number of parameters increases significantly. We therefore prune the original MobileNetv2 network and discard the last three layers, which are used for classification. The networks before and after modification are shown in Tables 1 and 2. Operator is the network module type of each layer; Input is the size of the input feature map; C is the number of output channels of the feature map; n is the number of times the module of this layer is repeated; and S is the convolution stride.

Table 1 Original MobileNetv2 network structure.
Table 2 Improved MobileNetv2 network structure.
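
The effect of the stride change on output resolution can be checked with a short calculation. The sketch below assumes the per-layer strides of the standard MobileNetv2 architecture for layers 1 to 8 (the contents of Table 1 are not reproduced here), and the helper names are illustrative only.

```python
from math import prod

# Assumed strides of layers 1-8 of the standard MobileNetv2
# (first convolution followed by seven bottleneck stages).
ORIGINAL_STRIDES = [2, 1, 2, 2, 2, 1, 2, 1]

def set_stride_one(strides, layers=(5, 7)):
    """Return a copy in which the given 1-indexed layers use stride 1."""
    return [1 if index + 1 in layers else s for index, s in enumerate(strides)]

def output_stride(strides):
    """Overall downsampling factor: the product of all layer strides."""
    return prod(strides)

modified_strides = set_stride_one(ORIGINAL_STRIDES)
print(output_stride(ORIGINAL_STRIDES))  # downsampling factor of the original backbone
print(output_stride(modified_strides))  # downsampling factor after the modification
```

Under these assumed strides the original backbone downsamples by a factor of 32 and the modified one by a factor of 8, which agrees with the fourfold resolution gain and the 1/8 output size stated above.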

Lightweight attention module

The semantic segmentation model places a lightweight attention module after the dilated convolution to improve the model’s attention to critical areas20, thus effectively extracting defective parts and reducing information duplication21. CBAM22 uses fully connected layers to map features when generating channel attention, but the computational cost of the fully connected layers is high. In its spatial attention module, a 7 × 7 convolution kernel with a large receptive field is used to aggregate broader spatial context features; this enlarges the receptive field but also increases the module’s parameters, which significantly limits its application in industrial scenarios.

To solve this problem, this paper uses a one-dimensional convolution to aggregate the channel features for channel attention. As shown in Fig. 3, the size of the one-dimensional convolution kernel is the number of neighboring channels that are aggregated. Because convolution shares its parameters, introducing one-dimensional convolution reduces the parameter count of the channel attention module to a constant. For spatial attention, this paper uses dilated convolution to aggregate the spatial features, which dramatically reduces the number of module parameters for a receptive field of the same size.

Fig. 3 Schematic diagram of attention mechanism.
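
The parameter savings discussed above can be made concrete with a rough count. The sketch assumes C = 256 input channels, a CBAM reduction ratio of 16, a one-dimensional kernel of length 5, and bias-free layers; these values are illustrative assumptions, not figures reported in this paper.

```python
def mlp_channel_params(channels, reduction=16):
    """Weights of a bias-free shared MLP C -> C/r -> C (CBAM-style channel attention)."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels

def conv1d_channel_params(kernel_size):
    """Weights of one shared 1-D kernel; independent of the channel count."""
    return kernel_size

def spatial_conv_params(kernel_size, in_channels=2, out_channels=1):
    """Weights of a 2-D convolution over the stacked [avg; max] descriptor maps."""
    return kernel_size * kernel_size * in_channels * out_channels

def effective_kernel(kernel_size, dilation):
    """Side length of the window spanned by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1

print(mlp_channel_params(256), conv1d_channel_params(5))   # channel attention weights
print(spatial_conv_params(7), spatial_conv_params(3))      # spatial attention weights
print(effective_kernel(3, 2))                              # window of a 3x3 kernel, dilation 2
```

With these assumptions, a 3 × 3 kernel at dilation 2 spans a 5 × 5 window using 18 weights, whereas a dense 5 × 5 kernel over the same two descriptor maps would need 50.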

The specific operation is as follows. Global average pooling and global max pooling are used to aggregate the spatial information of the feature map. A one-dimensional convolution with kernel length k then aggregates the information of the k channels in each channel’s neighborhood. The two convolved features are added element-wise and passed through the Sigmoid function to generate the channel attention \(C_F\in R^{C\times 1\times 1}\), which is then broadcast along the two spatial dimensions to \(R^{C \times H \times W}\) and multiplied element-wise with the input feature map to obtain the channel-attended feature map. The channel attention calculation is defined as follows:

$$\begin{aligned} C_F= & \,sig(f_{1D}^kAvgPool\left( F\right) +f_{1D}^kMaxPool\left( F\right) )\nonumber \\= & \, sig(f_{1D}^kF_{avg}^c+f_{1D}^kF_{max}^c) \end{aligned}$$
(4)

Here, sig denotes the Sigmoid function, and \(f^k_{1D}\) represents the one-dimensional convolution operation with a kernel size of k.

$$\begin{aligned} k=\left| {\frac{\log _{2}C}{2}}+{\frac{1}{2}}\right| _{\mathrm {{odd}}} \end{aligned}$$
(5)

The calculation of the k value23 is shown in Eq. (5), where C represents the number of channels of the input feature map, and \(\left| t \right| _{odd}\) represents the odd number closest to t.
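
As a hedged illustration of Eqs. (4) and (5), the plain-Python sketch below computes the adaptive kernel size and the per-channel attention weights for a feature map stored as a nested list of shape C × H × W. The function names, the zero padding at the channel borders, and the use of one shared kernel for both pooled descriptors are assumptions made for the example.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Eq. (5): the odd number closest to log2(C)/gamma + b/gamma."""
    t = (math.log2(channels) + b) / gamma
    return 2 * math.floor(t / 2) + 1

def conv1d_same(values, kernel):
    """1-D convolution along the channel axis with zero padding and a shared kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(values))]

def channel_attention(feature, kernel):
    """Eq. (4): sigmoid(conv1d(avg-pooled) + conv1d(max-pooled)), one weight per channel."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(max(row) for row in ch) for ch in feature]
    mixed = [a + m for a, m in zip(conv1d_same(avg, kernel), conv1d_same(mx, kernel))]
    return [1.0 / (1.0 + math.exp(-v)) for v in mixed]

# Example: 4 channels of size 2 x 2 and a uniform kernel of the adaptive length.
feature = [[[float(c), 1.0], [0.0, 2.0]] for c in range(4)]
k = eca_kernel_size(len(feature))
weights = channel_attention(feature, [1.0 / k] * k)
```

Each returned weight would then be broadcast over the H × W positions of its channel and multiplied with the input feature map.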

Similarly, effective spatial feature descriptors \(F^s_{avg}\) and \(F^s_{max}\) are generated in spatial attention. Dilated convolution is then used to encode and map the regional information that needs to be emphasized or suppressed in space, so that spatial context information is aggregated more efficiently. These weights are multiplied with the original input feature map, and after convolution, deep features containing multi-scale context information are generated. Finally, the shallow and deep features are concatenated to achieve accurate segmentation of the defective part. The calculation of spatial attention is defined as follows:

$$\begin{aligned} S_F= & \,sig(f_{dilat}^{3\times 3}\left( [AvgPool\left( F\right) ;MaxPool\left( F\right) ]\right) )\nonumber \\= & \,sig\left( f_{dilat}^{3\times 3}\left( [F_{avg}^s;F_{max}^s]\right) \right) \end{aligned}$$
(6)

Here, \(f^{3 \times 3}_{dilat}\) represents the dilated convolution with a kernel size of 3 × 3, and the experiments use a dilation rate of 2. By introducing the lightweight attention modules, the model focuses more on essential regions within the image and pays less attention to irrelevant information. This improves both the efficiency and accuracy of detection and allows the model to adapt better to the varied conditions of industrial production lines, including variations in lighting, angles, and backgrounds. As a result, the model’s generalization capability and practical application value are significantly improved. Figure 4 compares the heatmaps before and after the lightweight attention modules were added.

Fig. 4 Effects of adding lightweight attention modules.
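
The spatial attention of Eq. (6) can be sketched in plain Python as follows. The feature map is a nested list of shape C × H × W, the 3 × 3 weights are indexed as [descriptor][row][column], and zero padding is assumed so that the output keeps the H × W size; the function name and these conventions are illustrative assumptions.

```python
import math

def spatial_attention(feature, weights, dilation=2):
    """Eq. (6): sigmoid of a dilated 3x3 convolution over [channel-avg; channel-max]."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    avg = [[sum(feature[c][y][x] for c in range(channels)) / channels
            for x in range(width)] for y in range(height)]
    mx = [[max(feature[c][y][x] for c in range(channels))
           for x in range(width)] for y in range(height)]
    descriptors = [avg, mx]  # the two-channel map [F_avg^s; F_max^s]
    attention = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for d, plane in enumerate(descriptors):
                for ky in range(3):
                    for kx in range(3):
                        yy = y + (ky - 1) * dilation
                        xx = x + (kx - 1) * dilation
                        # Positions outside the map contribute zero (zero padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += weights[d][ky][kx] * plane[yy][xx]
            row.append(1.0 / (1.0 + math.exp(-acc)))
        attention.append(row)
    return attention

# Example: two 4 x 4 channels of ones and a kernel that only passes the centre of F_avg.
ones = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
centre_only = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
               [[0.0] * 3 for _ in range(3)]]
attention_map = spatial_attention(ones, centre_only)
```

The resulting H × W map holds one weight per spatial position, which is multiplied with every channel of the input feature map.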

Shallow feature fusion

The traditional DeepLabv3Plus model does not fully exploit the shallow feature maps in the decoder, so the extracted local details are not rich enough and the model’s predictions are less accurate. To solve this problem, this paper draws on the idea of a feature pyramid and ECA24, adding a 1/2-size feature map to the 1/4-size feature map already in use. As shown in Fig. 5, shallow and middle features have higher resolution and show details such as color, texture, and edges25. Deep features have low resolution but carry stronger high-level semantic information. To reduce the loss of detail caused by downsampling the deep features, the shallow features output by the backbone network are downsampled and fused with the middle features to obtain more detailed image features (shallow feature fusion, SFF). The number of channels is adjusted by convolution, the ReLU activation function, and a regularization operation to prevent too much low-level information from affecting the expression of deep semantic information, which effectively improves the accuracy of small-target segmentation with few parameters.

Fig. 5 Structure of shallow feature fusion.

The specific operation is shown on the right in Fig. 5. A and B are the feature maps generated during downsampling in the encoder, with sizes of 1/2 and 1/4 of the input and 16 and 24 channels, respectively. Feature map A passes through a dilated convolution block with a kernel size of 3, stride of 2, and dilation rate of 2 to produce feature map C, which is 1/4 of the input size. The dilated convolution expands the receptive field, while the ECA module removes noise and learns more useful feature information; D is the feature map after the ECA module. C and D are concatenated to obtain E, whose number of channels is reduced by a 1 × 1 convolutional block to give a new feature map. The feature map output by the ASPP module is concatenated with the feature map obtained after the 1 × 1 convolution block and 4× upsampling. The model’s prediction is obtained by applying a 3 × 3 convolutional block once to the concatenated feature map and then upsampling by a factor of four.
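
The size relationship between feature maps A and C can be verified with the standard convolution output-size formula. The sketch below assumes a 512 × 512 input image and a padding of 2, neither of which is stated in the text; the function name is illustrative.

```python
def conv_out_size(size, kernel=3, stride=2, dilation=2, padding=2):
    """Spatial output size of a convolution (standard floor-based formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

input_size = 512               # assumed input resolution
size_a = input_size // 2       # feature map A: 1/2 of the input
size_b = input_size // 4       # feature map B: 1/4 of the input
size_c = conv_out_size(size_a) # feature map C after the dilated, stride-2 block
print(size_a, size_b, size_c)
```

Under these assumptions C has the same spatial size as B, so the two paths can be concatenated along the channel dimension without further resizing.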

Dynamic threshold strategy

A dynamic threshold strategy is proposed to prevent the two branches of the model from generating the same incorrect pseudo-label simultaneously12,19,26. Incorrect pseudo-labels cause the TPcps model to learn wrong label information, thus reducing detection accuracy. In the first training stage, our method selects reliable data according to the stability of the pseudo-labels. At this stage the method lowers the pseudo-label selection threshold: the model has not yet learned discriminative features and its predictions are not accurate enough, so a high pseudo-label threshold is not applicable. As the model learns the discriminative characteristics of the data, it raises the threshold according to the feedback of the learning effect evaluation index, refines the defect edges, and further improves semantic segmentation accuracy. Experiments show that the dynamic threshold method effectively solves the problem of both branches generating incorrect pseudo-labels simultaneously, making the pseudo-labels more reliable.

The dynamic threshold strategy first needs to evaluate the model’s training effect at a particular moment. This paper proposes an effective learning effect evaluation index and dynamically adjusts the strategy accordingly. Specifically, the learning effect of a defect category is determined by a score, and \(S_t\) represents the learning effect of the defect at time t, as shown in Eq. (7). Before model training, a fixed threshold \(\tau\) is set, and the expression of learning performance is as follows:

$$\begin{aligned} S_t(c)=\sum _{n=1}^N \mathbb {1}\left( \max \left( p_{m,t}\left( y|u_n\right) \right) >\tau \right) \cdot \mathbb {1}\left( \arg \max \left( p_{m,t}\left( y|u_n\right) \right) =c\right) \end{aligned}$$
(7)

For each defect category c, Eq. (7) counts the samples whose predicted class is c and whose prediction confidence exceeds the fixed threshold \(\tau\). A dynamic strategy is also employed in the normalization. In the early stages of training, when many samples are still undetermined, the number of unused samples \(N-\Sigma _c \sigma _t\) dominates the denominator, where \(\sigma _t\) denotes the per-class learning effect \(S_t(c)\). In the later stages of training, once most samples have been selected at least once, the denominator becomes the maximum of the estimated learning effects over all classes, denoted \(max S_t\). The normalization is defined as follows:

$$\begin{aligned} Norm_{t}={\frac{S_{t}(c)}{{\text {max}}\left( {\text {max}}S_{t},N-\sum _{c}\sigma _{t}\right) }} \end{aligned}$$
(8)

The final threshold is determined as shown in Eq. (9), where \(\beta _{t}(c)\) is the normalized learning effect \(Norm_{t}\) of Eq. (8). By combining threshold warm-up, the dynamic threshold strategy, and this normalization, the class with the best prediction accuracy has a normalized learning effect of 1. After applying Eq. (9), its threshold becomes \(\tau\), which is the upper limit of the dynamic threshold.

$$\begin{aligned} \tau _{t}(c)=\beta _{t}(c){\cdot }\tau \end{aligned}$$
(9)
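
Eqs. (7) to (9) can be traced with a small plain-Python sketch. It assumes that each unlabeled sample is represented by a list of class probabilities and that \(\tau = 0.9\), the initial prediction threshold used in the experiments; the function names are hypothetical.

```python
def learning_effect(probabilities, num_classes, tau):
    """Eq. (7): per class, count samples predicted as that class with confidence > tau."""
    counts = [0] * num_classes
    for probs in probabilities:
        predicted = max(range(num_classes), key=probs.__getitem__)
        if probs[predicted] > tau:
            counts[predicted] += 1
    return counts

def dynamic_thresholds(probabilities, num_classes, tau=0.9):
    """Eqs. (8)-(9): normalise the learning effect and scale the fixed threshold."""
    counts = learning_effect(probabilities, num_classes, tau)
    unused = len(probabilities) - sum(counts)  # samples not yet confidently predicted
    denominator = max(max(counts), unused)
    if denominator == 0:                       # only possible with no samples at all
        return [0.0] * num_classes
    return [tau * count / denominator for count in counts]

# Four unlabeled samples, two classes.
probabilities = [[0.95, 0.05], [0.92, 0.08], [0.60, 0.40], [0.05, 0.95]]
thresholds = dynamic_thresholds(probabilities, num_classes=2)
```

In this toy example the better-learned class keeps the full threshold of 0.9, while the other class receives a lower threshold of 0.45, so more of its pseudo-labels are admitted early in training.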

Experiment

Datasets

First, the defects in industrial scenarios are classified, and datasets are built according to the characteristics of each category and existing industrial inspection problems. The first type is additive anomalies: dirt, foreign matter, and adhesion. For this type, we designed Bottle Bottom Spoils and Noodles Adhesion detection. The second type is loss anomalies: incomplete, scratched, and broken parts. For this type, we designed bottle breakage and poultry egg case breakage detection. The third type is replacement anomalies: mixed color, different color, impurities, and confusion. For this type, we designed the detection of black spots, impurities, and discoloration in noodles. The fourth type is deformation anomalies: distortion, crease, and fold. For this type, we designed bottle label distortion and fold detection. The specific data of each category are shown in Table 3.

Table 3 Description of industrial defect categories.

In this experiment, five scene datasets are designed, covering simple and complex scenarios, and the defect types include both binary and multi-class cases. The contents of each dataset are shown in Table 4. A labeling tool is used to mark the defects in the defective samples. The labeled samples participating in training account for 1/2, 1/4, 1/8, and 1/16 of the total training set, respectively, and the remaining samples enter the unlabeled branch as unlabeled images.

Table 4 DataSet description.

Experimental environment and benchmark experiment

This paper’s method is run on the Ubuntu 22.04.1 operating system with a GeForce RTX 3090 GPU and a 12th Gen Intel(R) Core(TM) i7-12700K CPU. The detailed specifications are shown in Table 5.

Table 5 Training and testing environment.

In the training stage, parallel and distributed training strategies based on the PyTorch framework are adopted. Depending on the configuration of the experimental environment and the available GPU resources, different data-parallel strategies are used to optimize training efficiency. The model can use multiple GPUs to process batch data in parallel, computing forward and backward propagation in parallel and then synchronizing and updating the model parameters on the primary device, which speeds up training. As for the implementation details, this paper uses an SGD optimizer with a momentum of 0.9 and a polynomial learning rate scheduler, with an initial learning rate of 0.001. The learning rate is adjusted dynamically as the number of training epochs increases, and the initial prediction threshold is set to 0.9. Random data augmentation within the scale of [0.5, 2.0] is adopted for image-level perturbation, realized by a random combination of geometric perturbations, non-geometric perturbations, and image erasure.
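
For reference, a polynomial learning rate schedule of the kind mentioned above is commonly written as shown below. The initial learning rate of 0.001 comes from the text, whereas the decay power of 0.9 and the function name are assumptions, since the paper does not state the exponent.

```python
def poly_learning_rate(step, total_steps, base_lr=0.001, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [poly_learning_rate(step, 100) for step in (0, 50, 100)]
```

The rate starts at the initial value and decays smoothly to zero at the final step.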

Comparative experiment

Comparison between self-built datasets and mainstream semi-supervised methods

In the context of semi-supervised semantic segmentation for defect detection, challenges such as low contrast and background noise significantly impact model performance. Low-contrast images often conceal defects that fuse with the background, rendering traditional segmentation methods ineffective. To address this, this paper employs multi-scale feature fusion capable of capturing rich contextual information. For instance, in images of Label Folds with minimal colour differentiation, conventional methods struggle, whereas our network enhances feature maps to identify subtle structural variations, as shown in Fig. 6a.

Fig. 6 Low contrast and background noise.

Similarly, background noise can mislead the model by introducing irrelevant features. For instance, in the Broken Egg Shell dataset, the cracks in the egg shells are affected by the intrinsic patterns and spots on the eggs, as shown in Fig. 6b. We mitigate this issue by incorporating attention mechanisms that allow the model to focus on critical defects while disregarding noise. Additionally, data augmentation and multi-task learning strategies further enhance the model’s robustness against background interference. Together, these methodologies not only improve the accuracy of defect detection but also provide a pathway for future advancements in this domain.

This paper compares the proposed method with recent SOTA semi-supervised learning methods on the above industrial datasets. We evaluated several key metrics, including mean Intersection over Union (mIoU), F1 score, and pixel accuracy. The performance metrics are summarized in Table 6. In low-contrast images, such as Label Folds, our framework identified defects better than AugSeg and DGCL. The attention mechanism helped isolate the defects from noisy backgrounds, resulting in clearer segmentations, as shown in Fig. 6. In the Broken Egg Shell dataset, where the cracks are affected by the intrinsic patterns and spots on the eggs, our proposed model outperforms the other methods: it successfully highlights the defects while minimizing the false positives related to background interference.
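
For readers who wish to reproduce the evaluation, the metrics named above can be computed from a confusion matrix as sketched below. The sketch assumes flattened lists of integer class labels for the ground truth and the prediction; the function names are illustrative, and the per-class F1 scores are returned without averaging because the averaging scheme is not specified here.

```python
def confusion_matrix(ground_truth, prediction, num_classes):
    """matrix[g][p] counts pixels of true class g that were predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(ground_truth, prediction):
        matrix[g][p] += 1
    return matrix

def segmentation_metrics(ground_truth, prediction, num_classes):
    """Return mIoU, pixel accuracy, and the per-class IoU and F1 scores."""
    matrix = confusion_matrix(ground_truth, prediction, num_classes)
    ious, f1_scores = [], []
    for c in range(num_classes):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(num_classes)) - tp
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if union else 0.0)
    correct = sum(matrix[c][c] for c in range(num_classes))
    return {
        "mIoU": sum(ious) / num_classes,
        "pixel_accuracy": correct / len(ground_truth),
        "IoU": ious,
        "F1": f1_scores,
    }

# Toy example: four pixels, background = 0, defect = 1.
metrics = segmentation_metrics([0, 1, 1, 1], [0, 0, 1, 1], num_classes=2)
```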

Additionally, the proposed method outperforms the SOTA methods in handling variations in lighting conditions, defect types, and background complexity. The proposed framework was further evaluated on the Noodles dataset, which contains three common defect types in noodle production: discolouration, foreign objects, and black spots, with significant variations in shape, texture, and visual characteristics. The Noodles dataset encompasses a wide range of illumination, from bright and uniform to dark and uneven, as shown in Fig. 7.

Fig. 7 Noodles dataset defect detection.

The model detected and segmented this diverse range of defect types accurately and maintained high detection accuracy even under complex, non-uniform lighting conditions by focusing on the distinctive visual features of the defects themselves rather than relying solely on local brightness information. It also suppressed background noise and reliably identified defect regions against complex backgrounds, which indicates the framework’s suitability for real-world noodle production defect detection.

To further validate the robustness of our proposed framework, we conducted extensive experiments across diverse real-world scenarios and different datasets. We evaluated the model’s performance under varying lighting conditions, defect types, and background complexities. The results are shown in Fig. 8, where 1/4 of the labelled data samples are used as the supervised part of training.

Table 6 Performance metrics of the proposed framework compared to SOTA methods.
Fig. 8 Comparison effect diagram with SOTA method.

It can be seen from Fig. 8 that the proposed method segments details such as small targets and edges better, and its edge segmentation is clearer. The proposed framework’s superior performance across these diverse datasets can be attributed to its ability to capture the distinctive features of defects while suppressing irrelevant background information. The model’s architectural design and training strategies enable it to generalize well to various types of defects and environmental conditions, making it a promising solution for real-world industrial inspection tasks. However, the framework’s performance may degrade under certain extreme conditions, such as high-speed motion blur or severe occlusions, which could limit its applicability in some industrial scenarios. A second limitation is that the framework currently handles only 2D image data and cannot directly process 3D or multi-dimensional data. Future work could extend the framework to support more diverse forms of input data.

In addition to using 1/4 labeled data samples as the supervision part of training, this paper also uses 1/16, 1/8, and 1/2 labeled samples to carry out comparative experiments. The results are shown in Table 7. Table 7 shows the experimental comparison of eggshell breakage detection. The dataset contains 3,436 images, 2,500 of which are used for training models, and 1/16 of the dataset is set as labeled samples to enter the supervised learning branch. To better illustrate the significance of semi-supervised algorithms, we also report the results obtained solely using labelled images. Please refer to the SupOnly results in the first row under each model, where it is evident that the improvement provided by the semi-supervised algorithms over SupOnly is substantial.

Table 7 Comparison results with the SOTA method.

Comparison between self-built datasets and mainstream segmentation models

In addition, this paper compares three network models with excellent semantic segmentation performance. Given the demand for defect detection efficiency in industrial scenarios, the experiment adds prediction speed and model parameter evaluation indicators to evaluate its algorithm performance comprehensively. The comparison results are shown in Table 8.

Table 8 Comparison results of different segmentation models.

The data in Table 8 shows that PSPNet+MobileNetv2 has advantages in speed and model parameters but relatively low semantic segmentation accuracy. In contrast, the proposed method, Modified DeepLabv3Plus+Simplified MobileNetv2, has a speed second only to PSPNet+MobileNetv2 and achieves the highest detection accuracy. Therefore, it offers a more robust overall performance and is more suitable for semantic segmentation defect detection tasks.

Comparison between public datasets and mainstream semi-supervised methods

To further verify the effectiveness of the proposed semi-supervised training strategy, we compared the proposed model with other SOTA semi-supervised semantic segmentation methods on public datasets. The dataset used in the experiment, KolektorSDD, consists of images of defective electrical commutators provided and annotated by Kolektor Group; samples are shown in Fig. 9. There are tiny damages or cracks on the surface of the plastic embedding of the commutators. The dataset contains 399 images, of which 52 show visible defects and 347 show no defects.

Fig. 9

Example of KolektorSDD dataset.

In the experiment, one-quarter or one-half of the training set was used as the supervised learning part (90 or 180 images) and the rest as the unsupervised learning part (270 or 180 images). The experimental results are shown in Table 9.

Table 9 Comparison results with the SOTA method in public datasets.

In the public dataset KolektorSDD, our proposed method achieved the best results in experiments with both 1/4 labeled data and 1/2 labeled data. To further validate our method’s high real-time performance and efficiency in industrial scenarios, we conducted experiments on a large-scale dataset, Steel.

The production process of steel is exacting, involving multiple stages from heating and rolling to drying and cutting35. Several machines come into contact with the steel before it is ready to ship, leading to various defects. Fast and accurate defect detection for steel is therefore essential in industrial production. Example images are shown in Fig. 10.

Fig. 10

Example of steel dataset.

This dataset is sourced from the Kaggle website, and all images are standardized to a resolution of 1600 × 256 pixels. The training set consists of 12,568 images, and the test set contains 5,506 images. Our method uses 1/4 of the training set as labeled data, while the remaining part is used as unlabeled data for model training. We conducted experiments using different network models, and the specific results are shown in Table 10.

Table 10 Comparison results of different segmentation models.

The proposed method based on the improved DeepLabv3Plus achieves the highest mIoU and mPA, improving by 2.99% and 6.53% over the original method and by 19.48% and 24.67% over U-Net. In inference time and frame rate, our method lags behind U-Net by only 0.42 ms and 2.36 fps. Its inference time is 47% of, and its frame rate 2.1 times, that of the original DeepLabv3Plus model. It also outperforms the other mainstream semantic segmentation network models. Considering all four evaluation metrics, the improved DeepLabv3Plus semantic segmentation model demonstrates stronger overall performance on the Steel dataset.

Furthermore, this paper assesses and compares GPU memory usage across different algorithms to provide insight into their practical deployment. The experiment was conducted using GeForce RTX 3090 GPU hardware (10,496 CUDA cores and 24 GB of memory). The batch size was fixed at 16, and the number of workers was set to 8. The GPU memory usage of the different algorithms is shown in Table 11.

Table 11 GPU memory usage and utilization.
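The mIoU and mPA metrics reported in Table 10 and in the ablation study below can be computed from a per-class confusion matrix. The following sketch uses the standard definitions (per-class intersection over union, and per-class pixel accuracy, each averaged over classes); the function names and the toy labels are illustrative assumptions, not the evaluation code used in the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def miou_and_mpa(matrix):
    """Mean intersection-over-union and mean pixel accuracy over classes.

    A class with an empty union is left out of the IoU average, and a class
    with no ground-truth pixels is left out of the accuracy average.
    """
    n = len(matrix)
    ious, accs = [], []
    for c in range(n):
        tp = matrix[c][c]
        gt = sum(matrix[c])                         # pixels labeled c
        pred = sum(matrix[r][c] for r in range(n))  # pixels predicted as c
        union = gt + pred - tp
        if union > 0:
            ious.append(tp / union)
        if gt > 0:
            accs.append(tp / gt)
    return sum(ious) / len(ious), sum(accs) / len(accs)

# Toy example: 0 = background, 1 = defect, flattened pixel labels.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
miou, mpa = miou_and_mpa(confusion_matrix(y_true, y_pred, 2))
print(miou, mpa)  # 0.6 0.75
```

In the toy example each class has 3 correctly classified pixels out of 4, and a union of 5 pixels, which gives an IoU of 3/5 and a pixel accuracy of 3/4 for both classes.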

Ablation experiment

In this paper, the model is improved according to the characteristics of different defects, and ablation experiments are conducted to verify the effectiveness of each improvement. Fig. 11 presents the defect detection results under the different improvement measures. Each improvement to the semi-supervised defect detection framework is effective: the defect edge segmentation becomes progressively more accurate, and even small defects are detected as the model is improved.

Fig. 11

Defect detection effect under different improvements. (a) Input image; (b) mask; (c) single-branch disturbance (strong and weak enhancement disturbance) at the image level; (d) double-branch disturbance at the image level and feature level; (e) improved network DeepLabv3Plus+Simplified MobileNetv2; (f) added attention to the defective parts in the channel and spatial dimensions, and added shallow feature fusion.

The model improvements must also take speed and model size into account. Table 12 lists the results of the ablation experiments, where the baseline is the original double-branch disturbance based on strongly and weakly enhanced images, with DeepLabv3Plus+Resnet50 as the segmentation model. On this basis, the proposed TPCPS semi-supervised framework increased mIoU by 11.41% over the baseline but inevitably reduced the detection speed from 55.01 fps to 48.32 fps.

To meet the high real-time requirements of industrial applications while maintaining detection accuracy as far as possible, this paper modified the original DeepLabv3Plus+Resnet50 network, reducing the model parameters by 22.34M and increasing the speed to 70.38 fps. To further meet the demand for high-precision defect detection, we added the LAM and SFF modules to the semi-supervised defect detection framework, which significantly increased mIoU (by 4.95% after adding the LAM module and by 4.75% after adding the SFF module) at the cost of a slight increase of 1.23M in model parameters.

In resource-constrained environments, it is worthwhile to trade a slight speed reduction for a significant improvement in detection accuracy, and the addition of the LAM and SFF modules is necessary. Further research could investigate more efficient network architectures or model compression techniques to achieve even higher inference speeds while maintaining improved detection performance.

Table 12 Ablation experiments with different modules added.

Conclusion

This paper proposes a semi-supervised semantic segmentation method for industrial scenarios. It adds disturbance in the feature space and carries out cross pseudo-supervision in parallel at the image level. The method outputs accurate pixel-wise features, achieves accurate segmentation of defective regions, and detects defects using only a small amount of labeled data, which reduces the high labor cost of annotation.
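The cross pseudo-supervision idea mentioned above can be illustrated with a small sketch in which two branches each learn from the other's hard pseudo-labels on unlabeled pixels. This shows only the generic cross pseudo-supervision loss, not the paper's TPCPS framework: the feature-level disturbance is not modeled, and the function names and probability values are hypothetical.

```python
import math

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of hard labels under per-pixel probabilities."""
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

def cross_pseudo_supervision_loss(probs_a, probs_b):
    """Each branch is supervised by the other branch's hard pseudo-labels."""
    # In training, pseudo-labels are treated as constants (no gradient flows).
    pseudo_a = [argmax(p) for p in probs_a]
    pseudo_b = [argmax(p) for p in probs_b]
    return cross_entropy(probs_a, pseudo_b) + cross_entropy(probs_b, pseudo_a)

# Per-pixel [background, defect] probabilities from two hypothetical branches.
branch_a = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]
branch_b = [[0.8, 0.2], [0.2, 0.8], [0.4, 0.6]]
loss = cross_pseudo_supervision_loss(branch_a, branch_b)
print(f"{loss:.4f}")
```

The loss is small when the two branches agree confidently and grows when they disagree, which is what pushes the branches toward consistent predictions on unlabeled images.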

Moreover, for the segmentation model, DeepLabv3Plus, which currently has excellent semantic segmentation performance, was selected and modified. With the lightweight network MobileNetv2 as the backbone, detection efficiency is improved while the accuracy of semantic segmentation is preserved. To address the issues of missed and false detections, shallow fusion features were added after the backbone network to reduce detail loss and improve the accuracy of small target segmentation. Additionally, lightweight channel and spatial attention mechanisms were incorporated after dilated convolution to enhance focus on pixels surrounding defects at two levels. This approach ensures more accurate defect edge segmentation and maintains robustness in complex and variable industrial scenarios.

In addition, because industrial scenarios involve numerous defects of varying detection difficulty, this method employs a dynamic threshold strategy. Initially, a low threshold is used so that the model fits the defects better, while a higher threshold in the later stages focuses on segmentation details to ensure final detection accuracy. Extensive experiments under various data settings demonstrate that this method significantly improves model performance and surpasses other mainstream methods in industrial scenarios.
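A dynamic threshold of this kind might look like the sketch below. It assumes a linear ramp between two illustrative endpoints (0.6 and 0.95) and an ignore index of 255; the text only states that the threshold starts low and becomes higher later, so the schedule shape, the endpoint values, and the function names are all assumptions.

```python
def dynamic_threshold(step, total_steps, low=0.6, high=0.95):
    """Confidence threshold that rises linearly from `low` to `high`.

    Assumption: a linear ramp with illustrative endpoints; the actual
    schedule is not specified in the text.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return low + (high - low) * progress

def select_pseudo_labels(confidences, labels, threshold, ignore_index=255):
    """Keep pseudo-labels whose confidence reaches the threshold; mask the rest."""
    return [y if c >= threshold else ignore_index
            for c, y in zip(confidences, labels)]

confidences = [0.97, 0.80, 0.65, 0.55]  # max class probability per pixel
labels = [1, 1, 0, 1]                   # argmax class per pixel
for step in (0, 500, 1000):
    t = dynamic_threshold(step, total_steps=1000)
    print(step, round(t, 3), select_pseudo_labels(confidences, labels, t))
```

Early in training most pseudo-labels pass the low threshold and contribute to the loss; as the threshold rises, only high-confidence pixels remain, so the later updates are driven by the more reliable pseudo-labels.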

In terms of potential future research directions and broader impact considerations, the proposed semi-supervised defect detection framework could be further enhanced by exploring more efficient network architectures, expanding to a wider range of defect types, and integrating with automated inspection systems. Beyond the specific applications presented, this framework has the potential to be applied in diverse industrial domains, thereby improving production efficiency and quality control. Moreover, such advancements could lead to broader social benefits, such as ensuring product safety and reducing environmental waste, while also considering ethical implications related to data privacy and algorithmic bias as the technology is integrated into larger automated inspection systems.