CISO: Co‑iteration semi‑supervised learning for visual object detection

Semi-supervised learning offers a solution to the high cost and limited availability of manually labeled samples in supervised learning. In semi-supervised visual object detection, unlabeled data can significantly enhance the performance of deep learning models. In this paper, we introduce an end-to-end framework named CISO (Co-Iteration Semi-Supervised Learning for Object Detection), which integrates a knowledge distillation approach with a collaborative, iterative semi-supervised learning strategy. To maximize the utilization of pseudo-labeled data and address its scarcity under high threshold settings, we propose a Mean Iteration approach in which all unlabeled data is used in every training iteration. Pseudo-labeled data with high confidence is extracted based on an ever-changing threshold (the average intersection over union of all pseudo-labeled data). This strategy not only ensures the accuracy of the pseudo-labels but also optimizes the use of unlabeled data. Subsequently, we apply a weak-strong data augmentation strategy to update the model. Lastly, we evaluate CISO with the Swin Transformer model and conduct comprehensive experiments on MS-COCO. Our framework shows impressive results, outperforming the state-of-the-art methods by 2.16 mAP and 1.54 mAP with 10% and 1% labeled data, respectively.


Introduction
Deep learning [2,20,55,58] has achieved remarkable progress in computer vision, natural language processing, and speech recognition [46,57]. Visual object detection, a fundamental task in the field of computer vision, has seen the emergence of deep learning-based methods and extensive performance evaluation. It is worth noting that our proposed CISO framework outperforms most SSOD methods and achieves superior performance. The contributions of this paper are as follows: (1) We propose CISO, a collaborative and iterative SSOD framework that extensively leverages unlabeled data. In addition, knowledge distillation and weak-strong data augmentation are applied in our framework to improve model accuracy and efficiency. (2) To reduce the number of incorrect pseudo-labels and avoid the overfitting caused by the inability to update pseudo-labels, we propose the Mean Iteration method, a pseudo-label selection scheme based on the average IoU value. (3) We test and validate CISO on the MS-COCO dataset with extensive experiments. The results show that our proposed method achieves advanced performance. We also perform ablation experiments to analyze our method.
In the rest of the paper, we present related work in Section 2. Our methodology is discussed in Section 3. Section 4 presents the analysis of the experimental results, and Section 5 the ablation study. Finally, our conclusions are drawn in Section 6.

Visual object detection
Visual object detection is a popular research direction in computer vision. It is widely employed across industries, where it reduces labor costs and has important social significance [14,16,28,37,47]. At present, visual object detection algorithms can be grouped into two categories: end-to-end, one-stage networks [24,32,44], which dominate in training efficiency, such as the YOLO family [32,45]; and two-stage networks [9,10,33], which use region proposals with a CNN for feature extraction and classification, such as R-CNN, Fast R-CNN, and Faster R-CNN [33], often built on ResNet backbones.
Recently, Transformers with a self-attention mechanism have also been employed in various tasks, including visual object detection, image classification, image segmentation, and video detection. Transformer models have not only received increasing attention but also achieved good results [26], such as DETR for visual object detection [5]. However, most of these methods require training on large amounts of labeled data, which is labor-intensive and time-consuming. Improving the performance of object detection models through semi-supervised learning has therefore become increasingly important and deserves much attention. We adopt the Swin Transformer [26] in this article to develop our framework.

Semi-supervised learning
Semi-supervised learning [59] aims to generate pseudo-labels for unlabeled data samples by training on a small number of labeled samples, typically with a much larger amount of unlabeled data than labeled data. Several methods [1,2,13] apply semi-supervised learning to visual object detection. The core idea of Semi-Supervised Object Detection (SSOD) is to make full use of unlabeled data to improve the performance of the model. Currently, consistency-based learning and pseudo-label-based learning are the two main research directions of SSOD; the former can be viewed as using soft pseudo-labels, while the latter uses hard pseudo-labels. Early SSOD methods include CSD [15], which is based on consistency learning and proposes background elimination.
STAC [40] proposes an SSOD method based on hard pseudo-labels and also uses consistency learning. After that, Instant-Teaching [60] improved on STAC by implementing instant pseudo-label training, and the Unbiased Teacher [25] approach addressed the class imbalance problem. Moreover, data augmentation is effective in improving SSOD [22,25,60], for example Mixup [60] and Cutout [22]. Building on these approaches, we focus on the efficient use of unlabeled data as a means to improve model performance.

Knowledge distillation
Knowledge distillation, which is essentially model compression [12,54], was first applied to classification tasks in a simple way. Unlike quantization and pruning methods, knowledge distillation uses a teacher-student network, where the output of the teacher network is the knowledge and the student network transfers that knowledge through distillation. The teacher network has higher performance and accuracy, and a more complex structure, than the student network. There are two methods of knowledge acquisition in knowledge distillation: one uses one-stage features [29,30,34], while the other transfers knowledge through multi-stage information [11,18,51]. Knowledge distillation can lead to better model performance, reduce model latency, and compress network parameters [12]. Therefore, in this article, we add a knowledge distillation method to our framework to improve model performance.

The structure of our framework
Figure 1 illustrates our CISO framework. We split the whole training process into three stages. In the first stage, small batches of randomly selected labeled data are employed to train the student model, while pseudo-labels are generated for the unlabeled data by the teacher model; reliable data and unreliable data are then selected according to the threshold τ ≥ Mean(IoU). In the second stage, the labeled data and the reliable data are fed into the student model for training at the same time. At this point, the unreliable data generated in the first stage is released back into the unlabeled data, pseudo-labels are generated over the full unlabeled data, and the selection process for reliable data is repeated. Note that our Mean Iteration iterates four times and performs weak-strong data augmentation on the data in each iteration. In the third stage, all the reliable data, unreliable data, and labeled data are fed into the model for training, and the final detection model is obtained.
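The three-stage procedure above can be sketched in a few lines. This is a toy sketch, not the paper's implementation: `train_ciso`, `mean_iou_split`, and the random IoU scores are hypothetical stand-ins for real student training and teacher inference.

```python
import random

def mean_iou_split(pseudo):
    # Threshold tau = mean IoU of all pseudo-labels; split reliable/unreliable
    tau = sum(iou for _, iou in pseudo) / len(pseudo)
    reliable = [p for p in pseudo if p[1] >= tau]
    unreliable = [p for p in pseudo if p[1] < tau]
    return reliable, unreliable

def train_ciso(labeled, unlabeled, rounds=4, rng=random):
    # Stages 1-2: Mean Iteration, repeated `rounds` (here 4) times. Pseudo-labels
    # are regenerated over the FULL unlabeled pool each round; unreliable data is
    # released back, so the pool never shrinks.
    for _ in range(rounds):
        pseudo = [(u, rng.random()) for u in unlabeled]  # stand-in teacher scores
        reliable, unreliable = mean_iou_split(pseudo)
        assert len(reliable) + len(unreliable) == len(unlabeled)
        # (a real implementation would train the student on labeled + reliable here)
    # Stage 3: final training on all labeled, reliable, and unreliable data
    return labeled + unlabeled

random.seed(0)
print(train_ciso([1, 2], [3, 4, 5, 6]))  # [1, 2, 3, 4, 5, 6]
```

The key structural point the sketch captures is that every unlabeled sample reaches the final stage, regardless of how it was classified in earlier iterations.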

CISO: Co-iteration SSL for object detection
Pseudo labeling A plethora of experiments have demonstrated that the efficient use of pseudo-labeled data can improve the accuracy of algorithms [3,22,25], which motivates us to enhance model performance by proposing co-iteration semi-supervised learning based on knowledge distillation. This differs from both the classical STAC [40] and Instant-Teaching [57]. STAC pioneered the application of SSL to visual object detection by conducting self-training with pseudo-labels and augmenting the data with consistency regularization. This method requires training the teacher model in advance and then training the student model. In contrast, our CISO achieves end-to-end transfer of parameters between models by using knowledge distillation to complete semi-supervised learning. Moreover, while Instant-Teaching is also end-to-end and our CISO inherits its self-training method, CISO retains all the unlabeled data instead of removing selected unlabeled data (i.e., pseudo-labeled data with high confidence). Furthermore, we propose Mean Iteration, in which the threshold τ is continuously updated to enhance pseudo-label utilization and model performance.
To describe CISO in detail: in each iteration we train the model while simultaneously generating pseudo-labels for the unlabeled data, using both pseudo-labeled data and a small amount of labeled data. Specifically, within each data batch, the labeled and unlabeled data are randomly sampled according to a set ratio, usually 1:10. We employ two models during the training process, namely the teacher model and the student model, for knowledge distillation. The teacher model is responsible for generating pseudo-labels for the unlabeled data, while the student model is responsible for training. Notably, the teacher model is updated from the student model with an Exponential Moving Average (EMA). This end-to-end approach eliminates the need for complex multi-stage training schemes.
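As a concrete sketch of the EMA update (the decay values below are illustrative, not taken from the paper), the teacher weights can be refreshed from the student's parameter by parameter:

```python
def ema_update(teacher, student, decay=0.999):
    """Teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy parameter dictionaries standing in for real model state dicts
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # {'w': 0.5, 'b': 0.5}
```

Because the teacher is a slowly moving average rather than a separately trained network, pseudo-label generation and student training proceed end-to-end in a single run.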
CISO also implements Mean Iteration, which facilitates mutual reinforcement between the pseudo-labels and the detection training process, rendering the training results increasingly effective. The details of Mean Iteration are described later. Finally, all data, both labeled and unlabeled, are combined in the network to train the model and obtain the final detection model. Furthermore, for comparison with STAC and Instant-Teaching, we perform weak-strong data augmentation on the unlabeled data. In this approach, the weakly augmented data are passed through the initial model to obtain the corresponding prediction scores. The pseudo-labels of the corresponding data are obtained according to a threshold τ, while the strongly augmented data are then passed through the model to obtain prediction scores and compute the loss against the pseudo-labels.

Fig. 1 The proposed semi-supervised object detection framework CISO. We use the teacher model in knowledge distillation to generate pseudo-labels for the unlabeled data and run training iterations with the student model. We only select pseudo-labels with a score greater than or equal to Mean(IoU). During the training period, the number of Mean Iterations was 4. We conducted weak-strong data augmentation on the given data.
Overall, we train the model with the same loss functions as STAC [40] and Instant-Teaching [57], namely the consistency regularization loss and the cross-entropy loss. The supervised loss consists of a classification loss L_ce and a bounding box regression loss L_1, as shown in Eq. 1:

L_s = (1/n) Σ_s Σ_i [L_ce(P(c_i), G(c_i)) + L_1(P(r_i), G(r_i))]   (1)
where s is the index of the labeled image, i is the index of the anchor in the image, n is the total number of generated bounding boxes, P(c_i) is the predicted probability of anchor i being an object in image X, and G(c_i) is the label of anchor i. Then, P(r_i) is the predicted bounding box coordinates, and G(r_i) is the actual labeled coordinates.
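Under these definitions, the supervised loss can be sketched as follows. The binary objectness term and plain absolute-error regression are simplifications of a real detector head, shown only to make the per-anchor structure of Eq. 1 concrete:

```python
import math

def supervised_loss(anchors):
    """anchors: list of (P_ci, G_ci, P_ri, G_ri), where P_ci is the predicted
    object probability, G_ci a 0/1 label, and P_ri/G_ri 4-tuples of box coords."""
    n = len(anchors)
    total = 0.0
    for p_c, g_c, p_r, g_r in anchors:
        l_ce = -math.log(p_c) if g_c else -math.log(1.0 - p_c)  # classification
        l_1 = sum(abs(a - b) for a, b in zip(p_r, g_r))          # box regression
        total += l_ce + l_1
    return total / n
```

For a single anchor with a perfectly regressed box and P(c_i) = 0.5, the loss reduces to the cross-entropy term ln 2 ≈ 0.693.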
For the unsupervised loss, the predicted probability distribution and box coordinates obtained by the model on a small batch of weakly augmented unlabeled data are first calculated using Eq. 2, and the pseudo-labels are converted into hard labels as the final labels by Eq. 3.
Thus, the unsupervised loss function is written as Eq. 4:

L_u = (1/m) Σ_u Σ_i 1(max c_i^u ≥ τ) [L_ce(P(c_i^u), Ĝ(c_i^u)) + L_1(P(r_i^u), Ĝ(r_i^u))]   (4)

where u is the index of the unlabeled image, m is the number of unlabeled images, Ĝ(c_i^u) and Ĝ(r_i^u) are the pseudo-labels generated by the model itself, max c_i^u denotes the maximum prediction value, and τ is the confidence level.
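A matching sketch of Eq. 4, in which pseudo-boxes whose maximum prediction value falls below the confidence threshold τ are masked out of the loss. The tuple layout is a hypothetical convenience, not the paper's data structure:

```python
import math

def unsupervised_loss(pseudo, tau):
    """pseudo: list of (max_score, P_ci, hard_label, P_ri, G_hat_ri).
    Only entries whose maximum prediction value clears tau contribute."""
    kept = [x for x in pseudo if x[0] >= tau]
    if not kept:
        return 0.0
    total = 0.0
    for _, p_c, y, p_r, g_r in kept:
        l_ce = -math.log(p_c) if y else -math.log(1.0 - p_c)  # vs. hard pseudo-label
        l_1 = sum(abs(a - b) for a, b in zip(p_r, g_r))        # vs. pseudo-box
        total += l_ce + l_1
    return total / len(kept)
```

The indicator mask is what makes the choice of τ matter: too high a τ and `kept` shrinks toward empty, which is the scarcity problem the Mean Iteration threshold addresses.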
Combining Eqs. 1 and 4, the final loss function can be written as Eq. 5:

L_total = λ_u L_u + L_s   (5)

where λ_u is the unsupervised loss weight.
Mean iteration CISO uses a portion of the labeled data to train the student model, while the teacher model generates pseudo-labels for the unlabeled data. In this step, we calculate the Intersection over Union (IoU) of all the pseudo-labeled data and then take the average of these IoU values as the threshold for generating pseudo-labels. Taking the mean IoU as the threshold τ, two types of pseudo-labeled data are generated: pseudo-labels with high confidence and pseudo-labels with low confidence. We consider pseudo-labels whose score is greater than or equal to the mean to be reliable labels, and the remaining pseudo-labels to be unreliable labels. Afterwards, the student model is trained a second time using both the labeled data and the reliable data. After this training, the teacher model is again applied to predict on the unlabeled data and generate both reliable and unreliable data. It is worth noting that the pseudo-labeled data are generated anew each time, so the reliable and unreliable data differ with each iteration. To achieve iterative training, we retain all the unlabeled data in each training cycle of the student model, without removing any of the classified unlabeled data from the pool.
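The reliable/unreliable split described above reduces to computing the mean IoU of the current iteration's pseudo-labels and comparing each one against it. A minimal sketch, with hypothetical `(box_id, iou)` pairs standing in for real detections:

```python
def split_by_mean_iou(pseudo):
    """pseudo: list of (box_id, iou). Returns (tau, reliable, unreliable), where
    tau is the mean IoU over ALL pseudo-labels of the current iteration."""
    tau = sum(iou for _, iou in pseudo) / len(pseudo)
    reliable = [p for p in pseudo if p[1] >= tau]
    unreliable = [p for p in pseudo if p[1] < tau]
    return tau, reliable, unreliable

tau, rel, unrel = split_by_mean_iou([("a", 0.25), ("b", 0.25), ("c", 1.0)])
print(tau)                    # 0.5
print([b for b, _ in rel])    # ['c']
print([b for b, _ in unrel])  # ['a', 'b']
```

Since the pseudo-labels change every iteration, so does tau, which is exactly the "ever-changing threshold" behavior: the cutoff adapts to the current quality of the teacher's predictions instead of staying fixed at, say, 0.9.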
The proposed approach allows the threshold τ to be continuously updated from one iteration to the next. Previous semi-supervised learning methods tend to adopt pseudo-labeled data with a fixed high threshold τ (e.g., τ = 0.9), which leads to data imbalance. Through collaborative iterations, our CISO makes the best use of the pseudo-labeled data while ensuring its accuracy. We conducted only four iterations in our experiments: upon conducting a fifth iteration, there were no additional variations in what the model learned, as we describe in detail in the ablation study. The results show that our method improves model performance.

Weak-strong data augmentation
The SSL method using consistency regularization is closely related to data augmentation, which enables the model to gain much information from the pseudo-labeled data and has a positive impact. For weak augmentation, we conducted cropping, rotating, flipping, and translation to improve the quality of the labeled data during pre-training when the quality of the pseudo-labeled data was low. For strong augmentation, we harnessed Cutmix [56] for consistency learning on the unlabeled data. Cutmix was chosen because it can apply both hard and soft fusion to two images, allowing the information from the entire image to be utilized without the dataset changing after image mixing. Furthermore, Cutmix does not lose region information, as Cutout does, which would affect training efficiency, nor does it introduce pseudo-pixel information, as Mixup does. By utilizing both weak and strong data augmentation, we increase the amount of data and noise, improve the robustness and generalization ability of the model, and avoid overfitting. Figure 2 illustrates the different strong and weak data augmentation strategies.
Specifically, as shown in the Cutmix image section of Fig. 2, two images are randomly selected and combined to generate a new training sample. Given unlabeled data U_i with two images U_1 = (X_1, Y_1) and U_2 = (X_2, Y_2), the new sample is N = (X_n, Y_n), obtained by a regional dropout on the U_1 sample whose dropped region is filled with the corresponding region of the U_2 sample:

Y_n = λ Y_1 + (1 − λ) Y_2   (6)

X_n = M ⊙ X_1 + (1 − M) ⊙ X_2   (7)

where X is the image sample and Y is the image label; λ is the ratio of the combined regions of images U_1 and U_2 and, as in Cutmix, is set in the range (0, 1); M is the binary mask indicating where images U_1 and U_2 are extracted; and 1 denotes the matrix whose elements are all set to 1. Finally, element-wise multiplication ⊙ is utilized in Eq. 7.
Afterwards, Eq. 8 shows how the extracted mask region is calculated. We use the same random sampling as Cutmix and define the coordinates of the mask region as C = (r_x, r_y, r_w, r_h):

r_x ~ Unif(0, W),  r_y ~ Unif(0, H),  r_w = W√(1 − λ),  r_h = H√(1 − λ)   (8)

where W is the width of image U_i, H is the height of image U_i, and r_x and r_y are sampled uniformly from the ranges (0, W) and (0, H), respectively.
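Under the Cutmix convention that the text says it follows, the mask rectangle and the label mix can be sketched as below. The rectangle size uses the standard Cutmix scaling √(1 − λ), so that the cut area fraction is 1 − λ; the clipping to image bounds is an implementation detail assumed here:

```python
import math
import random

def cutmix_region(W, H, lam, rng=random):
    """Sample the mask rectangle C = (r_x, r_y, r_w, r_h): centre uniform in
    the image, side lengths scaled by sqrt(1 - lam) (Cutmix convention)."""
    r_x, r_y = rng.uniform(0, W), rng.uniform(0, H)
    r_w, r_h = W * math.sqrt(1.0 - lam), H * math.sqrt(1.0 - lam)
    # clip the rectangle to the image bounds
    x1, y1 = max(0.0, r_x - r_w / 2), max(0.0, r_y - r_h / 2)
    x2, y2 = min(float(W), r_x + r_w / 2), min(float(H), r_y + r_h / 2)
    return x1, y1, x2, y2

def mix_labels(y1, y2, lam):
    """Y_n = lam * Y_1 + (1 - lam) * Y_2 (Eq. 6)."""
    return lam * y1 + (1.0 - lam) * y2

print(mix_labels(1.0, 0.0, 0.75))  # 0.75
```

With λ = 0.75, for example, the cut rectangle covers a quarter of the image area and the labels are mixed 3:1 in favor of U_1.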

Datasets
We propose the semi-supervised visual object detection framework CISO and evaluate its performance on the large-scale MS-COCO [23] and PASCAL VOC [8] datasets. MS-COCO is a dataset for visual object detection, segmentation, and other scenarios. It has a total of 330K images, of which over 200K are labeled, spanning 80 object classes and 91 stuff categories. We adopt the same experimental protocol as STAC [40] and Instant-Teaching [60]: we randomly select 1%, 5%, and 10% of the data as labeled training data, and the rest of the image samples are employed as unlabeled data. Our mAP is reported over the 80 object classes. We further select VOC07 and VOC12 from the PASCAL VOC dataset as the labeled and unlabeled sets, respectively.

Implementation details
We applied the CISO framework to the Swin Transformer. In this article, we use the hyperparameters λ_u, λ_s, and τ. The two loss weights λ_u and λ_s are set to 1.0 and 1.0, respectively, while τ is dynamic, i.e., τ ≥ Mean(IoU). All network weights are initialized from the ImageNet pre-trained model. We selected the 1%, 5%, and 10% MS-COCO protocols, and the experiments were performed using a quick learning schedule. Furthermore, our training parameters were kept consistent with STAC and Instant-Teaching, as detailed in Table 1.
Although we adopted the Swin Transformer as the feature extractor, we used Faster R-CNN as the detector for a fair comparison with the experimental results of other models. In addition, we conducted an experiment using ResNet-50, the same backbone network as the other models, to verify the validity of our model.

Results
In the last two years, semi-supervised visual object detection methods have gradually gained attention. We compare our method with other state-of-the-art semi-supervised object detection methods and report the mAP and AP values for each protocol; the comparison results are shown in Tables 2 and 3. Across the experimental protocols, our proposed CISO outperforms all other SSOD methods and achieves state-of-the-art results, which shows that the collaborative iteration and mean-thresholding strategies significantly improve the performance of semi-supervised visual object detection.
Specifically, in Table 2, under the 1% protocol our CISO reaches an mAP of 22.00, an improvement of up to 1.54 mAP; under the 5% protocol, CISO raises the mAP of the Soft Teacher [50] method from 30.74 to 30.90, an improvement of 0.16 mAP; under the 10% protocol, CISO improves on Soft Teacher's result from 34.04 to 36.20, a gain of 2.16 mAP. Finally, compared with the recent semi-supervised learning baseline LabelMatch [7], our mAP is 0.71 higher under the 10% protocol. Even in our experiments using ResNet-50 as the backbone network, CISO still outperforms other models, with mAPs of 21.04, 29.50, and 34.20 for the 1%, 5%, and 10% protocols, respectively. The adoption of the Swin Transformer shows that our method is also applicable to Transformer models with a self-attention mechanism. As depicted in Table 3, when we use the VOC07 and VOC12 datasets as labeled and unlabeled data respectively, our CISO* increases the AP_50:95 from 50.00 to 51.77 compared to Instant-Teaching. We then added the 20 categories of the MS-COCO dataset to the unlabeled data; with this additional unlabeled data, the AP_50:95 of CISO* is 3.03 higher than that of Instant-Teaching. In addition, with the Swin Transformer our method's AP_50:95 is also higher than the other methods, verifying the effectiveness of our model.
We observed that the improvement in mAP became more prominent as the amount of labeled data increased, from a 1.54 mAP gain under the 1% protocol to a 2.16 mAP gain under the 10% protocol. We find that this behavior is related to releasing the pseudo-labeled data back into the unlabeled pool: the release of pseudo-labeled data leads to a higher probability of extracting duplicate pseudo-labeled data in the next iteration. We leave this consideration for later investigation. Moreover, Fig. 3 shows the prediction results.

Ablation study

Analysis of the number of Mean Iterations
As shown in Fig. 1, the Mean Iteration part in the green dashed box iterates 4 times, so we analyze the impact of the number of Mean Iterations in this section. We tested the model under the 10% MS-COCO protocol, with the remaining 90% as unlabeled data. The experimental results are shown in Table 4: six experiments were conducted with the number of iterations set to 1, 2, 3, 4, 5, and 6, respectively. As the number of iterations grows from 1 to 6, the performance of our model improves progressively. However, starting from iteration 5, the performance of the model levels off; by the 6th iteration, the mAP improves by only 0.06. Therefore, the model's performance and efficiency remain optimal when the number of iterations is 4.

Strong data augmentation
Since data augmentation strategies affect model performance in semi-supervised visual object detection, we use weak-strong data augmentation strategies in CISO, and the impact of strong data augmentation on model performance is the more significant. For a fair comparison, we adopted the Cutmix strategy while retaining the Color + Cutout strategy. In Table 5, we summarize the mAP values under the different strong data augmentation strategies. Using only the Color + Cutout and Geometric strategies, the mAP of our method does not improve much, only 1.26. The model performance is further improved by the Cutmix strategy, with an mAP gain of 0.50 over Mixup and Mosaic. This validates our conjecture that the Cutmix strategy improves pseudo-label quality by not adding pseudo-pixel information to the data. CISO obtains its highest mAP of 29.70 with the Cutmix data augmentation method. The analysis suggests that Cutmix can improve the performance of SSOD. The tests in this section are based on the 5% MS-COCO protocol.

Analysis
The confidence threshold τ is a significant coefficient in semi-supervised object detection, and its setting directly affects the performance of the model. Whereas other SSOD methods have used a constant τ, we set τ to change dynamically and obtain pseudo-labels according to the criterion that the score is greater than or equal to the mean value. Since the reliable and unreliable data generated after each iteration differ, the mean value taken each time is dynamic (evaluated using the 10% MS-COCO protocol).
We see from Table 6 that the highest model performance is achieved when τ is set to the mean, with an mAP of 36.20. Moreover, the mAP of the model keeps decreasing as τ decreases. This confirms our hypothesis that the quality of the pseudo-labels improves when τ is dynamic. Whether there is a more suitable dynamic τ than the mean value for SSOD is a subject of our future research.
Our study investigates the impact of the balance coefficient λ_u on the model's performance by incorporating it into the loss function. In this section, we conduct testing using the 10% MS-COCO protocol. We set τ to the dynamic mean and test the model with different values of λ_u, specifically 0.25, 0.50, 1.00, 2.00, 3.00, and 4.00. In addition to the mean τ, we also propose Mean Iteration to improve the quality of the pseudo-labels by using the unlabeled data as much as possible. This is performed based on the dynamic mean and focuses on releasing the pseudo-labels extracted in each iteration back into the unlabeled data. As shown in Table 8, the mAP without Mean Iteration is 33.10, which is 3.10 lower. Furthermore, Fig. 4 visualizes the pseudo-labels of the unlabeled data, generated with and without the Mean Iteration strategy. We see that the Mean Iteration strategy is effective in generating more accurate pseudo-labels, which in turn improves model performance. In this section, we again test with the 10% MS-COCO protocol.
Finally, an analysis of the size of the unlabeled data is also essential. We therefore evaluated the 5% and 10% MS-COCO protocols, with the amount of unlabeled data set to 1, 2, 4, and 8 times that of the labeled data. Table 9 shows the comparison of mAP values under these varying scales of unlabeled data. Our method outperforms STAC and Instant-Teaching, which indicates that CISO can efficiently utilize pseudo-labeled data.

Conclusion
Our research presents a novel semi-supervised object detection (SSOD) learning strategy, CISO, which employs knowledge distillation and weak-strong data augmentation on unlabeled data. In addition, it makes full use of unlabeled data for iterative training. To tackle model overfitting caused by the inability to update pseudo-labels, we introduce the Mean Iteration scheme. Our work effectively leverages unlabeled data to enhance model performance. While we evaluate CISO on the Swin Transformer with a self-attention mechanism, our approach can be applied to other detectors as well. We conduct extensive experiments on the MS-COCO and PASCAL VOC datasets, and our proposed method demonstrates impressive performance, surpassing other state-of-the-art methods with higher mAP values. Currently, our work does not address the selection of training samples and merely selects training data randomly from the dataset. However, in practical applications, labeled and unlabeled data may not adhere to the independent and identically distributed assumption, since unlabeled data may originate from scenarios different from those of the labeled data. Therefore, our future work will focus on improving the performance of the SSOD model by exploring training sample selection methods that take distribution differences into account.

Fig. 2
Fig. 2 Visualization of the weak and strong data augmentation strategies. The first two images are the original image and the strong data augmentation Cutmix. The remaining ones are the weak data augmentations, from top to bottom: flipping, rotating, translating/shifting, and cropping

Fig. 3
Fig. 3 The prediction results of our proposed framework

Fig. 4
Fig. 4 The predicted pseudo-labels. The top two images and the bottom two images were obtained from training without and with Mean Iteration, respectively

Table 2
Comparison of mAP results of different semi-supervised object detection methods on the MS-COCO dataset. Ours (CISO*) indicates the implementation using ResNet-50 as the backbone network; Ours (CISO) indicates the implementation using the Swin Transformer as the backbone network

Table 5
Comparison of mAP values of CISO with different strong data augmentations. For a fair comparison, we keep the Color + Cutout strategy

The results on the balance coefficient, presented in Table 7, demonstrate that the model performs best when λ_u is set to 1.0. When λ_u = 2.0, the performance decreases, but the mAP is 35.80, only 0.40 lower than 36.20. Although model performance also decreases for the other values of λ_u, the mAP drops most at λ_u = 0.25, by 5 mAP. We observe that our proposed framework is relatively robust to λ_u.

Table 7
Comparison of mAP values with various values of the balance coefficient λ_u

Table 9
Comparison of mAP values with various scales of unlabeled data