Abstract
Semi-supervised learning (SSL) can utilize a large amount of unlabeled data for self-training and continuous evolution with only a few annotations. This feature makes SSL a potential candidate for dealing with data from changing, real-time environments, where deep-learning models need to adapt to evolving and non-stationary (non-i.i.d.) data streams from the real world, i.e., online evolutive scenarios. However, state-of-the-art SSL methods often rely on complex model designs and may suffer performance degradation in a generalized, open environment. In an edge computing setup, e.g., as is typical in modern Internet of Things (IoT) applications, a multi-agent SSL architecture can help resolve generalization problems by sharing knowledge between models. In this paper, we introduce Mutual Match (MM), an online-evolutive SSL algorithm that integrates mutual interactive learning and soft-supervision consistency regularization, as well as unsupervised sample mining. By leveraging extra knowledge in the training process and the interactive collaboration between models, MM surpasses multiple top SSL algorithms in accuracy and convergence efficiency under the same online-evolutive experiment setup. MM reduces the complexity of model design and follows a unified and easily extensible pipeline, which can benefit tasks with insufficient labeled data and frequently changing data distributions.
1 Introduction
Semi-supervised learning (SSL) [1, 2] algorithms based on deep neural networks have achieved excellent results on several challenging benchmark tasks. These tasks mainly study how to use additional identically distributed unlabeled data for model training when labeled data is limited. Most state-of-the-art deep learning applications rely on fully supervised methods to provide users with discriminative results. However, these methods generally require enormous amounts of labeled data and computing power [3, 4], which results in poor scalability and makes them unsuitable for tasks that require adaptive model adjustment. Compared with fully supervised methods, SSL models can be trained autonomously with a small amount of annotated data while achieving competitive or superior results, which makes them potential candidates for adapting to evolving and non-stationary data streams.
With the development of edge intelligence [5], the number of edge devices with independent communication and computing capabilities is increasing, making the interaction and training of edge models possible. However, in real-world situations, the massive unlabeled data acquired in real time by edge devices is usually non-stationary, with a time-varying distribution. To maintain generalization ability and provide customized recognition services, edge models should continuously perform efficient model optimization based on changing perception scenarios in which the data stream is non-i.i.d. with respect to the specific task. Specifically, the online evolutive learning (OEL) system [6] describes such machine learning scenarios, where edge models can be automatically trained online in a real-time, changing environment without a unified cloud model.
Although an SSL-based edge model suits the OEL system, current state-of-the-art SSL methods do not fully consider the problem of learning from non-i.i.d. data streams. Meanwhile, existing approaches often use pseudo-labeling and consistency regularization as crucial components and integrate more complex data processing and discrimination techniques for better performance on different tasks. This makes SSL models hard to deploy widely, and they may suffer from performance degradation in a perceptual environment where the data distribution changes frequently.
In this paper, we investigate online evolutive SSL mechanisms that can efficiently utilize information communication among edge models and the extra knowledge in the training process, in order to ensure solid generalization ability when facing non-i.i.d. unlabeled data streams. Particularly, we introduce Mutual Match (MM), an online-evolutive SSL algorithm that integrates mutual interactive learning and soft-labeled augmentation consistency regularization, as well as further unsupervised sample mining to achieve the above goal. Through the interactive collaboration among edge models, MM surpasses multiple top semi-supervised learning algorithms such as MixMatch [7] and FixMatch [8] in terms of accuracy and efficiency in the online-evolutive experiment setup. The adaptability of MM to changing data distribution and the influence of each module of MM on the overall performance are analyzed.
Our main contributions can be summarized as follows:
1) We analyze the existing problems in the edge intelligence environment and the need of edge models for continuous learning, and propose the OEL system, which is more suitable for modern IoT applications.
2) To maintain model generalization ability under the OEL setting, we introduce MM, which consists of three stages: a) soft-labeled augmentation consistency regularization for utilizing unlabeled data, b) extraction of extra information through mutual interactive loss computation, and c) expansion of the labeled dataset through unsupervised sample mining. MM adapts better to the changing data distribution in OEL systems.
2 Related work
Our main work lies in the realization of online-evolutive SSL in a multi-model interactive environment, where a small number of annotations are employed in continuous self-evolution. The related work includes semi-supervised learning and multi-model interactive learning.
2.1 Semi-supervised learning
The more successful SSL algorithms mainly use techniques such as pseudo-labeling [9], consistency regularization [10], and unsupervised feature representation learning [11]. By combining more data fusion and discrimination techniques, the performance of many SSL methods has surpassed that of fully supervised learning. Pseudo label [12] uses a confidence-based threshold to generate hard labels on unlabeled data for model training and gradually expands the labeled dataset with the pseudo-labeled data. Many other methods [13,14,15] adopt pseudo-labeling as a fundamental step of SSL and add constraints or pseudo-label enhancement techniques to achieve better performance. Consistency regularization [16] is a training constraint under which the model should produce consistent predictions for different random variants of the same input; it is widely used together with pseudo-labeling in the SSL field. Early consistency regularization mostly used the L2 distance between the predictions for different input variants as a loss to learn useful information from the unlabeled data [15]. Among recent studies [17, 18], MixMatch [7] uses unlabeled data augmentation and pseudo-label fusion, combined with low-entropy pseudo-label processing [19, 20] and data-mixing consistency regularization, to build the SSL model, and has achieved great success. FixMatch [8] applies weak and strong augmentation [21] to unlabeled data for consistency regularization, where high-confidence predictions on weakly augmented data are used as pseudo-labels for the strongly augmented data. Semco [22] introduces prior knowledge of visual similarity between categories into the semantic predictions of semi-supervised models and designs a collaborative training process based on disagreement. Extracting more knowledge conducive to training from unlabeled data is the key to the success of SSL.
Another effective approach is to use GANs [23, 24] or other unsupervised feature representation learning methods [25,26,27] to assist SSL systems. Chen et al. [28, 29] design a contrastive self-supervised learning method that compares the similarity between data from the same source and data from different sources. It uses a large amount of data and computing power to learn a better feature representation and shows that better SSL performance can be obtained through transfer learning. Kim et al. [30] construct a better pre-trained model by maximizing the mutual information between the outputs of the same source, which also yields a performance gain in SSL.
2.2 Multi-model interactive learning
The information interaction between multiple models can also effectively improve the performance of SSL. Co-training [31] trains two models on different redundant views of the data. Each model generates pseudo-labels for the unlabeled data on which its predictions are most confident and uses them to train the other model. In this way, each model can integrate supervision from different sources [32]. Since fully redundant, conditionally independent views are difficult to obtain in most machine learning tasks, tri-training [33] proposes a collaborative training method that does not require different views. It uses bootstrap [34] sampling to obtain different labeled datasets for training multiple models and then uses the votes of the models on the unlabeled data as pseudo-labels for consistent training. Tri-net [35] extends multi-model collaborative training to deep learning: it designs three models with shared parameters and applies output smearing [36] to the training labels of each model to increase the divergence between the models. When one model is trained with unlabeled data, the pseudo-labels can be determined by fusing the predictions of the other, relatively redundant models. Starting from the principle of knowledge distillation [37], mutual learning [38] shows that effective knowledge interaction between different models can help each model improve jointly. It trains models to minimize the discrepancy between their output distributions and achieves better performance on various tasks. Through multi-directional information transmission, evaluating and fusing the probability distributions predicted by multiple models for the same data enables each model to obtain more helpful information [39].
In summary, most leading semi-supervised learning methods [40, 41] still focus on improving performance on specific datasets, and data distribution issues arise when they face real applications. An SSL model can improve continuously with limited labeled data, but the impact on its performance when the unlabeled data is non-i.i.d. relative to the labeled data still needs extensive research. Multi-model interactive learning methods meet the requirements of OEL, but existing methods focus only on ensemble SSL or on introducing regularization into the training of each model, which ignores the potential of interactive learning in an open learning scenario. With the rapid development of edge intelligence [5, 42], combining model interactive learning with SSL can contribute to realizing an OEL system, which is a desirable way to address distribution problems and is attractive for modern IoT applications.
3 Mutual match
In this section, we first present the problem definition and the overall framework of MM, and then introduce the specific methods and roles of each key module. Specifically, we use different models to represent learnable edge devices, and a noised dataset is used to simulate an unstable data distribution environment.
3.1 Problem definition
As shown in Fig. 1, in an OEL scenario, multiple edge models/agents in a local area can obtain perception data in real-time, and combine a small amount of labeled data for SSL model training. Through knowledge transfer and interactive learning between models, each model will better use a large amount of non-i.i.d. unlabeled data to achieve continuous model adaptation and performance improvement.
Given a classification task, let \(\mathbbm {X}\) and \(\mathbbm {U}\) be the labeled and unlabeled datasets, respectively. Let \(\mathcal {X}=\left (x_{i},y_{i}\right ),i\in \left (1,\ldots ,B\right )\) be a batch of B labeled examples sampled from \(\mathbbm {X}\), where \(x_{i}\) are the training examples and \(y_{i}\) are the one-hot labels. Let \(\mathcal {U}=u_{i},i\in \left (1,\ldots ,\mu B\right )\) be a batch of μB unlabeled examples sampled from \(\mathbbm {U}\), where μ is a hyperparameter that scales the proportion of unlabeled to labeled training examples in each iteration. Note that \(\mathbbm {X}\) is small compared to \(\mathbbm {U}\), and that part of \(\mathbbm {U}\) is non-i.i.d. with respect to \(\mathbbm {X}\). A labeled batch and an unlabeled batch are sent to the model at each training step, and each sampled unlabeled batch is used only once, forming an online batch learning process. Additionally, we denote by \(P_{{\mathscr{M}}_{1}}\left (y\middle | x\right )\) the class distribution over y predicted by model \({\mathscr{M}}_{1}\) for input x; this general notation is used to express prediction results throughout this article.
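The online batch process described above can be sketched as a generator that pairs B labeled examples with μB unlabeled examples per step and never revisits an unlabeled example. This is a minimal illustration only: the function name `online_batches`, sampling the labeled batch with replacement, and dropping the final partial batch are assumptions, not details given in the text.

```python
import random

def online_batches(labeled, unlabeled, batch_size, mu, seed=0):
    """Yield (labeled_batch, unlabeled_batch) pairs for online batch learning.

    Each unlabeled example is consumed exactly once. Labeled examples are
    re-sampled at every step because the labeled set is small (assumption:
    sampling with replacement).
    """
    rng = random.Random(seed)
    step = mu * batch_size
    # Drop the final partial batch so every unlabeled batch has mu * B items.
    for start in range(0, len(unlabeled) - step + 1, step):
        x_batch = [rng.choice(labeled) for _ in range(batch_size)]
        u_batch = unlabeled[start:start + step]
        yield x_batch, u_batch

# Toy data: 40 labeled pairs and 1000 unlabeled items.
labeled = [(f"x{i}", i % 10) for i in range(40)]
unlabeled = [f"u{i}" for i in range(1000)]
batches = list(online_batches(labeled, unlabeled, batch_size=8, mu=7))
```

With B = 8 and μ = 7, each step consumes 56 unlabeled items, so the toy stream of 1000 items yields 17 full steps.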
3.2 Overall framework
Because the unlabeled data stream is non-i.i.d., in contrast to the usual SSL setting, we propose an SSL framework for knowledge sharing and continuous learning between models in order to stabilize model generalization and achieve better performance in OEL settings.
The main work of MM includes generating soft supervision information in the consistency regularization process to obtain more category-related information, introducing a dual-model mutual interactive learning algorithm to realize the information exchange between models, and using unsupervised sample mining during the training process to improve the utilization of high-confidence samples.
The overall process of our method is shown in Fig. 2. MM is an alternating training process with two calculation paths. Along the different paths, the predictions for labeled and unlabeled data under different augmentation methods are used to calculate the supervised and unsupervised losses, respectively. Specifically, the unsupervised loss is calculated using soft supervision consistency regularization. These predictions also participate in the calculation of the interactive loss: when one model calculates its losses, the other model predicts on the same input, and its result serves as the target of the interaction KL loss. At the end of each step, unsupervised example mining is performed to expand the labeled dataset.
We set up three types of random image augmentation for MM: weak, medium, and strong, represented by \(\mathcal {A}_{w},\mathcal {A}_{m},\mathcal {A}_{s}\), respectively. \(\mathcal {A}_{m}\) is used in supervised learning for labeled data, while \(\mathcal {A}_{w}\) and \(\mathcal {A}_{s}\) are used in the unsupervised part of MM for consistency regularization. The predictions under \(\mathcal {A}_{w}\) and \(\mathcal {A}_{m}\) are also used to calculate the interaction between models. The details of each augmentation method are described in Tables 1 and 2.
3.3 Soft supervision in consistency regularization
Consistency regularization has played an essential role in many semi-supervised learning tasks. It is based on the assumption that the predictions of the model for the same input should be consistent. That is, changing an input in ways that do not alter its semantics should not affect the model's output, and the predictions of the model for the input and its augmented version should follow probability distributions that are as close as possible. Under such a constraint, the model can obtain task-related knowledge from unlabeled data.
Many SSL methods that use consistency regularization apply one-hot encoding, sharpening [43], or other label enhancement methods [44] to the filtered prediction results and use the generated artificial labels as supervision for training on augmented versions of the same inputs. These methods achieve better results on specific tasks but lose some extra supervision information during training, which is important in non-stationary training environments.
Different from label enhancement methods, we directly adopt soft supervision in the consistency regularization. Specifically, because MM is an interactive learning process, the other model also generates a prediction for each input. We use a ratio α, sampled from a beta distribution, to fuse the two predictions into a soft label. The combined soft label carries more information about the association with other categories. We expect that some category-related dark knowledge [45] in the unlabeled data can be exploited in this way, while the filter threshold of MM already ensures that the soft labels have low entropy. Moreover, soft supervision may be more beneficial for classification tasks with more categories. In terms of random image augmentation, unlike [8], which applies weak augmentation to both labeled and unlabeled data, we use a medium augmentation for labeled data, which may introduce some noise into the model's judgment on unlabeled data and increase the reliability of the prediction results.
We further describe the formulation of supervised/unsupervised loss, and the role of soft supervision/three different augmentation methods in the loss calculation.
In classification tasks, cross-entropy is generally used to measure the difference between the predicted probability distribution and the target, and serves as a loss function to guide model training. Let \(H\left (p,q\right )=-\sum p\log \left (q\right )\) denote the cross-entropy between probability distributions p and q. Taking the training of model \({\mathscr{M}}_{1}\) as an example, with \(\mathcal {X},\mathcal {U}\) a batch of labeled and unlabeled training examples, the supervised loss \({\mathscr{L}}_{s}\) of \({\mathscr{M}}_{1}\) can be written as (1):
\[{\mathscr{L}}_{s}=\frac{1}{B}\sum_{i=1}^{B}H\left (y_{i},P_{{\mathscr{M}}_{1}}\left (y\middle |\mathcal {A}_{m}\left (x_{i}\right )\right )\right )\qquad (1)\]
where \(\mathcal {A}_{m}\left (x_{i}\right )\) represents the medium image augmentation applied to the labeled example \(x_{i}\), \(P_{{\mathscr{M}}_{1}}\left (y\middle |\mathcal {A}_{m}\left (x_{i}\right )\right )\) is the output probability distribution of \({\mathscr{M}}_{1}\) with \(\mathcal {A}_{m}\left (x_{i}\right )\) as input, and \(y_{i}\) is the one-hot target of \(x_{i}\) in the batch of size B.
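As a small numeric illustration of the supervised term, the sketch below computes the cross-entropy H(p, q) between one-hot targets and predicted distributions and averages it over the batch. The helper names, the toy values, the `eps` guard against log(0), and the batch averaging are assumptions of this sketch; the predictions stand in for model outputs on medium-augmented inputs.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum(p * log(q)) for two discrete distributions."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def supervised_loss(targets, predictions):
    """Mean cross-entropy between one-hot targets and model predictions."""
    total = sum(cross_entropy(y, p) for y, p in zip(targets, predictions))
    return total / len(targets)

# Two labeled examples, three classes.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss_s = supervised_loss(targets, predictions)
```

Because the targets are one-hot, only the probability assigned to the true class contributes, so the loss here reduces to the mean of -log(0.7) and -log(0.8).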
For the unsupervised loss, we compute the combined prediction \(r_{i}\) as the α-weighted average of \(r_{i_{1}}\) and \(r_{i_{2}}\), where \(r_{i_{1}}\) and \(r_{i_{2}}\) denote the probability distributions predicted by models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\), respectively, with the weakly augmented unlabeled example \(u_{i}\) as input, and α is a scalar in [0,1] representing the fusion ratio sampled from the beta distribution. The combined prediction \(r_{i}\) is then used for soft supervision consistency regularization.
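The fusion step can be illustrated with a short sketch. It assumes that α weights the first model's prediction and 1 - α the second's, and it uses placeholder Beta parameters of 0.75; neither choice is specified in the text, and `fuse_predictions` is a hypothetical helper.

```python
import random

def fuse_predictions(r1, r2, alpha):
    """Weighted average of two class distributions: alpha*r1 + (1-alpha)*r2."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(r1, r2)]

rng = random.Random(0)
# Placeholder Beta(0.75, 0.75); the actual parameters are not given here.
alpha = rng.betavariate(0.75, 0.75)

r1 = [0.90, 0.07, 0.03]  # prediction of model 1 on the weakly augmented input
r2 = [0.80, 0.15, 0.05]  # prediction of model 2 on the same input
soft_label = fuse_predictions(r1, r2, alpha)
```

Since both inputs are probability distributions and the weights sum to one, the fused soft label is again a valid distribution, and each of its entries lies between the two models' values for that class.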
Then, our consistency regularization process can be interpreted as (6) and (7):
where \(\mathcal {T}\) is a threshold used to filter out low-confidence prediction results. If the confidence of a prediction result \(r_{i}\) is below \(\mathcal {T}\), it is excluded from the calculation of the unsupervised loss. We use an argwhere-like function to fetch the filtered result \(r_{i}^{(uns)}\), without one-hot encoding, as the soft label; the corresponding input \(u_{i}\) is then used to calculate the unsupervised loss:
where \(\mathcal {A}_{s}\left (u_{i}\right )\) denotes the strong image augmentation applied to the filtered unlabeled examples.
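The filtering and the soft-label consistency loss can be sketched as follows. Several details are assumptions made for illustration: confidence is taken to be the largest class probability of the fused prediction, the loss is normalized by the full unlabeled batch size, and the function names and toy values are hypothetical.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum(p * log(q)) for two discrete distributions."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def unsupervised_loss(soft_labels, strong_predictions, threshold):
    """Soft-label consistency loss over confident examples only.

    soft_labels: fused predictions on weakly augmented inputs.
    strong_predictions: model outputs on strongly augmented inputs.
    Returns (loss, kept_indices).
    """
    # argwhere-like step: indices whose max class probability passes the filter.
    kept = [i for i, r in enumerate(soft_labels) if max(r) >= threshold]
    if not kept:
        return 0.0, kept
    total = sum(cross_entropy(soft_labels[i], strong_predictions[i]) for i in kept)
    # Assumption: normalize by the full unlabeled batch size.
    return total / len(soft_labels), kept

soft_labels = [[0.96, 0.03, 0.01], [0.50, 0.30, 0.20]]
strong_predictions = [[0.70, 0.20, 0.10], [0.40, 0.40, 0.20]]
loss_u, kept = unsupervised_loss(soft_labels, strong_predictions, threshold=0.95)
```

Only the first example passes the filter in this toy batch. Its soft label is used as the target without one-hot encoding, so the small probabilities on the other classes still contribute to the loss.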
Through the soft supervision in consistency regularization and the three augmentation methods for supervised/unsupervised loss calculation, each model in MM will fully use the category-related information and continuously improve judgment ability in the training process.
3.4 Mutual interactive training
Many studies have shown that the SSL method can gradually improve the ability of the model to use the knowledge in the unlabeled data during the continuous learning process. However, in OEL settings where edge devices are distributed in different areas and the perception scene is constantly changing, the training of the SSL models is always affected by non-i.i.d data streams, which may lead to poor model generalization. It will be ideal to effectively use the knowledge sharing between edge models and implement an interactive learning method to improve model performance continuously.
With the above consideration, we introduce a multi-model learning mechanism. Hinton et al. [37] explain the feasibility of quickly building a lightweight model with better performance based on knowledge transfer between models. Zhang et al. [38] use the posterior probabilities of the models to let them teach each other during training. We therefore adopt dual-model information interaction to achieve alternate training and evolution between models.
Following [38], we use the Kullback-Leibler (KL) divergence to measure the difference between the probability distributions predicted by the two models on labeled and unlabeled data, and use it as interactive information to guide the alternate training of the models. Specifically, we calculate a filtered KL loss in the unsupervised part, as described in what follows.
The KL distance from probability distribution p to q can be computed as:
\[D_{KL}\left (p||q\right )=\sum p\log \frac{p}{q}\]
In our case, when calculating the KL distance of the supervised part, we denote by \(r^{\left (m_{1}\right )}\) and \(r^{\left (m_{2}\right )}\) the distributions predicted by models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\), respectively, and calculate \(D_{KL}(r^{\left (m_{2}\right )}||r^{\left (m_{1}\right )})\) as the supervised interaction loss of model \({\mathscr{M}}_{1}\). The two models are trained alternately; the detailed training process is described in Algorithm 1.
When calculating the KL distance of the unsupervised part, let \(r_{i_{2}}=P_{{\mathscr{M}}_{2}}\left (y\middle |\mathcal {A}_{w}\left (u_{i}\right )\right )\) be the probability distribution predicted by model \({\mathscr{M}}_{2}\) with the weakly augmented unlabeled example \(u_{i}\) as input. We choose the filtered prediction \(r_{i_{2}}^{(kl)}\) as the KL loss target and the filtered \(r_{i_{1}}=P_{{\mathscr{M}}_{1}}\left (y\middle |\mathcal {A}_{w}\left (u_{i}\right )\right )\) as the corresponding prediction of model \({\mathscr{M}}_{1}\). Note that, unlike the unsupervised loss, which is computed on the strongly augmented outputs \(\mathcal {A}_{s}\left (u_{i}\right )\), the unsupervised KL loss uses only the weakly augmented outputs \(\mathcal {A}_{w}\left (u_{i}\right )\), and \(r_{i_{1}}^{(kl)}\) is \(r_{i_{1}}\) filtered by the same rule as \(r_{i_{2}}^{(kl)}\) in each step. The unsupervised interaction loss of \({\mathscr{M}}_{1}\) is then \(D_{KL}(r_{i_{2}}^{(kl)}|| r_{i_{1}}^{(kl)} )\).
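A minimal sketch of the interaction term, seen from model 1, is given below. The KL divergence itself is standard. The filter rule is an assumption based on the surrounding description: examples are kept when the peer model's largest class probability is below the threshold, and the loss is averaged over the kept examples. Function names and toy values are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum(p * log(p / q)); terms with p == 0 contribute 0."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def unsupervised_kl_loss(own_preds, peer_preds, threshold):
    """Mean KL(peer || own) over examples the peer is not confident about."""
    # Assumed filter: keep examples whose peer confidence is below the threshold.
    kept = [i for i, r in enumerate(peer_preds) if max(r) < threshold]
    if not kept:
        return 0.0
    total = sum(kl_divergence(peer_preds[i], own_preds[i]) for i in kept)
    return total / len(kept)

own = [[0.60, 0.40], [0.50, 0.50]]    # model 1 on weakly augmented inputs
peer = [[0.99, 0.01], [0.70, 0.30]]   # model 2 on the same inputs (KL target)
loss_kl = unsupervised_kl_loss(own, peer, threshold=0.95)
```

In this toy batch the first example is confident enough to be handled by the consistency loss, so only the second example enters the KL term, where the peer's distribution acts as the target.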
In summary, the overall loss function of Mutual Match can be expressed as:
where \({\mathscr{L}}_{{\mathscr{M}}_{1}}\) and \({\mathscr{L}}_{{\mathscr{M}}_{2}}\) represent the loss functions of models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\), respectively. The two models are trained alternately, each based on part of the outputs of the other. \(\lambda_{u}\) and \(\lambda_{kl}\) are the weights of the unsupervised loss and the KL loss, respectively.
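One way to read the combination of the loss terms is as a weighted sum, sketched below for a single model. The additive form, the single \(\lambda_{kl}\) shared by the supervised and unsupervised KL terms, the default weights, and the helper name `total_loss` are all assumptions made for illustration; only the existence of the two weights is taken from the text.

```python
def total_loss(loss_s, loss_u, kl_sup, kl_uns, lambda_u=1.0, lambda_kl=1.0):
    """Weighted sum of supervised, unsupervised, and interaction (KL) terms."""
    return loss_s + lambda_u * loss_u + lambda_kl * (kl_sup + kl_uns)

# Model 1 uses model 2's predictions as KL targets, and vice versa, so the
# same function is evaluated once per model with the roles swapped.
loss_m1 = total_loss(loss_s=0.50, loss_u=0.20, kl_sup=0.10, kl_uns=0.05,
                     lambda_u=1.0, lambda_kl=0.5)
loss_m2 = total_loss(loss_s=0.45, loss_u=0.25, kl_sup=0.08, kl_uns=0.04,
                     lambda_u=1.0, lambda_kl=0.5)
```

Under this reading, setting `lambda_kl` to zero recovers two independently trained SSL models, which makes the contribution of the interaction term easy to isolate in an ablation.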
We add the KL loss to the training of both the supervised and unsupervised parts so that the models exchange information during alternate training. The unsupervised KL loss enables the model to make effective use of the unlabeled inputs whose prediction confidence falls below the threshold \(\mathcal {T}\), providing additional regularization. We believe that mutual interactive training can accelerate model convergence and improve model generalization in an OEL setting.
3.5 Sample mining for incremental learning
Because we use a soft supervision loss in the consistency regularization part, the training receives more information about the correlation between categories, but a little uncertainty is also added to the interactive training process, which may cause the training error to fluctuate around a local minimum.
As the iterations progress, the predictions of the models for some unlabeled samples gradually become stable, and soft labels with higher confidence show a more certain probability distribution. To make better use of these stably predicted samples and keep the training stable, we convert their soft labels to one-hot labels through sample mining, thereby expanding the labeled dataset and enhancing the discriminative performance of the model with stronger supervision.
Specifically, we build an index counter list \(\mathbbm {I}\) for the unlabeled data, which collects the samples predicted with high confidence in each iteration and counts how often each occurs in \(\mathbbm {I}\). Samples whose occurrence count reaches the indicator value δ are added to the labeled dataset for the calculation of the supervised loss. Depending on the complexity of the training task and the size of the dataset, we usually set δ to a value from 8 to 12. In particular, if the highest category confidence of an unlabeled sample drops before its count in \(\mathbbm {I}\) reaches δ, its count is cleared. The details of the sample mining process are described in Algorithm 2.
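The counting rule can be sketched with a small class. This is not Algorithm 2 itself: the class name `SampleMiner`, the use of a confidence threshold to decide what counts as a "drop", and the choice of the arg-max class as the mined hard label are assumptions made for illustration.

```python
class SampleMiner:
    """Count confident predictions per unlabeled index and mine stable ones."""

    def __init__(self, delta, threshold):
        self.delta = delta          # required number of confident occurrences
        self.threshold = threshold  # assumed confidence cutoff
        self.counts = {}            # index -> current occurrence count
        self.mined = {}             # index -> hard (arg-max) label

    def update(self, index, probs):
        if index in self.mined:
            return
        confidence = max(probs)
        if confidence < self.threshold:
            # Confidence dropped before reaching delta: clear the count.
            self.counts.pop(index, None)
            return
        self.counts[index] = self.counts.get(index, 0) + 1
        if self.counts[index] >= self.delta:
            self.mined[index] = probs.index(confidence)
            del self.counts[index]

miner = SampleMiner(delta=3, threshold=0.95)
for _ in range(3):
    miner.update(7, [0.97, 0.02, 0.01])      # stays confident: gets mined
for probs in ([0.96, 0.04, 0.0], [0.96, 0.04, 0.0], [0.60, 0.40, 0.0],
              [0.96, 0.04, 0.0], [0.96, 0.04, 0.0]):
    miner.update(9, probs)                   # drops once: count restarts
```

Sample 7 reaches δ = 3 and moves to the mined set with label 0, while sample 9 is reset by its low-confidence step and ends with a count of 2. The toy δ is kept small for readability; the text uses values from 8 to 12.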
4 Experiments
This section describes the experimental settings and implementation details, and reports the performance comparison between MM and other methods on different datasets. In the ablation study, we further analyze the influence of each module of MM on the overall performance.
4.1 Implementation details
Previous methods were configured in different experimental environments and aimed at standard datasets. To better evaluate the effect of our method, we compare MM with other algorithms under the same experimental conditions, with an unstable and time-varying data distribution. Our implementation mainly uses Keras [46] with TensorFlow [47] as the backend.
Datasets settings
A general semi-supervised learning algorithm will select part of the labeled data from the dataset as the supervised training set and remove the labels from the remaining data as the unlabeled training set. To simulate a frequently changing perceptual environment with varying data distribution, we need to introduce more non-i.i.d. data to the unlabeled training set so that the model learning process has the same data disturbance with different feature distributions as in the realistic scene. Specifically, we add 20% of differently distributed data to the unlabeled part of each dataset, which is mainly composed of other categories of data and background data.
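The construction of the disturbed unlabeled set can be sketched as below. The text leaves open whether the 20% is measured against the original or the final unlabeled set; this sketch assumes the latter, so that out-of-distribution items make up the given share of the final pool. The function name and the toy data are hypothetical.

```python
import random

def mix_unlabeled(in_dist, out_dist, ood_ratio, seed=0):
    """Return a shuffled unlabeled pool in which `ood_ratio` of items are OOD."""
    rng = random.Random(seed)
    # Number of OOD items so that they form `ood_ratio` of the final pool.
    n_ood = round(len(in_dist) * ood_ratio / (1.0 - ood_ratio))
    if n_ood > len(out_dist):
        raise ValueError("not enough out-of-distribution examples")
    pool = list(in_dist) + rng.sample(list(out_dist), n_ood)
    rng.shuffle(pool)
    return pool

in_dist = [("in", i) for i in range(800)]     # task-relevant unlabeled data
out_dist = [("ood", i) for i in range(500)]   # other categories / background
pool = mix_unlabeled(in_dist, out_dist, ood_ratio=0.2)
```

The same helper can be called with other ratios to vary the proportion of differently distributed data in the stream.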
Model structure
Since Wide ResNet [48] has performed relatively stably in many research experiments, we adopt WRN as the primary model structure and configure different variants for different datasets.
Parameter initialization
We use different parameter initialization methods for the two models used in interactive training. Specifically, we initialize the two models with He normal [49] and LeCun uniform [50], respectively. We apply weight decay with a coefficient of 0.0005 to all models, and no model uses dropout.
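The two initializers differ in both the distribution family and the scale. The sketch below uses the usual definitions, a zero-mean normal with standard deviation sqrt(2 / fan_in) for He normal and a uniform on [-sqrt(3 / fan_in), sqrt(3 / fan_in)] for LeCun uniform. It is a plain-Python illustration rather than the Keras implementation (which, for example, truncates the normal), and the helper names are hypothetical.

```python
import math
import random

def he_normal(fan_in, n, rng):
    """Draw n weights from N(0, sqrt(2 / fan_in)) (untruncated sketch)."""
    std = math.sqrt(2.0 / fan_in)
    return [rng.gauss(0.0, std) for _ in range(n)]

def lecun_uniform(fan_in, n, rng):
    """Draw n weights from U(-limit, limit) with limit = sqrt(3 / fan_in)."""
    limit = math.sqrt(3.0 / fan_in)
    return [rng.uniform(-limit, limit) for _ in range(n)]

rng = random.Random(0)
weights_m1 = he_normal(fan_in=64, n=1000, rng=rng)      # first model
weights_m2 = lecun_uniform(fan_in=64, n=1000, rng=rng)  # second model
```

Starting the two models from differently distributed weights gives them different initial predictions, which is what makes their exchanged outputs informative to each other.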
Optimizer settings
We adopt a standard SGD optimizer with Nesterov momentum β = 0.9 and use cosine learning rate decay [51] with an initial learning rate η = 0.03 for all models.
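A common form of cosine learning rate decay is the half-cosine schedule without restarts, sketched below with the stated initial rate of 0.03. The exact variant used in the experiments is not given in the text, so the formula, the decay to zero at the final step, and the function name are assumptions.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.03):
    """Half-cosine decay from base_lr at step 0 to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total = 1000
schedule = [cosine_lr(k, total) for k in (0, 250, 500, 750, 1000)]
```

The rate starts at 0.03, passes through 0.015 at the midpoint, and approaches zero at the last step, decaying slowly at the beginning and end and fastest in the middle.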
Training data normalization
Other methods normalize the data using the mean and standard deviation of the training set. However, because each batch of training data is randomly augmented with different strengths during training, the data distribution changes slightly. We therefore directly divide the image by 255.0 as the normalization method, which limits the pixel values to [0, 1].
Data augmentation methods
We use the same data augmentation methods, based on the imgaug [52] library, for all experiments; the specific augmentation methods are described in Tables 1 and 2. For weak augmentation, we randomly select 1 to 3 augmentation methods from Table 1 to apply to the input image, and the intensity of each method is also drawn from a random range. For medium augmentation, we randomly select 3 to 6 augmentation methods from Table 1. For strong augmentation, we randomly select 4 to 9 augmentation methods from Table 2, and the order in which the methods are applied is also random.
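The selection rule for the three strengths can be sketched independently of any image library. The operation names below are hypothetical stand-ins for the entries of Tables 1 and 2, and drawing the operations without replacement is an assumption; only the count ranges and the table used for each strength come from the text.

```python
import random

# (min, max) number of operations per augmentation strength, from the text.
AUG_COUNTS = {"weak": (1, 3), "medium": (3, 6), "strong": (4, 9)}

# Hypothetical operation names; the real lists are given in Tables 1 and 2.
TABLE1_OPS = ["flip_lr", "translate", "rotate", "scale", "crop", "brightness"]
TABLE2_OPS = ["shear", "contrast", "blur", "cutout", "solarize", "posterize",
              "sharpen", "invert", "noise"]

def sample_ops(strength, rng):
    """Pick a random number of distinct operations, returned in random order."""
    low, high = AUG_COUNTS[strength]
    pool = TABLE2_OPS if strength == "strong" else TABLE1_OPS
    k = rng.randint(low, high)
    return rng.sample(pool, k)  # sample() already returns a random ordering

rng = random.Random(0)
weak_ops = sample_ops("weak", rng)
medium_ops = sample_ops("medium", rng)
strong_ops = sample_ops("strong", rng)
```

Each call yields a fresh subset, so two views of the same image generally receive different operation sets even at the same strength.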
To avoid differences in the specific configuration of the model under different programming frameworks, all training processes of each method are carried out under our experimental settings. We performed K iterations for each set of experiments, and more specific hyperparameters are described in Table 3. We conducted multiple rounds of experiments on CIFAR-10/100 [53], SVHN [54], and Tiny-imagenet [55] databases using different groups and different amounts of labeled data to evaluate the performance of Mutual Match.
Each model is trained on a single Nvidia RTX 2080 Ti GPU. For each dataset, we also train a fully supervised model as a baseline for comparison with the semi-supervised algorithms. We report the performance comparison of Mutual Match and other methods in detail in Table 4.
Moreover, to illustrate the feasibility of MM in practical applications, we report the simulation times of several representative methods in Table 5. Because of the interactive learning process, the batch training time of our method is higher than that of some algorithms, but the inference time is not affected.
4.2 CIFAR-10 and SVHN
Both the CIFAR-10 and SVHN datasets consist of 10 categories of images with a resolution of 32×32, and we use the same Wide ResNet-28-2 model for training on both. For CIFAR-10, we use 50,000 training images and 10,000 test images. For SVHN, we use 73,257 training images and 26,032 test images. Moreover, we add 20% non-i.i.d. data to the unlabeled part of each dataset to simulate real-life OEL scenarios.
In each experiment, we establish two models for MM and use the unlabeled data to construct a simulated online learning environment, so that the model is in a batch learning process of continuously obtaining training data. We randomly select three different sets of labeled images from the training set and train and evaluate every method three times. We calculate the mean and standard deviation of the accuracy over the three runs and report them in Table 4. The accuracy of MM is better than that of the other methods in all labeled data settings.
Figure 3a shows the training curves of several representative methods on the CIFAR-10 dataset. We treat 28 iterations as one epoch when evaluating the results. Since the evaluation values fluctuate frequently throughout training, we use an exponentially weighted average to smooth the curves. Mutual Match lags only slightly behind MixMatch in the first few epochs; in subsequent training, both its convergence speed and its accuracy are higher than those of the other methods. Figure 3b compares several methods on the SVHN dataset, where MM again converges faster and continues to improve.
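The smoothing of the evaluation curves can be illustrated with an exponentially weighted average. The smoothing factor of 0.9 and the bias correction for the first few points are assumptions of this sketch, since the text does not state the settings used for the figures.

```python
def ewma(values, beta=0.9):
    """Exponentially weighted average with bias correction."""
    smoothed, avg = [], 0.0
    for t, v in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * v
        # Bias correction keeps early points from being pulled toward zero.
        smoothed.append(avg / (1.0 - beta ** t))
    return smoothed

accuracy = [0.50, 0.70, 0.60, 0.80]   # toy per-epoch evaluation accuracy
curve = ewma(accuracy)
```

A larger `beta` gives a smoother curve at the cost of reacting more slowly to real changes in accuracy.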
Since MM is a batch learning process with alternating training, we can easily evaluate how effectively the unlabeled data is used. Fig. 3c shows that, on both datasets, the amount of effective pseudo-labeled data generated by MM gradually increases with the training rounds, which indicates that more unlabeled data can be used effectively. The average accuracy of these pseudo-labels reaches about 97.5% on both datasets, which means that most filtered images can effectively help model evolution.
The number of unlabeled examples input in each iteration of MM is 7×64, and about 70% of them can be used in consistency regularization. A small amount of erroneous supervision information may be one reason for reduced model performance. We can reduce this effect by imposing stricter restrictions on pseudo-labels and iterating for more rounds.
4.3 CIFAR-100 and Tiny-imagenet
CIFAR-100 contains 50,000 training images of size 32×32 from 100 classes and 10,000 testing images. Because it has more categories than CIFAR-10, we use a wider WRN-28-8 as the base model. Tiny-imagenet contains 100,000 training images and 25,000 test images of size 64×64 from 200 classes. As both the image size and the number of categories grow, we use a deeper and wider WRN-34-10 as the base model. We also add 20% non-i.i.d. data to the unlabeled part of each dataset to simulate real-life scenarios.
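One way to build such a contaminated unlabeled set is sketched below. The function name `mix_non_iid` is hypothetical, and the sketch assumes that "20%" is measured relative to the size of the original unlabeled set; the text does not specify whether the proportion refers to the original or the final set, nor where the non-i.i.d. samples come from.

```python
import random


def mix_non_iid(in_dist, out_dist, fraction=0.2, seed=0):
    """Append non-i.i.d. samples to an unlabeled set and shuffle.

    fraction is taken relative to len(in_dist) (an assumption).
    Returns a list of (sample, is_in_distribution) pairs.
    """
    rng = random.Random(seed)
    n_extra = round(fraction * len(in_dist))
    if n_extra > len(out_dist):
        raise ValueError("not enough non-i.i.d. samples to draw from")
    extra = rng.sample(out_dist, n_extra)
    mixed = [(x, True) for x in in_dist] + [(x, False) for x in extra]
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for image identifiers (not real data).
unlabeled = [f"img_{i}" for i in range(100)]
other = [f"ood_{i}" for i in range(50)]
mixed = mix_non_iid(unlabeled, other, fraction=0.2)
print(len(mixed), sum(1 for _, ok in mixed if not ok))
```

Keeping the in-distribution flag alongside each sample makes it possible to check afterwards how many non-i.i.d. samples passed the pseudo-label filter.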
CIFAR-100 and Tiny-imagenet are both more complex multi-class classification tasks, so every method performs worse on them than on the previous datasets. We again use different labeled data "folds" to train each method three times. As can be seen from Table 4, the accuracy of MM is still higher than that of the other methods. Additionally, knowledge sharing between models accelerates convergence.
Meanwhile, we find that category-related information yields larger gains in classification tasks with more categories and plays a positive role in both consistency regularization and interactive learning. This extra information allows our method to achieve a relatively significant performance improvement over other methods on complex tasks.
4.4 Ablation study
Varying the distribution and amount of unlabeled data
OEL systems usually face real-time non-i.i.d. data streams, some of which are irrelevant to the target task. SSL methods are inherently affected by such shifts in the data distribution, because the model predicts class probabilities for every input and uses them for SSL training. Our method reduces this influence through knowledge sharing and joint discrimination. In addition, to evaluate the gains from knowledge sharing between models, we compare the performance of several methods when the amount of unlabeled data is reduced.
We introduce different proportions of non-i.i.d. data into the unlabeled dataset to evaluate the performance of various methods in non-stationary perceptual environments. As can be seen in Fig. 4a, when the unlabeled dataset contains 10% to 50% non-i.i.d. data, the performance of our method remains relatively stable compared with FixMatch and UDA. In Fig. 4b, even on the more complex Tiny-imagenet dataset, MM still maintains reliable performance with 50% non-i.i.d. data, while the other methods degrade significantly.
The performance of SSL methods is also sensitive to the amount of unlabeled data used for training. We design corresponding comparative experiments on several datasets. Fig. 4c shows that, because SVHN is a relatively simple dataset, reducing the unlabeled data (10% to 50%) does not greatly affect the performance of the various methods; MM maintains better accuracy under all unlabeled data ratios. On the more complex CIFAR-100 task, as Fig. 4d shows, MM still outperforms the other methods at lower proportions (20% to 50%) of unlabeled data. These results demonstrate that MM makes better use of knowledge sharing and interactive information in the learning process.
Soft or hard label in consistency regularization
We believe that using soft targets in consistency regularization enables the model to learn more category-related information, especially in a multi-model mutual learning method such as MM. Recall that $H(p,q) = H(p) + D_{\mathrm{KL}}(p\,\|\,q)$, where $H(p)$ is the entropy of the target probability distribution. This formulation shows that using a soft target to calculate the supervision loss is equivalent to calculating the KL divergence between the label and the prediction while adding a low-entropy target regularization, which combines well with the calculation of interactive information between models. Mixing up [56] images and labels, as done in [7], is likewise similar to training with soft targets and has achieved better performance. Fig. 5a shows that the overall accuracy of the model using soft-label consistency regularization is always higher than that of the hard-label version, while its convergence speed is only slightly behind in the early stage. Soft supervision works in more training cases, but it is somewhat unstable in some of them. In general, soft supervision brings about a 1.5% performance improvement to the model.
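The decomposition of cross-entropy into target entropy plus KL divergence can be checked numerically. The sketch below uses arbitrary example distributions (not values from the paper); the function names are chosen for illustration. It also shows the hard-label special case, where the target entropy is zero and the cross-entropy equals the KL divergence.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) with target p and prediction q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


soft_target = [0.7, 0.2, 0.1]   # arbitrary soft target
hard_target = [1.0, 0.0, 0.0]   # one-hot target, entropy is zero
prediction = [0.5, 0.3, 0.2]    # arbitrary model output

for p in (soft_target, hard_target):
    lhs = cross_entropy(p, prediction)
    rhs = entropy(p) + kl_divergence(p, prediction)
    print(abs(lhs - rhs) < 1e-12)
```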
Weak or strong augmented outputs to calculate the unsupervised KL loss in Mutual Match
In MM, we mainly use the predictions on filtered, weakly augmented unlabeled data to calculate the interactive KL loss between models. We believe that the model's predictions on weakly augmented data are more reliable than those on the strongly augmented version. The interaction loss plays a vital role in accelerating model convergence and improving the model's discrimination ability.
We also calculate the unsupervised KL loss using the outputs on strongly augmented data and compare this variant with our final method. Fig. 5b shows that calculating the KL loss on strongly augmented outputs degrades model performance. This may be because predictions on strongly augmented data are less reliable. Moreover, because our model uses soft targets, if the strongly augmented outputs are used to calculate both the unsupervised loss and the unsupervised KL loss, i.e., similar to [38], which calculates the KL loss directly on the predictions without further filtering, then the conflict between the two loss objectives may weaken the final performance.
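A minimal sketch of an interactive KL loss between two models on filtered, weakly augmented predictions is given below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the confidence threshold, the rule that both models must be confident for a sample to be kept, and the KL direction (each model is pulled toward its peer, as in deep mutual learning [38]) are all assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) with a small eps for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mutual_kl_losses(probs_a, probs_b, threshold=0.95):
    """Return (loss_a, loss_b, n_kept) for one weakly augmented batch.

    probs_a, probs_b: per-sample class probabilities from models A and B.
    A sample is kept only if both models are confident (assumed filter).
    loss_a treats B's prediction as the target for A, and vice versa.
    """
    kept = [(pa, pb) for pa, pb in zip(probs_a, probs_b)
            if max(pa) >= threshold and max(pb) >= threshold]
    if not kept:
        return 0.0, 0.0, 0
    loss_a = sum(kl_divergence(pb, pa) for pa, pb in kept) / len(kept)
    loss_b = sum(kl_divergence(pa, pb) for pa, pb in kept) / len(kept)
    return loss_a, loss_b, len(kept)


# Toy predictions on three weakly augmented samples (not real outputs).
a = [[0.96, 0.04], [0.60, 0.40], [0.02, 0.98]]
b = [[0.99, 0.01], [0.97, 0.03], [0.03, 0.97]]
print(mutual_kl_losses(a, b))
```

In a real training loop the target distribution would be detached from the gradient graph, so that each model only updates its own parameters from its own loss term.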
Sample mining
We further study the impact of the sample mining process on model performance. As shown in Fig. 5c, since the model does not generate many qualifying pseudo-labels early in training, the model with sample mining initially performs about the same as the model without it. After dozens of training epochs, the model with sample mining performs better owing to the expanded training set.
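The mechanism by which sample mining expands the training set can be sketched as follows. The selection rule used here (both models agree on the class and both exceed a confidence threshold) is an assumption for illustration, as are the function name `mine_samples` and the threshold value; the exact mining conditions are defined elsewhere in the paper.

```python
def mine_samples(pool, probs_a, probs_b, threshold=0.95):
    """Split an unlabeled pool into mined (sample, label) pairs and the rest.

    A sample is mined when both models predict the same class and the
    lower of the two confidences reaches the threshold (assumed rule).
    """
    mined, remaining = [], []
    for x, pa, pb in zip(pool, probs_a, probs_b):
        label_a = pa.index(max(pa))
        label_b = pb.index(max(pb))
        if label_a == label_b and min(max(pa), max(pb)) >= threshold:
            mined.append((x, label_a))
        else:
            remaining.append(x)
    return mined, remaining


# Toy pool of three unlabeled items (not real data).
pool = ["u0", "u1", "u2"]
pa = [[0.98, 0.02], [0.55, 0.45], [0.03, 0.97]]
pb = [[0.97, 0.03], [0.96, 0.04], [0.96, 0.04]]
mined, remaining = mine_samples(pool, pa, pb)
print(mined, remaining)
```

Early in training few samples satisfy such a rule, which is consistent with the observation that the curves with and without sample mining start out nearly identical.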
5 Conclusion
We propose Mutual Match, an SSL algorithm that fully uses the interactive information between models and the extra knowledge in the training process to achieve an online-evolutive learning system. MM mainly includes soft supervision in consistency regularization, mutual interactive training, and unsupervised sample mining. Our method surpasses multiple semi-supervised algorithms regarding model training efficiency and accuracy in a unified OEL experiment setup and can be more suitable for model interaction and evolution.
Semi-supervised learning algorithms have developed rapidly and achieved good performance on many datasets. However, the best-performing SSL methods generally rely on integrating a variety of different techniques and on complex data judgment and processing procedures. At the same time, they are sensitive to the model structure, optimization method, image augmentation method, and numerous hyperparameter settings. As a result, semi-supervised learning algorithms face more sources of instability when extended to different tasks. To achieve better performance on a particular dataset, a formidable number of configurations often need to be adjusted.
Our research focuses on using model interaction information to improve online training on frequently changing non-i.i.d. data streams. We have summarized the key factors that significantly affect the performance of SSL models and proposed a unified and easily extensible learning method for multi-model interactive training. Considering applications in real scenarios, we conduct comparative experiments on multiple datasets with the same experimental configuration to demonstrate the effectiveness of our method. In the face of rapid changes in user-side perception scenarios, the proposed MM can continuously use the semantic information of unlabeled data to optimize model performance during learning, and can leverage the computing and communication capabilities of edge and terminal devices to achieve better generalization.
In future research, we are considering interactive learning and knowledge fusion between a larger number of models, in order to provide better self-learning and autonomous evolution methods for real-world applications. We will also investigate the OEL system in an unsupervised setting.
References
Van Engelen JE, Hoos HH (2020) A survey on semi-supervised learning. Mach Learn 109 (2):373–440
Sheikhpour R, Sarram MA, Gharaghani S, Chahooki MAZ (2017) A survey on semi-supervised feature selection methods. Pattern Recogn 64:141–158
Zhai X, Kolesnikov A, Houlsby N, Beyer L (2021) Scaling vision transformers. arXiv:2106.04560
Riquelme C, Puigcerver J, Mustafa B, Neumann M, Jenatton R, Pinto AS, Keysers D, Houlsby N (2021) Scaling vision with sparse mixture of experts. arXiv:2106.05974
Zhou Z, Chen X, Li E, Zeng L, Luo K, Zhang J (2019) Edge intelligence: Paving the last mile of artificial intelligence with edge computing. Proc IEEE 107(8):1738–1762
Zeng X, Pang C, Jiang L, Song L, Yang K, et al. (2020) White paper on networking systems of AI. https://ainsai.org/RESOURCES/White%20Paper/whitepaper.pdf
Berthelot D, Carlini N, Goodfellow IJ, Papernot N, Oliver A, Raffel C (2019) Mixmatch: A holistic approach to semi-supervised learning. In: NeurIPS, pp 5050–5060
Sohn K, Berthelot D, Carlini N, Zhang Z, Zhang H, Raffel C, Cubuk ED, Kurakin A, Li CL (2020) Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS
Arazo E, Ortego D, Albert P, O’Connor N E, McGuinness K (2020) Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: IJCNN. IEEE, pp 1–8
Bachman P, Alsharif O, Precup D (2014) Learning with pseudo-ensembles. In: NIPS, pp 3365–3373
Zhang D, Yin J, Zhu X, Zhang C (2018) Network representation learning: A survey. IEEE transactions on Big Data 6(1):3–28
Lee D-H (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML, vol 3, p 896
Sajjadi M, Javanmardi M, Tasdizen T (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: NIPS, pp 1163–1171
Tarvainen A, Valpola H (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NIPS, pp 1195–1204
Laine S, Aila T (2017) Temporal ensembling for semi-supervised learning. In: ICLR (poster). OpenReview.net
Liu L, Li Y, Tan RT (2019) Decoupled certainty-driven consistency loss for semi-supervised learning. arXiv:1901.05657
Xie Q, Dai Z, Hovy EH, Luong T, Le Q (2020) Unsupervised data augmentation for consistency training. In: NeurIPS
Berthelot D, Carlini N, Cubuk ED, Kurakin A, Sohn K, Zhang H, Raffel C (2020) Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In: ICLR. OpenReview.net
Sajjadi M, Javanmardi M, Tasdizen T (2016) Mutual exclusivity loss for semi-supervised deep learning. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE, pp 1908–1912
Grandvalet Y, Bengio Y (2004) Semi-supervised learning by entropy minimization. In: NIPS, pp 529–536
Cubuk ED, Zoph B, Shlens J, Le QV (2020) Randaugment: Practical automated data augmentation with a reduced search space. In: CVPR workshops. Computer Vision Foundation / IEEE, pp 3008–3017
Nassar I, Herath S, Abbasnejad E, Buntine WL, Haffari G (2021) All labels are not created equal: Enhancing semi-supervision via label grouping and co-training. In: CVPR. Computer Vision Foundation / IEEE, pp 7241–7250
Springenberg JT (2016) Unsupervised and semi-supervised learning with categorical generative adversarial networks. In: ICLR (poster)
Odena A (2016) Semi-supervised learning with generative adversarial networks. arXiv:1606.01583
Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: ICLR
Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, Bengio Y (2019) Learning deep representations by mutual information estimation and maximization. In: ICLR. OpenReview.net
Jing L, Tian Y (2020) Self-supervised visual feature learning with deep neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence
Chen T, Kornblith S, Norouzi M, Hinton G E (2020) A simple framework for contrastive learning of visual representations. In: ICML. Proceedings of Machine Learning Research, vol 119. PMLR, pp 1597–1607
Chen T, Kornblith S, Swersky K, Norouzi M, Hinton GE (2020) Big self-supervised models are strong semi-supervised learners. In: NeurIPS
Kim B, Choo J, Kwon Y-D, Joe S, Min S, Gwon Y (2021) Selfmatch: Combining contrastive self-supervision and consistency for semi-supervised learning. arXiv:2101.06480
Blum A, Mitchell TM (1998) Combining labeled and unlabeled data with co-training. In: COLT. ACM, pp 92–100
Nigam K, Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: CIKM. ACM, pp 86–93
Zhou Z-H, Li M (2005) Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on knowledge and Data Engineering 17(11):1529–1541
Davison AC, Hinkley DV (1997) Bootstrap methods and their application, Cambridge University Press
Chen D, Wang W, Gao W, Zhou Z-H (2018) Tri-net for semi-supervised deep learning. In: IJCAI. ijcai.org, pp 2014–2020
Breiman L (2000) Randomizing outputs to increase prediction accuracy. Mach Learn 40(3):229–242
Hinton GE, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv:1503.02531
Zhang Y, Xiang T, Hospedales TM, Lu H (2018) Deep mutual learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4320–4328
Pereyra G, Tucker G, Chorowski J, Kaiser L, Hinton G E (2017) Regularizing neural networks by penalizing confident output distributions. In: ICLR (workshop). OpenReview.net
Pham H, Dai Z, Xie Q, Le QV (2021) Meta pseudo labels. In: CVPR. Computer Vision Foundation / IEEE, pp 11557–11568
Sellars P, Avilés-Rivero AI, Schönlieb C-B (2021) Laplacenet: A hybrid energy-neural model for deep semi-supervised classification. arXiv:2106.04527
Li E, Zhou Z, Chen X (2018) Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In: MECOMM@SIGCOMM. ACM, pp 31–36
Goodfellow I, Bengio Y, Courville A (2016) Deep learning, MIT press
Xie Q, Dai Z, Hovy EH, Luong T, Le Q (2020) Unsupervised data augmentation for consistency training. In: NeurIPS
Hinton G, Vinyals O, Dean J (2014) Dark knowledge. Presented as the keynote in BayLearn 2:2
Géron A (2019) Hands-on machine learning with scikit-learn, keras, and tensorflow: Concepts, tools, and techniques to build intelligent systems, O’Reilly Media
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M (2016) Tensorflow: A system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp 265–283
Zagoruyko S, Komodakis N (2016) Wide residual networks. In: BMVC. BMVA Press
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: ICCV. IEEE Computer Society, pp 1026–1034
LeCun Y, Bottou L, Orr GB, Müller K-R (2012) Efficient backprop, Lecture Notes in Computer Science, vol 7700. Springer
Loshchilov I, Hutter F (2017) SGDR: stochastic gradient descent with warm restarts. In: ICLR (poster). OpenReview.net
Jung A B, Wada K, Crall J, Tanaka S, Graving J, Reinders C, Yadav S, Banerjee J, Vecsei G, Kraft A, Rui Z, Borovec J, Vallentin C, Zhydenko S, Pfeiffer K, Cook B, Fernández I, De Rainville F -M, Weng C-H, Ayala-Acevedo A, Meudec R, Laporte M et al (2020) imgaug. Online; accessed 01-Feb-2020
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images
Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng A Y (2011) Reading digits in natural images with unsupervised feature learning
Le Y, Yang X (2015) Tiny imagenet visual recognition challenge. CS 231N 7(7):3
Zhang H, Cissé M, Dauphin Y N, Lopez-Paz D (2018) mixup: Beyond empirical risk minimization. In: ICLR (poster). OpenReview.net
Funding
This work is supported by the Shanghai Key Research Laboratory of NSAI and the China Mobile Research Fund of Chinese Ministry of Education (Grant No. KEH2310029).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, D., Zhu, X. & Song, L. Mutual match for semi-supervised online evolutive learning. Appl Intell 53, 3336–3350 (2023). https://doi.org/10.1007/s10489-022-03564-7
DOI: https://doi.org/10.1007/s10489-022-03564-7