1 Introduction

Semi-supervised learning (SSL) [1, 2] algorithms based on deep neural networks have achieved excellent results on several challenging benchmark tasks. These methods mainly study how to use additional identically distributed unlabeled data for model training when labeled data is limited. Most state-of-the-art deep learning applications rely on fully supervised methods to provide users with discriminative results. However, these generally require enormous amounts of labeled data and colossal computing power [3, 4], which results in poor scalability and makes them unsuitable for tasks that require adaptive model adjustment. Compared with fully supervised methods, SSL models can be trained autonomously with a small amount of annotated data while achieving competitive or superior results, which makes them a potential candidate for adapting to evolving, non-stationary data streams.

With the development of edge intelligence [5], the number of edge devices with independent communication and computing capabilities is increasing, making interaction and training among edge models possible. However, in real-world situations, the massive unlabeled data acquired in real time by edge devices is usually unstable and has a time-varying distribution. To maintain generalization ability and provide customized recognition services, edge models should continuously perform efficient model optimization in changing perception scenarios where the data stream is non-i.i.d. with respect to the specific task. Specifically, the online evolutive learning (OEL) system [6] describes such a machine learning scenario, in which edge models are trained online automatically in a real-time, changing environment without a unified cloud model.

Although an SSL-based edge model suits the OEL system, current state-of-the-art SSL methods do not fully consider the problem of learning with non-i.i.d. data streams. Meanwhile, existing approaches often use pseudo-labeling and consistency regularization techniques as crucial components and integrate increasingly complex data processing and discrimination techniques for better performance on different tasks, which makes SSL models hard to deploy widely and prone to performance degradation in perceptual environments where the data distribution changes frequently.

In this paper, we investigate online evolutive SSL mechanisms that can efficiently utilize information communication among edge models and the extra knowledge in the training process, in order to ensure solid generalization ability when facing non-i.i.d. unlabeled data streams. In particular, we introduce Mutual Match (MM), an online-evolutive SSL algorithm that integrates mutual interactive learning, soft-labeled augmentation consistency regularization, and unsupervised sample mining to achieve this goal. Through interactive collaboration among edge models, MM surpasses multiple top semi-supervised learning algorithms such as MixMatch [7] and FixMatch [8] in both accuracy and efficiency under the online-evolutive experiment setup. We also analyze the adaptability of MM to changing data distributions and the influence of each MM module on overall performance.

Our main contributions can be summarized as follows:

1) We analyze the existing problems in the edge intelligence environment and the need for continuous learning in edge models, and propose the OEL system, which is better suited to modern IoT applications.

2) To maintain model generalization ability under the OEL setting, we introduce MM, consisting of three stages: a) soft-labeled augmentation consistency regularization for utilizing unlabeled data, b) extraction of extra information through mutual interaction loss computation, and c) expansion of the labeled dataset through unsupervised sample mining. MM adapts better to the changing data distributions in OEL systems.

2 Related work

Our main work lies in realizing online-evolutive SSL in a multi-model interactive environment, where a small number of annotations is employed for continuous self-evolution. The related work includes semi-supervised learning and multi-model interactive learning.

2.1 Semi-supervised learning

The most successful SSL algorithms mainly use techniques such as pseudo-labeling [9], consistency regularization [10] and unsupervised feature representation learning [11]. By combining further data fusion and discrimination techniques, many SSL methods have surpassed fully supervised learning. Pseudo-labeling [12] uses a confidence threshold to generate hard labels on unlabeled data for model training and gradually expands the labeled dataset with the pseudo-labeled data. Many other methods [13,14,15] follow pseudo-labeling as a fundamental step of SSL and add additional constraints or pseudo-label enhancement techniques to achieve better performance. Consistency regularization [16] is a training constraint under which a model should produce consistent predictions for different random variants of the same input; it is widely used together with pseudo-labeling in the SSL field. Early consistency regularization methods mostly used the L2 distance between the predictions for different input variants as a loss to learn the information contained in unlabeled data [15]. Among recent studies [17, 18], MixMatch [7] builds an SSL model from unlabeled data augmentation and pseudo-label fusion techniques, combined with pseudo-label low-entropy processing [19, 20] and data-mixing consistency regularization, and has achieved great success. FixMatch [8] applies weak and strong augmentation [21] to unlabeled data for consistency regularization, where high-confidence predictions on weakly augmented data are used as pseudo-labels for the strongly augmented versions. Semco [22] introduces prior knowledge of visual similarity between categories into the semantic predictions of semi-supervised models and designs a collaborative training process based on disagreement. Extracting more training-relevant knowledge from unlabeled data is the key to the success of SSL. Using GANs [23, 24] or other unsupervised feature representation learning methods [25,26,27] to assist SSL systems is also promising. Chen et al. [28, 29] design a contrastive self-supervised learning method that compares the similarity between same-source and different-source data; it uses a large amount of data and computing power to learn a better feature representation and shows that better SSL performance can be obtained through transfer learning. Kim et al. [30] construct a better pre-trained model by maximizing the mutual information between outputs of the same source, which also yields performance gains in SSL.

2.2 Multi-model interactive learning

Information interaction between multiple models can also effectively improve the performance of SSL. Co-training [31] trains two models on data from different redundant views. Each model generates pseudo-labels on the unlabeled data for which its predictions have higher confidence and uses them to train the other model. In this way, each model can integrate supervision from different sources [32]. Since fully redundant, conditionally independent multi-views are difficult to obtain in most machine learning tasks, tri-training [33] proposes a collaborative training method that requires no distinct views. It uses bootstrap [34] sampling to obtain differently labeled datasets for training multiple models and then uses the votes of multiple models on the predictions for unlabeled data as pseudo-labels for consistent training. Tri-net [35] extends multi-model collaborative training to deep learning: it designs three models with shared parameters and applies output smearing [36] to the training labels of each model to increase the divergence between models. When a model is trained with unlabeled data, its pseudo-labels can be determined by fusing the predictions of the other, relatively redundant models. Starting from the principle of knowledge distillation [37], mutual learning [38] shows that effective knowledge interaction between different models can help them improve jointly. It trains models to minimize the discrepancy between their output distributions and achieves better performance on various tasks. Through multi-directional information transmission, the evaluation and fusion of multiple models' predicted probability distributions for the same data enables each model to obtain more helpful information [39].

In summary, most leading semi-supervised learning methods [40, 41] at this stage still focus on improving performance on specific databases, yet data distribution issues arise in real applications. The SSL model can achieve continuous performance improvement with limited labeled data, but the impact on SSL performance when unlabeled data are non-i.i.d. relative to labeled data still needs extensive research. Multi-model interactive learning methods meet the requirements of OEL, but existing methods only focus on ensemble SSL or on introducing regularization into each model's training, ignoring the potential of interactive learning in an open learning scenario. With the rapid development of edge intelligence [5, 42], the combination of model interactive learning and SSL can contribute to realizing an OEL system, which is a desirable way to solve these distribution problems and is attractive for modern IoT applications.

3 Mutual match

In this section, we first present the problem definition and the overall framework of MM, and then introduce the specific methods and roles of each key module. Specifically, we use different models to represent learnable edge devices, and a noised dataset is used to simulate an unstable data distribution environment.

3.1 Problem definition

As shown in Fig. 1, in an OEL scenario, multiple edge models/agents in a local area can obtain perception data in real time and combine it with a small amount of labeled data for SSL model training. Through knowledge transfer and interactive learning between models, each model makes better use of the large amount of non-i.i.d. unlabeled data to achieve continuous model adaptation and performance improvement.

Fig. 1 Conceptual diagram of SSL in the Online Evolutive Learning scenario

Given a classification task, let \(\mathbbm {X}, \mathbbm {U}\) be the labeled and unlabeled datasets. Let \(\mathcal {X}=\left \{\left (x_{i},y_{i}\right )\right \},i\in \left (1,\ldots ,B\right )\) be a batch of B labeled examples sampled from \(\mathbbm {X}\), where the xi are training examples and the yi are one-hot labels. Let \(\mathcal {U}=\left \{u_{i}\right \},i\in \left (1,\ldots ,\mu B\right )\) be a batch of μB unlabeled examples sampled from \(\mathbbm {U}\), where μ is a hyperparameter that scales the proportion of unlabeled to labeled training examples in each iteration. Note that \(\mathbbm {X}\) is small relative to \(\mathbbm {U}\), and \(\mathbbm {U}\) contains data that are non-i.i.d. with respect to \(\mathbbm {X}\). Labeled and unlabeled batches are continuously fed to the model at each training step, and each sampled unlabeled batch is used only once, forming an online batch learning process. Additionally, we denote by \(P_{{\mathscr{M}}_{1}}\left (y\middle | x\right )\) the predicted class distribution y produced by model \({\mathscr{M}}_{1}\) for input x, which is the general notation for prediction results throughout this article.
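To make this online batch process concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of how labeled and unlabeled batches could be drawn, assuming NumPy arrays and the B = 64, μ = 7 setting reported later (Sect. 4.2, Table 3):

```python
import numpy as np

def batch_stream(X, Y, U, B=64, mu=7):
    """Online batch process of Sect. 3.1: each step pairs a random labeled
    batch of size B with the next mu*B examples of the unlabeled stream,
    which are consumed only once."""
    ptr = 0
    while ptr + mu * B <= len(U):
        idx = np.random.choice(len(X), size=B, replace=False)
        yield X[idx], Y[idx], U[ptr:ptr + mu * B]  # (x_i, y_i) and u_i batches
        ptr += mu * B
```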

3.2 Overall framework

Because the unlabeled data stream is non-i.i.d., we propose, in contrast to usual SSL methods, an SSL framework with knowledge sharing and continuous learning between models, in order to stabilize generalization ability and achieve better performance in OEL settings.

The main work of MM includes generating soft supervision information in the consistency regularization process to obtain more category-related information, introducing a dual-model mutual interactive learning algorithm to realize the information exchange between models, and using unsupervised sample mining during the training process to improve the utilization of high-confidence samples.

The overall process of our method is shown in Fig. 2. MM is an alternate training process that contains two computation paths. Along the different paths, the predictions for labeled/unlabeled data under different augmentation methods are used to calculate the supervised/unsupervised loss. Specifically, the unsupervised loss is calculated using soft supervision consistency regularization. These predictions also participate in the calculation of the interaction loss: when one model calculates its losses, the other model predicts the same input and its result is used as the target of the interaction KL loss. At the end of each step, unsupervised sample mining is performed to expand the labeled dataset.

Fig. 2 Mutual Match training process. Models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\) alternate in each training step; details are described in Algorithm 1

We set up three types of random image augmentation for MM: weak, medium, and strong, denoted by \(\mathcal {A}_{w},\mathcal {A}_{m},\mathcal {A}_{s}\), respectively. \(\mathcal {A}_{m}\) is used in supervised learning on labeled data; \(\mathcal {A}_{w}\) and \(\mathcal {A}_{s}\) are used in the unsupervised part of MM for consistency regularization; and the predictions on both \(\mathcal {A}_{w}\) and \(\mathcal {A}_{m}\) variants are also used to calculate the interaction between models. The details of each augmentation method are described in Tables 1 and 2.

Table 1 Details of transformation methods in weak augmentation and medium augmentation


Table 2 Details of transformation methods in strong augmentation

3.3 Soft supervision in consistency regularization

Consistency regularization has played an essential role in many semi-supervised learning tasks. It is based on the assumption that a model's predictions for the same input should be consistent: applying various semantics-preserving changes to an input should not affect the model's target expression, and the predictions for the input and its augmented versions should follow probability distributions that are as close as possible. With such a constraint, the model can obtain task-related knowledge from unlabeled data.

Many SSL methods that apply consistency regularization perform one-hot encoding, sharpening [43] or other label enhancement methods [44] on the filtered prediction results and use these artificial labels as supervision for training on augmented versions of the same inputs. These methods have achieved good results on specific tasks but lose extra supervision information during training that is important in non-stationary training environments.

Different from label enhancement methods, we directly adopt soft supervision in the consistency regularization. Specifically, because MM is an interactive learning process, the other model also generates prediction information for the input. We use a ratio α, sampled from a beta distribution, to fuse the two predictions into soft labels. The combined soft label contains more information associated with other categories; we consider that some category-related dark knowledge [45] in the unlabeled data can be exploited this way, and the filter threshold of MM already ensures that soft labels keep a low-entropy representation. Moreover, soft supervision can better benefit classification tasks with more categories. In terms of random image augmentation, unlike [8], which applies weak augmentation to labeled and unlabeled data alike, we configure medium augmentation for the labeled data; this may introduce some noise into the supervised training and thereby increase the reliability of the model's predictions on unlabeled data.

We now describe the formulation of the supervised and unsupervised losses, and the roles of soft supervision and the three augmentation methods in the loss calculation.

In classification tasks, cross-entropy is generally used to measure the difference between the predicted and target probability distributions and serves as the loss function that guides model training. Let \(H\left (p,q\right )=-\sum p\log \left (q\right )\) represent the cross-entropy between probability distributions p and q. Taking the training of model \({\mathscr{M}}_{1}\) as an example, with \(\mathcal {X},\mathcal {U}\) a batch of labeled and unlabeled training examples, the supervised loss \({\mathscr{L}}_{s}\) of \({\mathscr{M}}_{1}\) can be denoted as (1):

$$ \mathcal{L}_{s}=\frac{1}{B}{\sum}_{i=1}^{B} H\left( y_{i}\ ,P_{\mathcal{M}_{1}}\left( y\middle|\mathcal{A}_{m}\left( x_{i}\right)\right)\right), $$
(1)

where \(\mathcal {A}_{m}\left (x_{i}\right )\) represents the medium image augmentation method on labeled examples, \(P_{{\mathscr{M}}_{1}}\left (y\middle |\mathcal {A}_{m}\left (x_{i}\right )\right )\) is the output probability distribution of \({\mathscr{M}}_{1}\) with \(\mathcal {A}_{m}\left (x_{i}\right )\) as input, and yi is the one-hot target of xi in batch B.
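As an illustration, Eq. (1) can be written in a few lines of TensorFlow. This is a hedged sketch: model_1 is assumed to be a Keras model returning softmax probabilities, and aug_medium is a stand-in for \(\mathcal {A}_{m}\):

```python
import tensorflow as tf

def supervised_loss(model_1, x_batch, y_batch, aug_medium):
    """Eq. (1): mean cross-entropy between one-hot labels y_i and the
    predictions on medium-augmented labeled examples A_m(x_i)."""
    probs = model_1(aug_medium(x_batch), training=True)   # P_M1(y | A_m(x))
    ce = -tf.reduce_sum(y_batch * tf.math.log(probs + 1e-8), axis=-1)
    return tf.reduce_mean(ce)
```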

For the unsupervised loss, we compute the combined predictions ri by

$$ r_{i_{1}}=P_{\mathcal{M}_{1}}\left( y\middle|\mathcal{A}_{w}\left( u_{i}\right)\right), $$
(2)
$$ r_{i_{2}}=P_{\mathcal{M}_{2}}\left( y\middle|\mathcal{A}_{w}\left( u_{i}\right)\right), $$
(3)
$$ \alpha \sim Beta(8.5,1.5), $$
(4)
$$ r_{i}=\alpha r_{i_{1}} +(1-\alpha) r_{i_{2}} , $$
(5)

where \(r_{i_{1}}\) and \(r_{i_{2}}\) denote the probability distributions predicted by models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\) for the weakly augmented unlabeled example ui, respectively, and α is a scalar in [0,1] representing the fusion ratio sampled from the beta distribution; thus \(r_{i}\), the weighted average of the two predictions, is the combined prediction used for soft supervision consistency regularization.
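Eqs. (2)-(5) amount to a weighted average of the two models' predictions on the same weakly augmented batch; a minimal sketch follows (function names are ours). Since Beta(8.5, 1.5) has mean 0.85, the model currently being trained dominates its own soft labels:

```python
import numpy as np

def fused_soft_labels(model_1, model_2, u_batch, aug_weak):
    """Eqs. (2)-(5): fuse both models' predictions on one weakly
    augmented view of the unlabeled batch with a Beta(8.5, 1.5) ratio."""
    u_weak = aug_weak(u_batch)                 # one shared A_w(u) view
    r1 = model_1(u_weak, training=False)       # r_{i,1}
    r2 = model_2(u_weak, training=False)       # r_{i,2}
    alpha = np.random.beta(8.5, 1.5)           # fusion ratio in [0, 1]
    return alpha * r1 + (1.0 - alpha) * r2     # combined prediction r_i
```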

Then, our consistency regularization process can be interpreted as (6) and (7):

$$ r_{i}^{(uns)}=\mathbbm{1}\left( \max \left( r_{i}\right) \geq \mathcal{T}\right) r_{i}, $$
(6)

where \(\mathcal {T}\) is a threshold that filters out low-confidence predictions through the indicator function \(\mathbbm {1}\left (\max \left (r_{i}\right )\geq \mathcal {T}\right )\). In this way, if a prediction result satisfies \(\max \left (r_{i}\right )<\mathcal {T}\), it is discarded from the unsupervised loss calculation. We use an argwhere-like function to fetch the filtered result \(r_{i}^{(uns)}\) without one-hot encoding as the soft label; the corresponding input ui is then used to jointly calculate the unsupervised loss:

$$ \mathcal{L}_{u}=\frac{1}{\mu B} {\sum}_{i=1}^{\mu B} H\left( r_{i}^{(u n s)}, P_{\mathcal{M}_{1}}\left( y \mid \mathcal{A}_{s}\left( u_{i}\right)\right)\right), $$
(7)

where \(\mathcal {A}_{s}\left (u_{i}\right )\) denotes the strong image augmentation applied to the filtered unlabeled examples.
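A corresponding sketch of Eqs. (6)-(7), again assuming TensorFlow; the threshold value is illustrative (Table 3 lists the settings actually used), and aug_strong stands in for \(\mathcal {A}_{s}\):

```python
import tensorflow as tf

def unsupervised_loss(model_1, u_batch, r_soft, aug_strong, threshold=0.95):
    """Eqs. (6)-(7): keep soft labels whose maximum probability reaches
    the threshold T, then apply soft cross-entropy against predictions
    on the strongly augmented inputs, normalized by mu*B."""
    mask = tf.reduce_max(r_soft, axis=-1) >= threshold    # Eq. (6) filter
    u_kept = tf.boolean_mask(aug_strong(u_batch), mask)
    r_kept = tf.boolean_mask(r_soft, mask)                # soft labels, no one-hot
    probs = model_1(u_kept, training=True)                # P_M1(y | A_s(u))
    ce = -tf.reduce_sum(r_kept * tf.math.log(probs + 1e-8), axis=-1)
    return tf.reduce_sum(ce) / tf.cast(tf.shape(u_batch)[0], tf.float32)
```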

Through the soft supervision in consistency regularization and the three augmentation methods used in the supervised/unsupervised loss calculation, each model in MM makes full use of category-related information and continuously improves its judgment ability during training.

3.4 Mutual interactive training

Many studies have shown that SSL methods can gradually improve a model's ability to use the knowledge in unlabeled data over a continuous learning process. However, in OEL settings, where edge devices are distributed over different areas and the perception scene is constantly changing, the training of SSL models is always affected by non-i.i.d. data streams, which may lead to poor generalization. It is therefore desirable to effectively use knowledge sharing between edge models and implement an interactive learning method to improve model performance continuously.

With the above consideration, we introduce a multi-model learning mechanism. Hinton et al. [37] explain the feasibility of quickly building a lightweight model with better performance based on knowledge transfer between models. Zhang et al. [38] use the posterior probabilities of peer models to teach each other during training. We accordingly adopt dual-model information interaction to achieve alternate training and evolution between models.

Following [38], we use Kullback-Leibler (KL) divergence to measure the difference between the models' predicted probability distributions on labeled and unlabeled data, and use it as interactive information to guide the training of the models alternately. Precisely, we calculate a filtered KL loss in the unsupervised part, as described below.

The KL divergence between probability distributions q and p can be computed as:

$$ D_{K L}(q \| p)=\sum q \log \frac{q}{p} $$
(8)
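In code, Eq. (8) is a few lines. In the sketch below (our illustration), the target distribution q comes from the peer model, so it is detached from the gradient computation; the empty-batch guard is our own addition:

```python
import tensorflow as tf

def interaction_loss(r_target, r_pred, eps=1e-8):
    """Eq. (8): D_KL(q || p) = sum q * log(q / p), averaged over the
    batch. The peer model's output q is the target and gets no gradient."""
    q = tf.stop_gradient(r_target) + eps
    p = r_pred + eps
    kl = tf.reduce_sum(q * tf.math.log(q / p), axis=-1)        # per-example D_KL
    n = tf.maximum(tf.cast(tf.shape(kl)[0], tf.float32), 1.0)  # guard empty batch
    return tf.reduce_sum(kl) / n
```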

In our case, when calculating the KL distance of the supervised part, we denote by \(r^{\left (m_{1}\right )}\) and \(r^{\left (m_{2}\right )}\) the predicted distributions of models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\), respectively; \(D_{KL}(r^{\left (m_{2}\right )}||r^{\left (m_{1}\right )})\) is then the supervised interaction loss of model \({\mathscr{M}}_{1}\). The two models are trained alternately; the detailed training process is described in Algorithm 1.

Algorithm 1

When calculating the KL distance of the unsupervised part, let \(r_{i_{2}}=P_{{\mathscr{M}}_{2}}\left (y\middle |\mathcal {A}_{w}\left (u_{i}\right )\right )\) be the probability distribution predicted by model \({\mathscr{M}}_{2}\) with the weakly augmented unlabeled example ui as input; we choose the filtered \(r_{i_{2}}^{(kl)}\) as the KL loss target, and \(r_{i_{1}}=P_{{\mathscr{M}}_{1}}\left (y\middle |\mathcal {A}_{w}\left (u_{i}\right )\right )\) as the prediction side of the KL loss for model \({\mathscr{M}}_{1}\). Note that unlike the strongly augmented \(\mathcal {A}_{s}\left (u_{i}\right )\) outputs used to compute the unsupervised loss, we use only the weakly augmented \(\mathcal {A}_{w}\left (u_{i}\right )\) outputs to calculate the unsupervised KL loss, and \(r_{i_{1}}^{(kl)}\) is \(r_{i_{1}}\) filtered by the same rule as \(r_{i_{2}}^{(kl)}\) at each step. The unsupervised interaction loss of \({\mathscr{M}}_{1}\) can then be interpreted as \(D_{KL}(r_{i_{2}}^{(kl)}|| r_{i_{1}}^{(kl)} )\).

In summary, the overall loss function of Mutual Match can be expressed as:

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{\mathcal{M}_{1}}=\mathcal{L}_{s}^{(m_{1})}&+&\lambda_{kl} D_{KL}(r^{(m_{2})}||r^{(m_{1})})+\lambda_{u}\mathcal{L}_{u}^{(m_{1})}\\&+&\lambda_{kl}D_{KL}(r_{i_{2}}^{(kl)}||r_{i_{1}}^{(kl)}), \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{\mathcal{M}_{2}}=\mathcal{L}_{s}^{(m_{2})}&+&\lambda_{kl} D_{KL}(r^{(m_{1})}||r^{(m_{2})})+\lambda_{u}\mathcal{L}_{u}^{(m_{2})}\\&+&\lambda_{kl}D_{KL}(r_{i_{1}}^{(kl)}||r_{i_{2}}^{(kl)}), \end{array} $$
(10)

where \({\mathscr{L}}_{{\mathscr{M}}_{1}}\) and \({\mathscr{L}}_{{\mathscr{M}}_{2}}\) represent the loss functions of models \({\mathscr{M}}_{1}\) and \({\mathscr{M}}_{2}\), respectively. The two models are trained alternately based on part of each other's outputs. λu and λkl are the weights of the unsupervised loss and the KL loss terms.
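Putting the pieces together, one alternate-training step for \({\mathscr{M}}_{1}\) (Eq. (9)) might look as follows; swapping the two models yields the \({\mathscr{M}}_{2}\) step of Eq. (10). This is a sketch built from the helper functions above, with aug_weak, aug_medium, and aug_strong standing in for \(\mathcal {A}_{w},\mathcal {A}_{m},\mathcal {A}_{s}\) (an imgaug sketch appears in Sect. 4.1), loss weights and threshold as placeholders (Table 3 lists the values actually used), and the unsupervised KL term computed on the below-threshold part of the batch as described in the next paragraph:

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(0.03, momentum=0.9, nesterov=True)

def train_step_m1(model_1, model_2, x, y, u,
                  lambda_u=1.0, lambda_kl=0.5, threshold=0.95):
    """One MM training step for model M1 (Eq. (9))."""
    x_m, u_w = aug_medium(x), aug_weak(u)
    with tf.GradientTape() as tape:
        # Supervised cross-entropy, Eq. (1)
        p1_x = model_1(x_m, training=True)
        l_s = tf.reduce_mean(-tf.reduce_sum(y * tf.math.log(p1_x + 1e-8), axis=-1))
        # Supervised interaction loss, D_KL(r^(m2) || r^(m1))
        kl_sup = interaction_loss(model_2(x_m, training=False), p1_x)
        # Soft labels, Eqs. (2)-(5)
        r1 = model_1(u_w, training=True)
        r2 = model_2(u_w, training=False)
        alpha = np.random.beta(8.5, 1.5)
        r_soft = tf.stop_gradient(alpha * r1 + (1.0 - alpha) * r2)
        # Soft-supervised consistency loss, Eqs. (6)-(7)
        l_u = unsupervised_loss(model_1, u, r_soft, aug_strong, threshold)
        # Unsupervised interaction loss on the low-confidence predictions
        kl_mask = tf.reduce_max(r2, axis=-1) < threshold
        kl_uns = interaction_loss(tf.boolean_mask(r2, kl_mask),
                                  tf.boolean_mask(r1, kl_mask))
        loss = l_s + lambda_kl * kl_sup + lambda_u * l_u + lambda_kl * kl_uns
    grads = tape.gradient(loss, model_1.trainable_variables)
    optimizer.apply_gradients(zip(grads, model_1.trainable_variables))
    return loss
```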

We add the KL loss to the training of both the supervised and unsupervised parts so that the models exchange information during the alternate training process. The unsupervised KL loss enables the model to effectively use the unlabeled inputs whose confidence falls below the threshold \(\mathcal {T}\), providing additional regularization constraints. We believe that mutual interactive training can accelerate model convergence and improve model generalization in an OEL setting.

3.5 Sample mining for incremental learning

Since we use a soft supervision loss in the consistency regularization part, more information about the correlation between categories is provided, but a little uncertainty is also added to the interactive training process, which may cause the training error to fluctuate around a local minimum.

As the iterations progress, the models' predictions for some unlabeled samples gradually become stable, and some higher-confidence soft labels exhibit a more certain probability distribution. To better use these stably predicted samples to maintain training stability, we convert their soft labels to one-hot labels through a sample mining process, thereby expanding the labeled dataset and enhancing the discriminative performance of the model with stronger supervision.

Specifically, we build an index counter list \(\mathbbm {I}\) for the unlabeled data, which collects higher-confidence prediction results in each iteration and counts their occurrence frequency in \(\mathbbm {I}\). Data whose occurrence frequency reaches the indicator value δ are added to the labeled dataset for the calculation of the supervised loss. Depending on the complexity of the training task and the size of the dataset, we usually set the indicator value δ to a scalar from 8 to 12. In particular, if the occurrence count of an unlabeled example in \(\mathbbm {I}\) has not reached δ when its highest category confidence drops, its frequency in \(\mathbbm {I}\) is cleared. The details of the sample mining process are described in Algorithm 2.

Algorithm 2
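A compact sketch of this counter logic follows; the class and method names are ours, and Algorithm 2 gives the authoritative procedure:

```python
import numpy as np
from collections import defaultdict

class SampleMiner:
    """Index counter list I of Sect. 3.5: an unlabeled example is promoted
    to the labeled set once its high-confidence prediction has persisted
    for delta qualifying steps; a confidence drop clears its count."""
    def __init__(self, delta=10, threshold=0.95, num_classes=10):
        self.delta, self.threshold = delta, threshold
        self.num_classes = num_classes
        self.counts = defaultdict(int)          # index counter list I

    def update(self, indices, soft_labels):
        promoted = []                           # (index, one-hot label) pairs
        for idx, r in zip(indices, soft_labels):
            if r.max() >= self.threshold:
                self.counts[idx] += 1
                if self.counts[idx] >= self.delta:
                    promoted.append((idx, np.eye(self.num_classes)[r.argmax()]))
                    del self.counts[idx]        # promoted: stop counting
            else:
                self.counts[idx] = 0            # confidence dropped: clear
        return promoted
```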

4 Experiments

This section describes the experimental environment settings and implementation details and reports performance comparisons between MM and other methods on different datasets. In the ablation study, we further analyze the influence of each module of MM on overall performance.

4.1 Implementation details

Since previous methods were configured in different experimental environments and aimed at standard datasets, to better evaluate our method we compare MM with other algorithms under the same experimental conditions with unstable, time-varying data distributions. Our programming environment mainly uses Keras [46] with TensorFlow [47] as the backend.

Dataset settings

A general semi-supervised learning algorithm selects part of the labeled data from a dataset as the supervised training set and removes the labels from the remaining data to form the unlabeled training set. To simulate a frequently changing perceptual environment with varying data distribution, we introduce additional non-i.i.d. data into the unlabeled training set so that the learning process experiences the same kind of data disturbance, with different feature distributions, as in realistic scenes. Specifically, we add 20% differently distributed data to the unlabeled part of each dataset, composed mainly of data from other categories and background data.

Model structure

Since Wide ResNet [48] has performed relatively stably in many research experiments, we adopt WRN as the primary model structure and configure different variants for different databases.

Parameter initialization

We equip the two models used for interactive training with different parameter initialization methods. Specifically, we use He normal [49] and LeCun uniform [50] initialization for the two models, respectively. We uniformly set a weight decay coefficient of 0.0005 for all models, and no model uses dropout.

Optimizer settings

We adopt a standard SGD optimizer with Nesterov momentum β = 0.9 and cosine learning rate decay [51] with initial learning rate η = 0.03 for all models.

Training data normalization

Unlike other methods, which normalize the data with the mean and standard deviation of the training set, we consider that each batch of training data is randomly augmented with different strengths during training, which slightly changes the data distribution. We therefore simply divide images by 255.0 as normalization, limiting pixel values to [0, 1].

Data augmentation methods

We use the same data augmentation methods based on imgaug [52] library for all experiments, and the specific augmentation methods are described in Tables 1 and 2. For weak augmentation, we randomly select 1 to 3 augmentation methods from Table 1 to apply to the input image, and the implementation intensity of each method is also in a random range. For medium augmentation, we randomly select 3 to 6 augmentation methods from Table 1. For strong augmentation, we randomly select 4 to 9 augmentation methods from Table 2, and the order of implementation of each method is also random.
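As an illustration of this configuration, the pipelines can be expressed with imgaug's SomeOf operator. The transformation pools below are placeholders of our own choosing; the concrete methods are those of Tables 1 and 2:

```python
import numpy as np
import imgaug.augmenters as iaa

# Placeholder pools: the actual transformations are listed in Tables 1 and 2.
table1_pool = [
    iaa.Fliplr(0.5),
    iaa.Crop(percent=(0, 0.1)),
    iaa.Affine(translate_percent={"x": (-0.1, 0.1), "y": (-0.1, 0.1)}),
    iaa.GaussianBlur(sigma=(0.0, 1.0)),
    iaa.Add((-20, 20)),
    iaa.Multiply((0.8, 1.2)),
]
table2_pool = table1_pool + [
    iaa.Affine(rotate=(-30, 30)),
    iaa.CoarseDropout(0.05, size_percent=0.1),
    iaa.AddToHueAndSaturation((-30, 30)),
]

# Random subset sizes and random order match the counts described above.
aug_weak   = iaa.SomeOf((1, 3), table1_pool, random_order=True)
aug_medium = iaa.SomeOf((3, 6), table1_pool, random_order=True)
aug_strong = iaa.SomeOf((4, 9), table2_pool, random_order=True)

batch = np.zeros((4, 32, 32, 3), dtype="uint8")   # dummy uint8 image batch
batch_aug = aug_weak(images=batch)
x = batch_aug.astype("float32") / 255.0           # normalization used above
```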

To avoid differences in the specific configuration of the model under different programming frameworks, all training processes of each method are carried out under our experimental settings. We performed K iterations for each set of experiments, and more specific hyperparameters are described in Table 3. We conducted multiple rounds of experiments on CIFAR-10/100 [53], SVHN [54], and Tiny-imagenet [55] databases using different groups and different amounts of labeled data to evaluate the performance of Mutual Match.

Table 3 List of Mutual Match hyperparameters for all datasets

Each model is trained on a single Nvidia RTX 2080 Ti GPU. For each database, we also trained a fully supervised model as a baseline for comparison with the semi-supervised algorithms. We report the detailed performance comparison of Mutual Match and other methods in Table 4.

Table 4 Comparison of accuracy for CIFAR-10/100, SVHN and Tiny-imagenet on 3 different unlabeled data folds

Moreover, to illustrate the feasibility of MM in practical applications, we report the simulation times of several representative methods in Table 5. Because of the interactive learning process, the batch training time of our method is higher than that of some algorithms, while the inference time is not affected.

Table 5 Comparison of simulation time for various representative methods on CIFAR-10, the training and inference time is calculated based on the execution time of each batch

4.2 CIFAR-10 and SVHN

Both CIFAR-10 and SVHN consist of 10 categories of 32×32 images, and we use the same Wide ResNet-28-2 model for training on both. For CIFAR-10, we use 50,000 training images and 10,000 testing images. For SVHN, we use 73,257 training images and 26,032 test images. Moreover, we add 20% non-i.i.d. data to the unlabeled part of each dataset to simulate real-life OEL scenarios.

In each experiment, we establish two models for MM and use the unlabeled data to construct a simulated online learning environment, so that the models undergo a batch learning process of continuously arriving training data. We randomly select three different sets of labeled images from the training set and perform training and evaluation of all methods three times. We report the mean and standard deviation of the accuracy over the three runs in Table 4; the accuracy of MM is better than that of the other methods across the various labeled data settings.

Figure 3a shows the training curves of several representative methods on the CIFAR-10 dataset. We choose 28 iterations as an epoch for evaluation. Since the evaluation values fluctuate frequently throughout training, we use exponentially weighted averaging to smooth the curves. Mutual Match lags only slightly behind MixMatch in the first few epochs; in subsequent training, both its convergence speed and accuracy are higher than those of the other methods. Figure 3b compares several methods on the SVHN dataset, where MM again converges faster and continues to improve performance.

Fig. 3 Performance comparison and unlabeled data analysis during model training. (a) Accuracy curves of different models during training on CIFAR-10 with 4000 labeled examples. (b) Accuracy curves of different models during training on SVHN with 1000 labeled examples. (c) The number of effective pseudo-labels and the accuracy of the pseudo-labels obtained by our method during training

Since MM is a batch learning process with alternating training, we can easily evaluate how effectively the unlabeled data is utilized. Figure 3c shows that, on both datasets, the effective pseudo-labeled data generated by MM gradually increases with the training rounds, indicating that more unlabeled data can be used effectively. The average accuracy of these pseudo-labels reaches about 97.5% on both datasets, which means that most filtered images can effectively help model evolution.

Each MM iteration takes 7×64 unlabeled inputs, of which about 70% are used in consistency regularization. A small amount of erroneous supervision information may still affect model performance; these effects can be reduced by imposing stricter restrictions on pseudo-labels and iterating for more rounds.

4.3 CIFAR-100 and Tiny-imagenet

CIFAR-100 contains 50,000 training images of size 32×32 from 100 classes and 10,000 testing images; because it has more categories than CIFAR-10, we use the wider WRN-28-8 as the base model. Tiny-imagenet contains 100,000 training images and 25,000 test images from 200 classes at a size of 64×64; as the image size and number of categories grow, we use the deeper and wider WRN-34-10 as the base model. We again add 20% non-i.i.d. data to the unlabeled part of each dataset to simulate real-life scenarios.

CIFAR-100 and Tiny-imagenet are both more complex multi-classification tasks, so the performance of each method is not as good as on the previous datasets. We again use three different labeled data "folds" to train each method three times. As can be seen from Table 4, the accuracy of MM is still higher than that of the other methods. Additionally, knowledge sharing between the models accelerates model convergence.

Meanwhile, we find that category-related information yields larger gains in classification tasks with more categories and plays a positive role in both consistency regularization and interactive learning. This extra information allows our method to achieve a relatively significant performance improvement over other methods on complex tasks.

4.4 Ablation study

Varying the distribution and amount of unlabeled data

OEL systems usually face real-time non-i.i.d. data streams, parts of which are irrelevant to the target task. The characteristics of SSL methods make them susceptible to data with varying distributions (the model predicts a probability distribution for each input and uses it for SSL training). Our method reduces this influence through knowledge sharing and joint discrimination. In addition, to evaluate the gains of knowledge sharing between models, we design a performance comparison between several methods when the amount of unlabeled data is reduced.

We introduce different proportions of non-i.i.d. data into the unlabeled dataset to evaluate the performance of various methods in non-stationary perceptual environments. As can be seen in Fig. 4a, when the unlabeled dataset contains 10% to 50% non-i.i.d. data, the performance of our method remains within a relatively stable range compared to FixMatch and UDA. In Fig. 4b, even on the more complex Tiny-imagenet dataset, MM still maintains reliable performance with 50% non-i.i.d. data, while the other methods show significant degradation.

Fig. 4 Ablation study on various proportions of non-i.i.d. unlabeled data: (a) performance comparison of different methods with 10% to 50% non-i.i.d. unlabeled data on CIFAR-10; (b) performance comparison of different methods with 10% to 50% non-i.i.d. unlabeled data on Tiny-imagenet. Ablation study on various percentages of unlabeled data: (c) performance comparison of different methods with 50% to 90% of the unlabeled data for model training on SVHN; (d) performance comparison of different methods with 50% to 90% of the unlabeled data for model training on CIFAR-100

The performance of SSL methods is also sensitive to the amount of unlabeled data used for training, so we design corresponding comparative experiments on several databases. As Fig. 4c shows, because SVHN is a relatively simple dataset, reducing the unlabeled data (by 10% to 50%) does not greatly affect the performance of the various methods, and MM maintains better accuracy under the different unlabeled data ratios. On CIFAR-100, as Fig. 4d shows, under this more complex task MM still outperforms the other methods at the lower proportions (reduced by 20% to 50%) of unlabeled data. These results demonstrate that MM makes better use of knowledge sharing and interactive information in the learning process.

Soft or hard label in consistency regularization

We believe that using soft targets in consistency regularization enables the model to learn more category-related information, especially in a multi-model mutual learning method such as MM. We know that \(H(p,q)=H(p)+D_{KL}(p\|q)\), where H(p) is the entropy of the target probability distribution. This identity shows that using a soft target to calculate the supervision loss is equivalent to calculating the KL divergence between the label and the prediction while adding a low-entropy target regularization, which combines well with the calculation of interactive information between models. The method of mixing up [56] images and labels in [7] is also similar to using soft targets for model training and has achieved better performance. It can be seen from Fig. 5a that the overall accuracy of the model using soft-label consistency regularization is always higher than that of the hard-label version, and its convergence speed lags only slightly in the early stage. Soft supervision helps in most training cases but is somewhat unstable in a few. Overall, soft supervision brings about a 1.5% performance improvement to the model.
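For completeness, the identity used above follows in one line:

$$ H(p,q)=-\sum p\log q=-\sum p\log p+\sum p\log \frac{p}{q}=H(p)+D_{KL}(p\|q). $$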

Fig. 5 Ablation study on each module of Mutual Match. (a) Performance comparison of soft versus hard supervision consistency regularization in Mutual Match training. (b) Performance comparison when using weak or strong augmented outputs for unsupervised interactive information calculation. (c) Performance comparison of using or not using unsupervised sample mining

Weak or strong augmented outputs to calculate the unsupervised KL loss in Mutual Match

In MM, we mainly use the filtered predictions on weakly augmented unlabeled data to calculate the interactive KL loss between models, as we believe the model's predictions for weakly augmented data are more reliable than those for a strongly augmented version. The interaction loss plays a vital role in accelerating model convergence and improving model discrimination ability.

We calculate the unsupervised KL loss using the outputs on strongly augmented data and compare it with our final method. As seen in Fig. 5b, calculating the KL loss with strongly augmented outputs degrades model performance, likely because the reliability of predictions on strongly augmented data is reduced. Moreover, if the strongly augmented outputs were used to calculate both the unsupervised loss and the unsupervised KL loss in our soft-target setting (similar to [38], which performs no further filtering and calculates the KL loss directly on the prediction results), the conflict between the two loss objectives may weaken the final performance.

Sample mining

We further study the impact of the sample mining process on model performance. As shown in Fig. 5c, since the model does not generate many qualifying pseudo-labels early in training, the performance of the model with sample mining is initially almost the same as that of the model without it. After dozens of training epochs, the model using sample mining becomes stronger thanks to the expanded training set.

5 Conclusion

We propose Mutual Match, an SSL algorithm that fully uses the interactive information between models and the extra knowledge in the training process to realize an online-evolutive learning system. MM mainly comprises soft supervision in consistency regularization, mutual interactive training, and unsupervised sample mining. Our method surpasses multiple semi-supervised algorithms in training efficiency and accuracy under a unified OEL experiment setup and is well suited to model interaction and evolution.

Semi-supervised learning algorithms have developed rapidly and achieved good performance on many datasets. However, better-performing SSL methods generally rely on integrating a variety of techniques and complex data judgment and processing procedures. At the same time, they are sensitive to model structure, optimization methods, image augmentation methods, and numerous hyperparameter adjustments. As a result, semi-supervised learning algorithms face more unstable factors when extended to different tasks, and achieving good performance on a given database often requires adjusting a formidable number of configurations.

Our research focuses on using model interaction information to bring more gains to online training with frequently changing non-i.i.d. data streams. We have summarized the key factors with significant impact on SSL model performance and proposed a unified, easily extensible learning method for multi-model interactive training. With real-world applications in mind, we conduct comparative experiments on multiple databases under the same experimental configuration to demonstrate the effectiveness of our method. In the face of rapidly changing user-side perception scenarios, the proposed MM can continuously use the semantic information of unlabeled data to optimize model performance during learning, and leverage the computing and communication capabilities of edge and terminal devices to achieve better generalization.

In future research, we are considering interactive learning and knowledge fusion between a larger number of models, in order to provide better self-learning and autonomous evolution methods for real-world applications. We will also investigate the OEL system in an unsupervised setting.