Revisiting Consistency Regularization for Semi-Supervised Learning

Consistency regularization is one of the most widely-used techniques for semi-supervised learning (SSL). Generally, the aim is to train a model that is invariant to various data augmentations. In this paper, we revisit this idea and find that enforcing invariance by decreasing distances between features from differently augmented images leads to improved performance. However, encouraging equivariance instead, by increasing the feature distance, further improves performance. To this end, we propose an improved consistency regularization framework by a simple yet effective technique, FeatDistLoss, that imposes consistency and equivariance on the classifier and the feature level, respectively. Experimental results show that our model defines a new state of the art across a variety of standard semi-supervised learning benchmarks as well as imbalanced semi-supervised learning benchmarks. Particularly, we outperform previous work by a significant margin in low data regimes and at large imbalance ratios. Extensive experiments are conducted to analyze the method, and the code will be published.

Consistency regularization [2,28,44] is one of the most widely-used SSL methods. Recent work [46,51,27] achieves strong performance by using unlabeled data to make model predictions invariant to input perturbations. However, when using advanced and strong data augmentation schemes, we question whether the model should be invariant to such strong perturbations. On the right of Figure 1 we illustrate that strong data augmentation leads to perceptually highly diverse images. Thus, we argue that improving equivariance on such strongly augmented images can provide even better performance than making the model invariant to all kinds of augmentations. To this end, we propose a simple yet effective technique, Feature Distance Loss (FeatDistLoss), to improve data-augmentation-based consistency regularization.
We formulate FeatDistLoss to explicitly encourage invariance or equivariance between features from different augmentations while enforcing the same semantic class label. Figure 1 left shows the intuition behind the idea. Specifically, encouraging equivariance for the same image under different augmentations (increasing the distance between stars and circles of the same color) pushes representations apart from each other, thus covering more space for the class. Imposing invariance, on the contrary, makes the representations of the same semantic class more compact. In this work we empirically find that increasing equivariance to differently augmented versions of the same image can lead to better performance, especially when rather few labels are available per class (see section 4.3).
Fig. 1: Left: Binary classification task. Stars are features of strongly augmented images and circles are features of weakly augmented images. While encouraging invariance by decreasing the distance between features from differently augmented images gives good performance (left), encouraging equivariant representations by increasing the distance regularizes the feature space more, leading to even better generalization performance. Right: Examples of strongly and weakly augmented images from CIFAR-100. The visually large difference between them indicates that it can be more beneficial if they are treated differently.
This paper introduces the method CR-Match, which combines FeatDistLoss with other strong techniques and defines a new state of the art across a wide range of settings of standard SSL benchmarks, including CIFAR-10, CIFAR-100, SVHN, STL-10, and Mini-ImageNet. More specifically, our contribution is fourfold. (1) We improve data-augmentation-based consistency regularization by a simple yet effective technique for SSL called FeatDistLoss, which regularizes the distance between feature representations from differently augmented images of the same class. (2) We show that while encouraging invariance results in good performance, encouraging equivariance to differently augmented versions of the same image consistently results in even better generalization performance. (3) We provide comprehensive ablation studies on different distance functions and different augmentations with respect to the proposed FeatDistLoss. (4) In combination with other strong techniques, we achieve new state-of-the-art results across a variety of standard semi-supervised learning benchmarks, specifically in low data regimes.
that the model should output consistent predictions for perturbed versions of the same input. Many works explored different ways to generate such perturbations. For example, [48] uses an exponential moving average of the trained model to produce another input; [44,28] use random max-pooling and Dropout [47]; [51,6,46,27] use advanced data augmentation; [7,49,6] use MixUp regularization [54], which encourages convex behavior "between" examples. Another spectrum of popular approaches is pseudo-labeling [45,35,29], where the model is trained with artificial labels. [1] trained the model with "soft" pseudo-labels from network predictions; [38] proposed a meta learning method that deploys a teacher model to adjust the pseudo-label alongside the training of the student; [46,29] learn from "hard" pseudo-labels and only retain a pseudo-label if the largest class probability is above a predefined threshold. Furthermore, there are many excellent works around generative models [25,37,16] and graph-based methods [32,31,5,24]. We refer to [8,55,56] for a more comprehensive introduction of SSL methods.
Noise injection plays a crucial role in consistency regularization [51]. Thus advanced data augmentation, especially combined with weak data augmentation, introduces stronger noise to unlabeled data and brings substantial improvements [6,46]. [46] proposes to integrate pseudo-labeling into the pipeline by computing pseudo-labels from weakly augmented images and then using the cross-entropy loss between the pseudo-labels and the predictions for strongly augmented images. Besides the classifier-level consistency, our model also introduces consistency on the feature level, which explicitly regularizes representation learning and shows improved generalization performance. Moreover, self-supervised learning is known to be beneficial in the context of SSL. In [20,9,10,41], self-supervised pre-training is used to initialize SSL. However, these methods normally have several training phases, where many hyper-parameters are involved. We follow the trend of [53,6] to incorporate an auxiliary self-supervised loss alongside training. Specifically, we optimize a rotation prediction loss [19].
Equivariant representations have recently been explored by capsule networks [43,22]. They replace max-pooling layers with convolutional strides and dynamic routing to preserve more information about the input, allowing for preservation of part-whole relationships in the data. It has been shown that the input can be reconstructed from the output capsule vectors. Another stream of work on group equivariant networks [12,50,13] explores various equivariant architectures whose outputs transform in a predictable linear manner under transformations of the input. Different from previous work, our work explores equivariant representations in the sense that differently augmented versions of the same image are represented by different points in the feature space despite the same semantic label. As we will show in section 4.3, information like object location or orientation is more predictable from our model when features are pushed apart from each other.

CR-Match
Consistency regularization is a highly successful and widely adopted technique in SSL [2,28,44,46,51,27]. In this work, we aim to leverage and improve it by further regularizing the feature space. To this end, we present a simple yet effective technique, FeatDistLoss, to explicitly regularize representation learning and classifier learning at the same time. We describe our SSL method, called CR-Match, which shows improved performance across many different settings, especially in scenarios with few labels. In this section, we first describe our technique FeatDistLoss and then present CR-Match, which combines FeatDistLoss with other regularization techniques inspired by the literature.

Feature Distance Loss
Background: The idea of consistency regularization [2,28,44] is to encourage the model predictions to be invariant to input perturbations. Given a batch of $n$ unlabeled images $u_i$, $i \in (1, ..., n)$, consistency regularization can be formulated as the following loss function:
$$\frac{1}{n} \sum_{i=1}^{n} \lVert f(A(u_i)) - f(\alpha(u_i)) \rVert_2^2,$$
where $f$ is an encoder network that maps an input image to a $d$-dimensional feature space; $A$ and $\alpha$ are two stochastic functions which are, in our case, strong and weak augmentations, respectively (details in Section 3.3). Minimizing the L2 distance between the features of perturbed images encourages the representation to become more invariant to different augmentations, which helps generalization. The intuition is that a good model should be robust to data augmentations of the images.
FeatDistLoss: As shown in Figure 2, we extend the above consistency regularization idea by introducing consistency on the classifier level and invariance or equivariance on the feature level. FeatDistLoss thus allows different types of control to be applied at these two levels. In particular, when it is used to reduce the feature distance, it becomes similar to classic consistency regularization and encourages invariance between differently augmented images. As argued above, making the model predictions invariant to input perturbations gives good generalization performance. However, in this work we find it more beneficial to treat images from different augmentations differently, because some distorted images differ substantially from their original images, as demonstrated visually in Figure 1 right. Therefore, the final model (CR-Match) uses FeatDistLoss to increase the distance between image features from augmentations of different intensities while at the same time enforcing the same semantic label for them. Note that in Section 4.3, we conduct an ablation study on the choice of distance function, where we denote CR-Match as CR-Equiv, and the model that encourages invariance as CR-Inv.
The final objective for FeatDistLoss consists of two terms: $L_{Dist}$ (on the feature level), which explicitly regularizes feature distances between embeddings, and a standard cross-entropy loss $L_{PseudoLabel}$ (on the classifier level) based on pseudo-labeling.
With $L_{Dist}$ we either decrease or increase the feature distance between weakly and strongly augmented versions of the same image in a low-dimensional space projected from the original feature space, to overcome the curse of dimensionality [4]. Let $d(\cdot, \cdot)$ be a distance metric and $z$ be a linear layer that maps the high-dimensional feature into a low-dimensional space. Given an unlabeled image $u_i$, we first extract features with strong and weak augmentations by $f(A(u_i))$ and $f(\alpha(u_i))$ as shown in Figure 2 (a), and then FeatDistLoss is computed as:
$$L_{Dist}(u_i) = d(z(f(A(u_i))), z(f(\alpha(u_i)))).$$
Different choices of where to apply $L_{Dist}$ are studied in Section 4.3, where we find empirically that applying $L_{Dist}$ at (a) in Figure 2 gives the best performance.
At the same time, images from strong and weak augmentations should have the same class label because they are generated from the same original image. Inspired by [46], given an unlabeled image $u_i$, a pseudo-label distribution is first generated from the weakly augmented image by $\hat{p}_i = g(f(\alpha(u_i)))$, and then a cross-entropy loss is computed between the pseudo-label and the prediction for the corresponding strongly augmented version as:
$$L_{PseudoLabel}(u_i) = \mathrm{CE}(\hat{p}_i, g(f(A(u_i)))),$$
where CE is the cross-entropy, $g$ is a linear classifier that maps a feature representation to a class distribution, and $A(u_i)$ denotes the operator for strong augmentations.
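As an illustration of the feature-level term, the following minimal Python sketch evaluates $L_{Dist}$ for a single image. It assumes that features are plain lists of floats and that the projection $z$ is a bias-free linear map given as a weight matrix. The names `linear_project`, `cosine_similarity`, and `feat_dist_loss` are hypothetical and are not taken from the released code.

```python
import math


def linear_project(weights, feature):
    """Hypothetical stand-in for the linear layer z: one output per weight row."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two vectors; eps guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def feat_dist_loss(f_strong, f_weak, weights, d=cosine_similarity):
    """L_Dist for one image: d applied to the projected strong and weak features."""
    return d(linear_project(weights, f_strong), linear_project(weights, f_weak))


# Toy example: project 3-D features to 2-D by keeping the first two coordinates.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
loss_same = feat_dist_loss([1.0, 0.0, 5.0], [1.0, 0.0, -2.0], W)
loss_orthogonal = feat_dist_loss([1.0, 0.0, 5.0], [0.0, 1.0, -2.0], W)
```

Because the loss is minimized, passing a similarity as `d` pushes the two projected features apart, whereas passing a distance pulls them together.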
Putting it all together, FeatDistLoss processes a batch of unlabeled data $u_i$, $i \in (1, ..., B_u)$ with the following loss:
$$L_U = \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbb{1}\{c_i \geq \tau\} \left( L_{Dist}(u_i) + L_{PseudoLabel}(u_i) \right),$$
where $c_i = \max \hat{p}_i$ is the confidence score, and $\mathbb{1}\{\cdot\}$ is the indicator function, which outputs 1 when the confidence score is above the threshold $\tau$. This confidence thresholding mechanism ensures that the loss is only computed for unlabeled images for which the model generates a high-confidence prediction, which gives a natural curriculum to balance labeled and unlabeled losses [46].
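The confidence masking can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: class distributions are lists of probabilities, the pseudo-label is used as a full distribution rather than a one-hot label, the per-image $L_{Dist}$ values are precomputed, and the default threshold `tau=0.95` is a hypothetical value chosen only for the example. The function names are likewise hypothetical.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def unlabeled_loss(weak_probs, strong_probs, dist_terms, tau=0.95):
    """Average of (L_Dist + L_PseudoLabel) over the batch, masked by confidence.

    weak_probs[i]   : pseudo-label distribution from the weakly augmented image
    strong_probs[i] : prediction for the strongly augmented image
    dist_terms[i]   : precomputed L_Dist value for image i
    """
    total = 0.0
    for p_hat, p_strong, l_dist in zip(weak_probs, strong_probs, dist_terms):
        confidence = max(p_hat)
        if confidence >= tau:  # low-confidence images contribute nothing
            total += l_dist + cross_entropy(p_hat, p_strong)
    return total / len(weak_probs)  # normalize by the full batch size B_u


weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.90, 0.10], [0.50, 0.50]]
dist = [0.2, 0.7]
loss = unlabeled_loss(weak, strong, dist)  # only the first image passes the mask
```

Dividing by the full batch size rather than by the number of confident images means that the unlabeled loss grows naturally as more predictions become confident during training.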
As mentioned before, depending on the function $d$, FeatDistLoss can decrease the distance between features from different data augmentation schemes (when $d$ is a distance function, thus pulling the representations together), or increase it (when $d$ is a similarity function, thus pushing the representations apart). As shown in table 4, we find that both cases result in improved performance. However, increasing the distance between weakly and strongly augmented examples consistently results in better generalization performance. We conjecture that the reason is that FeatDistLoss, by increasing the feature distance, explores equivariance properties of the representations (differently augmented versions of the same image having distinct features but the same label). It encourages the model to represent weakly and strongly augmented images more distinctly while still imposing the same label, which leads to both a more expressive representation and a more powerful classifier. As we will show in Section 4.3, information like object location or orientation is more predictable from models trained with FeatDistLoss that pushes the representations apart. Additional ablation studies of other design choices such as the distance function and the linear projection $z$ are also provided in Section 4.3.

Overall CR-Match
Now we describe our SSL method, called CR-Match, which leverages the above FeatDistLoss. Pseudo-code for processing a batch of labeled and unlabeled examples is shown in algorithm 1 in the supplementary material.
Given a batch of labeled images with their labels, $X = \{(x_i, p_i) : i \in (1, ..., B_s)\}$, and a batch of unlabeled images, $U = \{u_i : i \in (1, ..., B_u)\}$, CR-Match minimizes the following learning objective:
$$L = L_S + \lambda_u L_U + \lambda_r L_{Rot},$$
where $L_S$ is the supervised cross-entropy loss for labeled images with weak data augmentation regularization; $L_U$ is our novel feature distance loss for unlabeled images, which explicitly regularizes the distance between weakly and strongly augmented images in the feature space; and $L_{Rot}$ is a self-supervised loss for unlabeled images and stands for rotation prediction from [19] to provide an additional supervisory and regularizing signal.
Fully supervised loss for labeled data: We use cross-entropy loss with weak data augmentation regularization for labeled data:
$$L_S = \frac{1}{B_s} \sum_{i=1}^{B_s} \mathrm{CE}(p_i, g(\alpha(x_i))),$$
where CE is the cross-entropy loss, $\alpha(x_i)$ is the extracted feature from a weakly augmented image $x_i$, $g$ is the same linear classifier as in equation 2, and $p_i$ is the corresponding label for $x_i$.
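Self-supervised loss for unlabeled data: Rotation prediction [19] (RotNet) is one of the most successful self-supervised learning methods, and has been shown to be complementary to SSL methods [53,7,41]. Here, we create four rotated images by 0°, 90°, 180°, and 270° for each unlabeled image, and train the model to predict the applied rotation:
$$L_{Rot} = \frac{1}{4 B_u} \sum_{i=1}^{B_u} \sum_{r \in R} \mathrm{CE}(r, h(\alpha(\mathrm{Rotate}(u_i, r)))),$$
where $R$ is $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$, and $h$ is a predictor head.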
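The construction of the rotation prediction inputs can be sketched as follows. The sketch assumes images are small 2-D grids stored as lists of rows and encodes the four rotations as class indices 0 to 3 (multiples of 90°); the function names and the counter-clockwise direction are illustrative assumptions rather than details from the paper.

```python
def rotate90(img):
    """Rotate a 2-D grid (list of rows) by 90 degrees counter-clockwise."""
    # Transposing and then reversing the row order gives a counter-clockwise turn.
    return [list(row) for row in zip(*img)][::-1]


def rotation_batch(images):
    """Return the four rotated copies of every image with rotation labels 0..3."""
    inputs, labels = [], []
    for img in images:
        rotated = img
        for k in range(4):  # k counts quarter turns: 0, 90, 180, 270 degrees
            inputs.append(rotated)
            labels.append(k)
            rotated = rotate90(rotated)
    return inputs, labels


toy_image = [[1, 2],
             [3, 4]]
rot_inputs, rot_labels = rotation_batch([toy_image])
```

A predictor head would then be trained with cross-entropy on `rot_labels`, so that each unlabeled image contributes four rotation classification terms.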

Implementation Details
Data augmentation: As mentioned above, CR-Match adopts two types of data augmentations: weak augmentation and strong augmentation from [46]. Specifically, the weak augmentation $\alpha$ corresponds to a standard random cropping and random mirroring with probability 0.5, and the strong augmentation $A$ is a combination of RandAugment [14] and CutOut [17]. At each training step, we uniformly sample two operations for the strong augmentation from a collection of transformations and apply them with a randomly sampled magnitude from a predefined range. The complete table of transformation operations for the strong augmentation is provided in the supplementary material.
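The two augmentation types can be illustrated with a small Python sketch. It is a toy version under several assumptions: images are 2-D lists, the crop is taken after zero padding by `pad` pixels, the operation names and magnitude ranges in `STRONG_OPS` are hypothetical placeholders for the table in the supplementary material, operations are sampled with replacement, and the sketch only samples a strong policy without applying the transformations or CutOut.

```python
import random


def weak_augment(img, pad=1, rng=random):
    """Random mirroring with probability 0.5, then a random crop after zero padding."""
    h, w = len(img), len(img[0])
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]  # horizontal flip
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = img[r][c]
    top = rng.randint(0, 2 * pad)   # crop offset, inclusive on both ends
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]


# Hypothetical operations with hypothetical magnitude ranges (illustration only).
STRONG_OPS = {
    "rotate": (-30.0, 30.0),
    "shear_x": (-0.3, 0.3),
    "brightness": (0.05, 0.95),
    "contrast": (0.05, 0.95),
}


def sample_strong_policy(rng=random, num_ops=2):
    """Uniformly pick num_ops operations, each with a random magnitude in its range."""
    names = sorted(STRONG_OPS)
    policy = []
    for _ in range(num_ops):
        name = rng.choice(names)
        low, high = STRONG_OPS[name]
        policy.append((name, rng.uniform(low, high)))
    return policy
```

In a training loop, each unlabeled image would pass once through the weak pipeline and once through the transformations named by a freshly sampled strong policy.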
Other implementation details: For our results in Section 4, we minimize the cosine similarity in FeatDistLoss, and use a fully-connected layer for the projection layer $z$. The predictor head $h$ in the rotation prediction loss consists of two fully-connected layers and a ReLU [34] as non-linearity. We use the same $\lambda_u = \lambda_r = 1$ in all experiments. As a common practice, we repeat each experiment with five different data splits and report the mean and the standard deviation of the error rate. We refer to the supplementary material for further details and ablation of the hyper-parameters.
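The reporting protocol over five splits amounts to a mean and a standard deviation, which the short sketch below computes. The error rates are hypothetical numbers used only for illustration, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the text does not specify which is reported.

```python
import statistics


def summarize_error_rates(error_rates):
    """Mean and sample standard deviation of error rates across data splits."""
    return statistics.mean(error_rates), statistics.stdev(error_rates)


# Hypothetical error rates (in %) from five data splits, for illustration only.
example_errors = [10.0, 12.0, 11.0, 9.0, 13.0]
mean_err, std_err = summarize_error_rates(example_errors)
print(f"{mean_err:.2f} ± {std_err:.2f}")  # prints 11.00 ± 1.58
```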

Experimental Results
Following protocols from previous work [7,46]

Main Results
In the following, each dataset subsection includes two paragraphs. The first provides technical details and the second discusses experimental results.
CIFAR-10, CIFAR-100, and SVHN. We follow prior work [46] and use 4, 25, and 100 labels per class on CIFAR-100 and SVHN without extra data. For CIFAR-10, we experiment with settings of 4, 25, and 400 labels per class. We create labeled data by random sampling, and the remaining images are regarded as unlabeled by discarding their labels. Following [7,46,6], we use a Wide ResNet-28-2 [52] with 1.5M parameters on CIFAR-10 and SVHN, and a Wide ResNet-28-8 with 135 filters per layer (26M parameters) on CIFAR-100.
As shown in table 1 and table 2, our method improves over previous methods across all settings and defines a new state of the art. Most importantly, we improve error rates in low data regimes by a large margin (e.g., with 4 labeled examples per class on CIFAR-100, we outperform FixMatch and the second best method by 9.40% and 4.83% in absolute value, respectively). Prior works [46,7,6] have reported results using a larger network architecture on CIFAR-100 to obtain better performance. In contrast, we additionally evaluate our method on the small network used for CIFAR-10 and find that our method is more than 17 times (26M/1.5M ≈ 17) more parameter-efficient than FixMatch. We reach a 46.05% error rate on CIFAR-100 with 4 labels per class using the small model, which is still slightly better than the result of FixMatch using the larger model.
STL-10. STL-10 contains 5,000 labeled images of size 96-by-96 from 10 classes and 100,000 unlabeled images. The dataset pre-defines ten folds of 1,000 labeled examples from the training data, and we evaluate our method on five of these ten folds as in [46,6]. Following [7], we use the same Wide ResNet-37-2 model (comprising 5.9M parameters), and report error rates in table 2.
Our method achieves state-of-the-art performance with a 4.89% error rate. Note that FixMatch, with an error rate of 5.17%, used the more advanced CTAugment [6], which learns augmentation policies alongside model training. When evaluated with the same data augmentation (RandAugment) as we use in CR-Match, our result surpasses FixMatch by 3.09% (7.98% - 4.89% = 3.09%), which indicates that CR-Match itself induces a strong regularization effect.
Mini-ImageNet. We follow [23,1,27] to construct the mini-ImageNet training set. Specifically, 50,000 training examples and 10,000 test examples are randomly selected for a predefined list of 100 classes [40] from ILSVRC [15]. Following [27], we use a ResNet-18 network [21] as our model and experiment with settings of 40 labels per class and 100 labels per class.
As shown in table 2, our method consistently improves over previous methods and achieves a new state of the art in both the 40-label and 100-label settings. Especially in the 40-label case, CR-Match achieves an error rate of 34.87%, which is 4.18% lower than the second best result. Note that our method is 2 times more data efficient than the second best method FeatMatch [27] (FeatMatch, using 100 labels per class, reaches a similar error rate as our method with 40 labeled examples per class).
We conduct experiments on a single split on CIFAR-10, CIFAR-100, and SVHN with 4 labeled examples per class, and on Mini-ImageNet with 40 labels per class. Specifically, we remove the $L_{Dist}$ from equation 4 and train the model again using the same training scheme for each setting. We do not ablate $L_{PseudoLabel}$ and $L_S$ because removing either of them leads to a divergence of training. We present a more detailed analysis of the pseudo-label error rate, the error rate of contributing unlabeled images, and the percentage of contributing unlabeled images during training on CIFAR-100 in the supplementary material.
We report final test error rates in table 3. We see that both RotNet and FeatDistLoss contribute to the final performance while their proportions can be different depending on the setting and dataset.For MiniImageNet, CIFAR-100 and SVHN, the combination of both outperforms the individual losses.For CIFAR-10, FeatDistLoss even outperforms the combination of both.This suggests that RotNet and FeatDistLoss are both important components for CR-Match to achieve the state-of-the-art performance.

Influence of Feature Distance Loss
In this section, we analyze different design choices for FeatDistLoss to provide additional insights into how it helps generalization. We focus on a single split with 4 labeled examples per class from CIFAR-100 and report results for a Wide ResNet-28-2 [52]. For fair comparison, the same 4 random labeled examples for each class are used across all experiments in this section.
Different distance metrics for FeatDistLoss. Here we discuss the effect of different metric functions $d$ for FeatDistLoss. Specifically, we compare two groups of functions in table 4 left: metrics that increase the distance between features, including cosine similarity, negative JS divergence, and L2 similarity (i.e. normalized negative L2 distance); and metrics that decrease the distance between features, including cosine distance, JS divergence, and L2 distance. We find that both increasing and decreasing the distance between features of different augmentations give reasonable performance. However, increasing the distance always performs better than its counterpart (e.g., cosine similarity is better than cosine distance). We conjecture that decreasing the feature distance corresponds to an increase of the invariance to data augmentation and leads the model to ignore information like rotation or translation of the object. In contrast, increasing the feature distance while still imposing the same label makes the representation equivariant to these augmentations, resulting in a more descriptive and expressive representation with respect to augmentation. Moreover, a classifier has to cover a broader region of the feature space to recognize rather dissimilar images from the same class, which leads to improved generalization. In summary, we found that both increasing and decreasing the feature distance improve over the model that only applies consistency on the classifier level, whereas increasing distances shows better performance by making representations more equivariant.
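The pairing of "pull" and "push" metrics can be made concrete with a short Python sketch. It is illustrative only: it assumes the JS divergence is computed on softmax-normalized feature vectors, it builds each push variant by simply negating the pull metric, and it omits the normalization mentioned for the L2 similarity because the text does not define it. All function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def softmax(v):
    """Turn a feature vector into a probability distribution (an assumption here)."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q); terms with p = 0 contribute nothing."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def negate(metric):
    """Turn a 'pull' metric into a 'push' metric: minimizing -d maximizes d."""
    return lambda a, b: -metric(a, b)


neg_js = negate(js_divergence)
neg_l2 = negate(l2_distance)
```

Minimizing `l2_distance` or `js_divergence` pulls the two feature vectors together, while minimizing `neg_l2` or `neg_js` pushes them apart, mirroring the two groups of metrics compared in this ablation.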
Invariance and equivariance. Here we provide an additional analysis to demonstrate that increasing the feature distance provides equivariant features while decreasing it provides invariant features. Based on the intuition that specific transformations of the input image should be more predictable from equivariant representations, we quantify equivariance by how accurately a linear classifier can distinguish between features from augmented and original images. Specifically, we compare two models from table 4 left: the model trained with cosine similarity, denoted as CR-Equiv, and the model trained with cosine distance, denoted as CR-Inv. We train a linear SVM to predict whether a certain transformation is applied to the input image. 1000 test images from CIFAR-100 are used for training and the rest (9000) for validation. The binary classifier is trained by an SGD optimizer with an initial learning rate of 0.001 for 50 epochs, and the feature extractor is fixed during training. We evaluate translation, scaling, rotation, and color jittering in table 4 right. All augmentations are from the standard PyTorch library. The SVM has a lower error rate across all augmentations when trained on CR-Equiv features, which means information like object location or orientation is more predictable from CR-Equiv features, suggesting that CR-Equiv produces more equivariant features than CR-Inv. Furthermore, if the SVM is trained to classify strongly and weakly augmented image features, CR-Equiv achieves a 0.27% test error while CR-Inv reaches 46.18%.
Table 4 (columns: Metric, Error), caption continued: Metrics that pull features together perform worse than those that push features apart. The error rate of CR-Match without FeatDistLoss is shown at the bottom. Right: Error rates of binary classification (whether a specific augmentation is applied) on the features from CR-Equiv (increasing the cosine distance) and CR-Inv (decreasing the cosine distance). We evaluate translation, scaling, rotation, and color jittering. Lower error rate indicates more equivariant features. Results are averaged over 10 runs.
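The linear probing protocol can be sketched on toy data. The following Python example is a minimal illustration, not the evaluation code: it replaces the frozen CR-Equiv and CR-Inv encoders with hypothetical 2-D toy features (`make_toy_features`), where a non-zero `shift` mimics an encoder whose features move under the augmentation, and it trains a linear SVM with plain SGD on the hinge loss using the learning rate and epoch count quoted above.

```python
import random


def make_toy_features(n, shift, rng):
    """Toy 2-D features; label +1 = augmented, -1 = original.

    shift > 0 mimics an equivariant encoder (augmentation moves the feature),
    shift = 0 mimics an invariant encoder (augmentation leaves no trace).
    """
    feats, labels = [], []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1
        feats.append([y * shift + rng.uniform(-0.5, 0.5), rng.uniform(-1.0, 1.0)])
        labels.append(y)
    return feats, labels


def score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(feats, labels, lr=0.001, epochs=50, seed=0):
    """Plain SGD on the hinge loss; the features are fixed inputs."""
    rng = random.Random(seed)
    w, b = [0.0] * len(feats[0]), 0.0
    order = list(range(len(feats)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = feats[i], labels[i]
            if y * score(w, b, x) < 1.0:  # inside the margin: take a gradient step
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def error_rate(w, b, feats, labels):
    wrong = sum(1 for x, y in zip(feats, labels) if y * score(w, b, x) <= 0.0)
    return wrong / len(feats)


def probe_error(shift, seed=0):
    """Train on one toy set and report the error on a held-out toy set."""
    rng = random.Random(seed)
    train_x, train_y = make_toy_features(200, shift, rng)
    val_x, val_y = make_toy_features(200, shift, rng)
    w, b = train_linear_svm(train_x, train_y)
    return error_rate(w, b, val_x, val_y)
```

With `shift=2.0` the probe separates augmented from original features, whereas with `shift=0.0` the labels carry no signal in the features and the probe cannot do better than chance on held-out data, which is the qualitative gap the analysis measures.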
Different data augmentations for FeatDistLoss. In our main results in Section 4.1, FeatDistLoss is computed between features generated by weak augmentation and strong augmentation. Here we investigate the impact of FeatDistLoss with respect to different types of data augmentations. Specifically, we evaluate the error rate of CR-Inv and CR-Equiv under three augmentation strategies: weak-weak pair indicates that FeatDistLoss uses two weakly augmented images, weak-strong pair indicates that FeatDistLoss uses a weak augmentation and a strong augmentation, and strong-strong pair indicates that FeatDistLoss uses two strongly augmented images.
As shown in table 5, both CR-Inv and CR-Equiv with weak-strong pairs consistently outperform the other augmentation settings (weak-weak and strong-strong). Additionally, CR-Equiv consistently achieves better generalization performance across all three settings. In particular, in the case advocated in this paper, namely using weak-strong pairs, CR-Equiv outperforms CR-Inv by 1.46%. Even in the other two settings, CR-Equiv leads to improved performance, although only by a small margin. This suggests, on the one hand, that it is important to use different types of augmentations for our FeatDistLoss. On the other hand, maximizing distances between images that are inherently different while still imposing the same class label makes the model more robust against changes in the feature space and thus gives better generalization performance.
Linear projection and confidence threshold in FeatDistLoss. As mentioned in Section 3, we apply $L_{Dist}$ at (a) in Figure 2 with a linear layer mapping the feature from the encoder to a low-dimensional space before computing the loss, to alleviate the curse of dimensionality. Also, the loss only takes effect when the model's prediction has a confidence score above a predefined threshold $\tau$. Here we study the effect of other design choices in table 6. While features after the global average pooling (i.e. (b)) give a better result than the ones directly from the feature extractor, (b) performs worse than (a) when additional projection heads are added. Thus, we use features from the feature extractor in CR-Match.
The error rate increases from 45.52% to 48.37% when removing the linear layer and to 47.52% when replacing the linear layer by an MLP (two fully-connected layers and a ReLU activation function). This suggests that a lower dimensional space serves better for comparing distances, but a non-linear mapping does not give further improvement. Moreover, when we apply FeatDistLoss for all pairs of input images by removing the confidence threshold, the test error increases from 45.52% to 46.94%, which suggests that regularization should only be performed on features that are actually used to update the model parameters, while ignoring those that are also ignored by the model.

Conclusion
The idea of consistency regularization has given rise to many successful works for SSL [2,28,44,46,51,27]. While making the model invariant against input perturbations induced by data augmentation gives improved performance, the scheme tends to be suboptimal when augmentations of different intensities are used. In this work, we propose a simple yet effective improvement, called FeatDistLoss. It introduces consistency regularization on both the classifier level, where the same class label is imposed for versions of the same image, and the feature level, where distances between features from augmentations of different intensities are increased. By encouraging the representation to distinguish between weakly and strongly augmented images, FeatDistLoss encourages more equivariant representations, leading to improved classification boundaries and a more robust model. Through

Fig. 2: The proposed FeatDistLoss utilizes unlabeled images in two ways: On the classifier level, different versions of the same image should generate the same class label, whereas on the feature level, representations are encouraged to become either more equivariant (pushing away) or invariant (pulling together). $f$ and $\hat{f}$ denote strong and weak features; $p$ and $\hat{p}$ are predicted class distributions from strong and weak features; a) and b) denote features before and after the global average pooling layer. Our final model takes features from a) and encourages equivariance to differently augmented versions of the same image. An ablation study of other choices is in section 4.3.
* * Numbers are generated by [46].† Numbers are produced without CutOut.The best number is in bold and the second best number is in italic.

Table 2: Left: Error rates on STL-10 and SVHN. A Wide ResNet-28-2 and a Wide ResNet-37-2 [52] are used for SVHN and STL-10, respectively. The same code base is adopted as [46] to make the results directly comparable. Notations follow table 1. Right: Error rates on Mini-ImageNet with 40 labels and 100 labels per class. All methods are evaluated on the same ResNet-18 architecture.

Table 3: Ablation studies across different settings. Error rates are reported for a single split.

Table 4: Left: Effect of different distance functions for FeatDistLoss. The same split on CIFAR-100 with 4 labels per class and a Wide ResNet-28-2 is used for all experiments.

Table 6: Effect of the projection head $z$, and the place to apply $L_{Dist}$.