Introduction

To reduce reliance on large amounts of labeled data, clustering has once again attracted attention. Traditional clustering methods, such as K-means [1] and spectral clustering [2], have been widely used in various tasks. However, they are limited to low-dimensional data and cannot effectively handle high-dimensional data such as images and videos. To address this problem, deep clustering [3] was proposed, which maps raw data into a nonlinear embedding space to obtain distinguishable features for clustering. In addition, DAC [4], DCCM [5], and IIC [6] propose different ways to learn discriminative label features under different constraints, and PICA [7] learns the most plausible semantic clustering solution by partition confidence maximization. Although these methods achieve good classification results, the learned features still lack stable semantic information.

To obtain more stable semantics for clustering, CC [8] adopts both instance-level and cluster-level contrastive learning to achieve good semantic clustering. GCC [9] further employs KNN graphs built over the encoded image features and constructs a contrastive loss from the perspective of graph nodes, achieving good semantic clustering with a lightweight network. Although both methods consider cluster-level information, their instance-level and cluster-level features are optimized separately with different loss functions, which makes it difficult to reach the optimal classification result. Different from the above work, as shown in Fig. 1a, SCAN [10] mines the nearest neighbors of each instance in the embedding space, assigns the same pseudo-labels to adjacent samples, and uses them to train the classifier, achieving sound performance.

Fig. 1
figure 1

The motivation of the proposed nearest-neighbor clustering based on meta-features. The clustering in Fig. 1a is based on instance neighbors, where the nearest-neighbor samples of each instance are given the same pseudo-label and used to train the classifier. Although this approach has achieved great success, it struggles to obtain stable semantic features and is prone to error accumulation due to misclassification at the boundaries. As shown in Fig. 1b, to solve these problems, we propose global nearest-neighbor clustering based on meta-features. We construct semantically stable meta-features from confident samples within a batch, assign the same pseudo-labels to the global nearest neighbors of the meta-features, and use them to train the classifier. Notably, we further employ label smoothing to handle the noise introduced by the pseudo-labels and avoid overconfidence in the model

However, samples on the clustering boundary may be assigned to the wrong class owing to the semantic inconsistency of neighboring samples, and repeatedly training on these incorrect labels leads to error accumulation in the classifier. In light of these defects, we propose a pseudo-supervised clustering method based on meta-features. The method not only considers instance-level semantics but also abstracts stable semantic features (i.e., meta-features) and propagates semantic information to global nearest-neighbor samples in the form of pseudo-labels. Specifically, as shown in Fig. 1b, we first employ a pre-trained contrastive learning model to obtain instance features. Next, we abstract meta-features that represent the categories to provide more stable semantic features for clustering. Afterward, we assign the same pseudo-label to the neighboring samples in a batch with each meta-feature as the center, which effectively avoids the possible semantic inconsistency among the neighbors of an instance. Finally, we correct the pseudo-labels with a cross-entropy loss with label smoothing, which effectively alleviates error accumulation. Compared with previous methods, our method not only considers category semantic information but also effectively solves the problem of error accumulation caused by the inconsistency of instance neighbors. Furthermore, compared with CC and GCC, our method accomplishes a direct mapping from features to stable semantic labels, avoiding the suboptimal solutions of multi-level optimization.

The main contributions of this paper are as follows:

  1.

    To ensure the semantic stability of the clustering labels, we propose a method that assigns pseudo-labels to the global nearest neighbors of meta-features. The meta-features consider both instance-level and category-level semantics, and pseudo-labels are then assigned to their global nearest neighbors to improve the semantic stability of clustering.

  2.

    Our method optimizes pseudo-labels directly with a cross-entropy loss with label smoothing, which effectively avoids the sub-optimal solutions that result from optimizing multiple loss functions.

  3.

    Extensive experimental results show that the proposed pseudo-supervised clustering framework based on meta-features achieves excellent performance.

The remainder of this paper is organized as follows. The next section introduces related work and compares it with our method. The following section details each component of our model and how it works. The subsequent section presents the datasets we adopt and the details of our experiments. Finally, we summarize our work.

Related work

Unsupervised representation learning model

In recent years, contrastive learning has made significant progress and can capture discriminative features without any manual annotation. For example, MoCo [11] formulates contrastive learning as dictionary look-up and builds a dynamic dictionary with a queue and a moving-averaged encoder. MoCov2 [12] improves MoCo with an MLP projection head and more data augmentation. SimCLR [13] simplifies recently proposed contrastive self-supervised learning algorithms without requiring a dedicated architecture or a memory bank. SimCLRv2 [14] shows that larger self-supervised models are more label-efficient. SwAV [15] does not directly compare two augmented views but uses one view to predict the code assigned to the other view from a set of learnable prototypes. Although these methods learn outstanding feature representations, the instance-level information is not stable enough for clustering. Mining and applying stable properties for good clustering assignments remains a challenge.

Deep clustering

Deep learning enables large-scale deep feature extraction through its hierarchical nonlinear mapping capability. As a result, deep clustering algorithms have become a research hotspot in unsupervised learning. For example, DEC [16] applies K-means [1] to pre-trained image features to initialize the cluster centers and then fine-tunes the model by learning from confident cluster assignments to sharpen the predicted distribution. JULE [17] jointly optimizes the CNN in a recurrent manner, where merging operations of agglomerative clustering are conducted in the forward pass and representation learning is performed in the backward pass. DAC [4], DDC [18], and DCCM [5] alternately update the cluster assignments and the similarities between samples during training. However, they are susceptible to inconsistent estimations in the neighborhoods and thus suffer from severe error propagation during training.

There are also works that combine deep representation learning [19,20,21,22,23,24] with traditional clustering methods [25,26,27,28,29]. Most recent research is based on clustering assignments that maximize mutual information. For example, IIC [6] learns cluster assignments by maximizing the mutual information between images and their augmentations. DCDC [30] performs contrastive learning over both sample and class views to obtain better representations. However, such clustering methods rely on initialization and are likely to latch onto low-level features. SCAN [10] first proposes a two-stage scheme: in the first stage, it finds the nearest neighbors of each feature via contrastive self-supervised learning; in the second stage, it forces adjacent features to have similar probability distributions. In this process, the same pseudo-labels are assigned to the neighboring embedding features of each instance. While this approach has achieved great success, it still has the following disadvantages. As depicted in Fig. 1a, on the one hand, instance-based clustering lacks stable semantics; on the other hand, it causes error accumulation. From the analysis above, we find that current deep clustering methods have two shortcomings: one is the lack of stable semantic supervision signals, and the other is that previous clustering frameworks perform indirect optimization, which leads to sub-optimal solutions. Unlike these methods, we use pseudo-supervised optimization to construct stable semantic pseudo-labels in the second clustering stage. Our approach effectively reduces error accumulation at classification boundaries and improves clustering accuracy.

Method

We present our approach in the following sections. Firstly, we pre-train a contrastive representation learning model to obtain features. After acquiring the features, we compute meta-features and construct a pseudo-label optimization network to achieve semantic clustering. The detailed procedure of our method is shown in Fig. 2 and Algorithm 1.

Feature extraction network

Considering that some previous end-to-end deep clustering methods are sensitive to network initialization, we utilize MoCo [11] to obtain instance semantic information. MoCo replaces the memory bank with a momentum encoder and a dynamically updated queue q that stores the features of previous iterations; we leverage the pre-trained model to extract instance-level features.
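As a rough illustration of the momentum encoder and queue mentioned above, the following sketch (our simplification of the standard MoCo recipe, not the authors' code) shows how the key encoder can track the query encoder as a moving average and how the queue q can be refreshed with the newest keys; the function and variable names are ours.

```python
# Minimal sketch of MoCo-style momentum update and queue maintenance (not the authors' code).
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder tracks the query encoder as an exponential moving average.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # queue: (D, M) feature buffer; queue_ptr: 1-element long tensor; keys: (B, D) newest keys.
    # Assumes the queue size M is divisible by the batch size B, as in MoCo.
    batch_size = keys.shape[0]
    ptr = int(queue_ptr)
    queue[:, ptr:ptr + batch_size] = keys.T            # overwrite the oldest entries
    queue_ptr[0] = (ptr + batch_size) % queue.shape[1]
```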

Fig. 2
figure 2

Framework of our proposed method. The framework consists of a feature extraction network and a label optimization network based on meta-features. The feature extraction network is used to acquire instance features, and the meta-feature-based label optimization network maps features to labels. Specifically, we obtain the instance features, find the nearest neighbors based on feature similarity, and take the feature-weighted average of confident samples as meta-features. Meanwhile, we find the global nearest neighbors of the meta-features and assign the same pseudo-labels to the meta-features and their global neighboring samples. Finally, we utilize the pseudo-labels to train the classifier. After each iteration, the classifier learns more semantic information than before

We define the samples and their augmented versions as s and \({s}'\), and the corresponding encoded features as u and \({u}'\), respectively. Following MoCo, the contrastive loss takes the form:

$$\begin{aligned} L_\textrm{moco}=-\log \frac{\exp (sim({{u}_{i}},{{u}_{i}}^{'})/\theta )}{\sum \nolimits _{j=1}^{M}{\exp (sim({{u}_{i}},{{k}_{j}})/\theta )}} \end{aligned}$$
(1)

where \(sim(\cdot ,\cdot )\) is a similarity function (e.g., cosine similarity), \(\theta \) is an adjustable temperature parameter, and M is the queue size. The loss involves one positive sample and M negative samples. Since the queue q does not require gradients, M can be set large, which benefits feature extraction.
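For concreteness, a minimal sketch of the loss in Eq. (1) is given below. It assumes L2-normalized features so that the dot product equals cosine similarity and, as in the standard MoCo implementation, places the positive pair at index 0 of the logits (so the denominator also includes the positive term); the names moco_loss, u, u_pos, and queue are ours.

```python
# Minimal sketch of the contrastive loss in Eq. (1); variable names are ours.
import torch
import torch.nn.functional as F

def moco_loss(u, u_pos, queue, theta=0.07):
    """u, u_pos: (B, D) query/positive features; queue: (D, M) negative keys from q."""
    u, u_pos = F.normalize(u, dim=1), F.normalize(u_pos, dim=1)
    l_pos = torch.einsum("bd,bd->b", u, u_pos).unsqueeze(-1)   # similarity to the positive, (B, 1)
    l_neg = torch.einsum("bd,dm->bm", u, queue)                # similarities to M negatives, (B, M)
    logits = torch.cat([l_pos, l_neg], dim=1) / theta          # temperature scaling
    labels = torch.zeros(u.size(0), dtype=torch.long, device=u.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```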

Pseudo-label optimization network

Inspired by “label as representation” [8], we regard image classification as the process of distinguishing samples by assigning corresponding semantic labels. In this work, we adopt a method similar to supervised learning that directly maps features to stable semantic labels. Precisely, the pseudo-label optimization network consists of pseudo-label annotation based on meta-features, classifier optimization, and label smoothing.

Pseudo-label annotation based on meta-features

Our task is to divide the dataset \(\chi \) into S pre-defined classes without any manual labeling. Specifically, the pre-trained contrastive learning network serves as our feature extraction network \(f(\centerdot )\). The input consists of two parts: the raw data x and the augmented data R(x). Feeding x to the network \(f(\centerdot )\) yields the embedded feature f(x); likewise, feeding R(x) yields f(R(x)). Afterwards, f(R(x)) is passed to the classifier, which is composed of two MLP layers and outputs the predicted probability \(P_{k}\). Finally, we select the K features with the highest confidence within each semantic class.

$$\begin{aligned} P_{k}= & {} \Phi c(f(R(x))) \end{aligned}$$
(2)
$$\begin{aligned} K= & {} \tau \times n \end{aligned}$$
(3)
$$\begin{aligned} C_{k}= & {} \textrm{top}K(P_{ki},f(x)) \end{aligned}$$
(4)

where \(k\in [1,2,\ldots ,S]\), \(i\in [1,2,\ldots ,K]\), \(\tau \) is the confidence ratio, and n represents the number of images of a certain class. In the embedding space, we select the K embedding features of a class with the highest probability based on the cosine similarity measure. We regard these K samples as confident samples, and the weighted average of their embedding features is taken as the meta-feature \(f_\textrm{Metak}\) of the class,

$$\begin{aligned} f_\textrm{Metak}=\frac{1}{K}\sum \limits _{i=1}^{K}{C_{i}} \end{aligned}$$
(5)

where \(f_\textrm{Metak}\) preserves the class semantics of the meta-feature. After obtaining the meta-features, we search for neighboring embedding features and further assign the same pseudo-labels to these neighbors.

$$\begin{aligned} l_{k}=N_{t}(f_\textrm{Metak}) \end{aligned}$$
(6)

where \(N_{t}(\centerdot )\) denotes the global nearest neighbors of a sample and \(l_{k}\) is the pseudo-label assigned to the neighbors of the meta-feature. Note that, to reduce computation, the nearest-neighbor samples mentioned here are restricted to samples within a batch. The number of neighbors m in the global neighborhood is set to \(m = N/C\), where N is the number of samples within a batch and C is the number of classes.
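The following sketch summarizes our reading of Eqs. (2)-(6) on a single batch: select the K most confident samples per class, average their features into a meta-feature, and assign that class's pseudo-label to the m = N/C global nearest neighbors. It is a simplified illustration under the assumption that n in Eq. (3) corresponds to the per-class share of the batch; names such as assign_pseudo_labels are ours, not the authors'.

```python
# Minimal sketch of meta-feature construction and pseudo-label assignment (Eqs. 2-6).
import torch
import torch.nn.functional as F

def assign_pseudo_labels(feats, probs, num_classes, tau=0.6):
    """feats: (N, D) embedded features f(x); probs: (N, C) classifier outputs P_k."""
    N = feats.size(0)
    K = max(1, int(tau * N / num_classes))   # confident samples per class (Eq. 3), assuming n ~ N/C per batch
    m = N // num_classes                     # global neighbors per meta-feature, m = N / C
    pseudo = torch.full((N,), -1, dtype=torch.long)
    feats_n = F.normalize(feats, dim=1)      # cosine similarity via normalized dot products
    for k in range(num_classes):
        confident = probs[:, k].topk(K).indices          # K most confident samples of class k (Eq. 4)
        meta_k = feats_n[confident].mean(dim=0)          # meta-feature of class k (Eq. 5)
        sims = feats_n @ F.normalize(meta_k, dim=0)      # similarity of every sample to the meta-feature
        neighbors = sims.topk(m).indices                 # global nearest neighbors in the batch (Eq. 6)
        pseudo[neighbors] = k                            # propagate the class pseudo-label (later classes may overwrite overlaps in this simplification)
    return pseudo
```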

Classifier optimization

The pseudo-label optimization network is designed to train a classifier that maps embedding features to pseudo-labels. This differs from previous work that mines instance-level semantic similarity or focuses directly on clustering. Pseudo-supervised classification based on meta-features preserves instance-level discriminative information while accounting for the semantic inconsistencies caused by distance measures at classification boundaries. Pseudo-label annotation based on meta-features, pseudo-label construction, and classifier optimization form a dynamic optimization process. Specifically, we optimize the classifier by minimizing the cross-entropy loss, and the optimized classifier then participates in selecting confident samples, constructing meta-features, and assigning pseudo-labels in the next iteration. Through multiple iterations, the classifier learns a good mapping from features to pseudo-labels.

Label smoothing

We minimize the cross-entropy loss to optimize the classification network. At the same time, considering the noise inevitably introduced during pseudo-label annotation, we apply label smoothing to keep the model from becoming overconfident in noisy predictions. Label smoothing produces soft labels by adding uniform noise, which improves prediction calibration. Given a sample and its corresponding pseudo-label \((x,l_{k})\in \chi \), we inject noise into all classes as follows:

$$\begin{aligned} \tilde{l_{k}}=(1- \alpha )\times l_{k}+\frac{\alpha }{C-1}(1-l_{k}) \end{aligned}$$
(7)

where \(\alpha \) denotes the label smoothing parameter and \(p_{x}\) is obtained by applying the softmax function to the logit vector z output from the penultimate layer of the model:

$$\begin{aligned} {{p}_{x}}= & {} \frac{\exp ({{z}_{x}})}{\sum \nolimits _{j}^{C}{\exp ({{z}_{j}})}} \end{aligned}$$
(8)
$$\begin{aligned} L= & {} \frac{1}{|\chi |}\sum _{x,{\tilde{l}}_{k}\in \chi }H({\tilde{l}}_{k},p_{x}) \end{aligned}$$
(9)

We use the loss function L to force each image and its neighbors to be classified together; it maximizes their dot product after the softmax, driving the network to produce consistent and discriminative predictions. Without smoothing, the model is encouraged to predict the target class with probability close to 1 and the non-target classes with probability close to 0; the logit of the target class would grow without bound, pushing the model to keep enlarging the gap between the logits of correct and incorrect labels. However, excessive logit differences lead to overconfident, poorly adaptable predictions. Label smoothing reduces the gap between the positive and negative output values predicted by the model and improves its robustness.
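A minimal sketch of the label-smoothed cross-entropy objective in Eqs. (7)-(9) is shown below; it assumes hard pseudo-labels \(l_{k}\) encoded as class indices, and the helper name smoothed_ce_loss is ours.

```python
# Minimal sketch of the label-smoothed cross-entropy loss (Eqs. 7-9).
import torch
import torch.nn.functional as F

def smoothed_ce_loss(logits, pseudo_labels, alpha=0.1):
    """logits: (N, C) classifier outputs z; pseudo_labels: (N,) class indices l_k."""
    C = logits.size(1)
    one_hot = F.one_hot(pseudo_labels, C).float()
    soft = (1.0 - alpha) * one_hot + (alpha / (C - 1)) * (1.0 - one_hot)   # Eq. (7): smoothed targets
    log_p = F.log_softmax(logits, dim=1)                                   # log of Eq. (8)
    return -(soft * log_p).sum(dim=1).mean()                               # Eq. (9): mean cross-entropy
```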

Algorithm 1

Experiments

Datasets

In our experiments, we evaluate the proposed method on six datasets widely used for deep clustering: five small datasets (CIFAR-10, CIFAR-100, STL-10, Tiny-ImageNet, and ImageNet-10) and one large-scale dataset (ImageNet-1K).

CIFAR-10/100: Natural image datasets with 50,000 training samples and 10,000 test samples each, where CIFAR-10 contains 10 classes and CIFAR-100 contains 100 classes.

STL-10: A dataset derived from ImageNet containing 500/800 training/test images per class from 10 classes, plus an additional 100,000 unlabeled samples from several unknown classes.

Tiny-ImageNet: A very challenging dataset with 200 classes, containing 100,000/10,000 training/test images.

ImageNet-10: A subset of ImageNet, this dataset contains 10 randomly selected subclasses with a total of 13,000 images.

ImageNet-1K: A large-scale hand-annotated dataset containing more than 1.2 million images in 1,000 classes.

Table 1 Summary of datasets used for evaluation

Brief statistics of the six datasets are summarized in Table 1. To verify the effectiveness of our method, we use the following clustering setup: for CIFAR-10/100 we train and test on the entire dataset, and for CIFAR-100 we use the 20 superclasses as ground-truth labels. We cluster STL-10 using its 13,000 labeled images. For ImageNet-10, we use the 13,000 images in its training set for training and testing. In addition, we test the clustering performance of our method on Tiny-ImageNet using the 100,000 images in its training set. Finally, we test the performance on the large-scale dataset ImageNet-1K; to compare with existing methods, we utilize 1,281,167 images for training and 50,000 images for testing.

Table 2 The clustering performance on three challenging object image benchmarks after the clustering (\(\star \)) step

Evaluation metrics

We use three widely used clustering performance metrics to evaluate our method: normalized mutual information (NMI), accuracy (ACC), and adjusted Rand index (ARI). All these metrics range from 0 to 1, and higher values indicate better clustering performance.
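As a usage note, NMI and ARI can be computed directly with scikit-learn (a convenience we assume here, not part of the paper); ACC additionally requires the Hungarian matching described in the next subsection.

```python
# Minimal sketch for computing NMI and ARI with scikit-learn.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(y_true, y_pred):
    return {
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```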

Experimental setup

Our method is implemented in PyTorch 1.6.0, and we use 4 NVIDIA GeForce GTX 1080Ti 11G graphics cards for training and testing on Ubuntu 20.04. ResNet18 or ResNet34 serves as the backbone of the feature extraction network. The entropy term weight is set to 5. After the pre-trained model is trained for 1200 epochs, its parameters are frozen and used for feature extraction. The clustering head is trained for 200 epochs. The confidence ratio \( \tau \) is set to 0.6 and the label smoothing parameter \(\alpha \) is set to 0.1. For evaluating the class assignment, the Hungarian method is used to find the best one-to-one mapping between predictions and ground-truth labels.
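For the Hungarian step, a minimal sketch (our own, using scipy's linear_sum_assignment rather than the authors' evaluation code) is given below: it builds the co-occurrence matrix between predicted clusters and ground-truth labels, finds the best one-to-one mapping, and reports ACC.

```python
# Minimal sketch of Hungarian matching for clustering accuracy (ACC).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, num_classes):
    cost = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                   # count how often cluster p carries label t
    row, col = linear_sum_assignment(cost.max() - cost)   # best bijection maximizing total matches
    mapping = dict(zip(row, col))
    remapped = np.array([mapping[p] for p in y_pred])
    return float((remapped == np.array(y_true)).mean())
```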

Comparison with the existing SOTA methods

Comparison on small datasets

We evaluate our method on three challenging image benchmarks and compare it with 21 representative state-of-the-art clustering methods, including traditional methods (such as K-means [1], SC [2], AC [31], and NMF [32]), deep learning methods (such as AE [33], DAE [34], DCGAN [35], DeCNN [36], VAE [37], JULE [17], DEC [16], DAC [4], ADC [38], DDC [18], DCCM [5], CC [8], IIC [6], PICA [7], DCDC [30], and GCC [9]), and the latest method based on pre-trained features (SCAN [10]). For these methods, we directly adopt the best results reported in the corresponding literature. We report the clustering results with ResNet18 or ResNet34 as the backbone and bold the top-2 results among the compared methods. For a fair comparison, we use ResNet18 as the backbone when comparing with other methods.

The experimental results are shown in Table 2. Benefiting from our proposed global pseudo-label assignment strategy based on meta-features and the pseudo-label optimization with label smoothing, our method outperforms most baselines. Compared with traditional methods, our method has an overwhelming advantage. For example, compared to AC [31], ACC improves by 61.9% on CIFAR-10 (84.7% vs. 22.8%). On CIFAR-100, ACC improves by 33.5% (47.3% vs. 13.8%). On STL-10, ACC improves by 53.1% (86.3% vs. 33.2%). On ImageNet-10, ACC improves by 67.6% (91.8% vs. 24.2%). On Tiny-ImageNet, ACC improves by 19.8% (22.5% vs. 2.7%). Our method also has significant advantages over deep learning methods. For example, compared to CC [8], our method improves ACC by 5.7% on CIFAR-10 (84.7% vs. 79%). On CIFAR-100, ACC improves by 4.4% (47.3% vs. 42.9%). On STL-10, ACC improves by 1.3% (86.3% vs. 85%). On ImageNet-10, ACC increases by 2.5% (91.8% vs. 89.3%). On Tiny-ImageNet, ACC improves by 8.5% (22.5% vs. 14%). We also achieve better results than the pre-trained instance-level nearest-neighbor clustering method SCAN. Specifically, compared to SCAN, our method improves ACC by 3.1% on CIFAR-10 (74.1% vs. 71.5%). On CIFAR-100, ACC improves by 3.3% (47.3% vs. 44%). On STL-10, ACC improves by 7.1% (86.3% vs. 79.2%).

It is worth noting that GCC [9] achieves good clustering results by fully considering both instance-level and cluster-level semantics. The experimental results show that our method outperforms GCC on CIFAR-100, STL-10, ImageNet-10, and Tiny-ImageNet. Specifically, on CIFAR-100, our method improves by 0.1% (47.3% vs. 47.2%). On STL-10, it improves by 7.5% (86.3% vs. 78.8%). On ImageNet-10, it improves by 1.1% (91.2% vs. 90.1%). On Tiny-ImageNet, it improves by 8.7% (22.5% vs. 13.8%). On the low-resolution dataset CIFAR-10, our method is close to GCC (84.7% vs. 85.6%) but still better than the other methods. Evidently, our method generally performs better on high-resolution images than on low-resolution ones. In section “Reliance on backbone network”, we further report the effect of using different backbones.

Comparison on large scale dataset ImageNet-1K

To demonstrate the clustering performance of our method on large-scale datasets, we perform experiments on ImageNet-1K. For a fair comparison, the same pre-trained MoCov2 [12] weights as in SCAN are used. We adopt ResNet50 as our backbone and conduct experiments on a Tesla A100 80G graphics card. It should be noted that the compared results are taken from the SCAN report, where SimCLR [13] denotes the result obtained after fine-tuning with 1% of the labeled data. The experimental results are shown in Table 3, where we report the performance of different methods in fully supervised, semi-supervised, and unsupervised settings. As an unsupervised method, ours far outperforms the supervised result reported in the comparison (41.3% vs. 25.4%). In particular, our accuracy improves by 1.4% over SCAN (41.3% vs. 39.9%). This shows the superiority of our method on large-scale datasets.

Table 3 The clustering performance on the large dataset ImageNet-1K

Empirical analysis

In this section, we conduct several qualitative studies to visually analyze class confusion matrices, confidence samples, and heat maps.

Confusion matrix

As shown in Fig. 3, we visualize the confusion matrices of the three datasets. The confusion matrices all have a clear diagonal structure, which shows that our method successfully divides the samples into different categories according to semantics. The confusion of our model mainly occurs between classes that are hard to distinguish in reality, such as cats and dogs. There are two reasons for the weaker classification on CIFAR-10 and CIFAR-100. On the one hand, the images are too small and too blurry to distinguish. On the other hand, the network cannot clarify the nuances between different categories. In particular, on CIFAR-100 it is extremely difficult for the model to obtain accurate semantic discriminative features when the 100 classes are grouped into 20 superclasses.

Fig. 3
figure 3

Confusion matrices of three datasets. From left to right are CIFAR-10, CIFAR-100, and STL-10. The row names are the predicted class labels, and the column names are the ground-truths

Image semantic visualization

As shown in Figs. 4 and 5, after completing the model training, we visualize the confident samples of STL-10 involved in constructing meta-features. At the same time, we use the highlighted areas of the heat map to indicate the locations of the semantic features noticed by the model. After the maximum matching between the clustering result and the ground truth, we find that the confident samples involved in constructing the meta-features match the manual annotations exactly. For example, after maximum matching, the “horse” cluster successfully captures the horse class, with the most discriminative regions in the images concentrated on different parts of the horse. The visualization results show that the confident samples selected by the model are semantically correct and all focus on the semantics of the actual labels, providing a solid basis for meta-features to assign neighbor labels.

Fig. 4
figure 4

Confidence sample visualization of the top-3. We visualize the confidence samples involved in constructing meta-features in STL-10. It is worth noting that the confidence samples attend to the correct semantics, which guarantees that the constructed meta-features have stable semantics

Fig. 5
figure 5

Image semantic visualization. We visualize the top-3 images which are used to calculate meta-features on STL-10 and we use a heat map to visualize the location of attention

Fig. 6
figure 6

Effect of cross-entropy loss with label smoothing on CIFAR-10, CIFAR-100 and STL-10. ResNet34 is our backbone. The blue bars represent the classification results directly using cross-entropy loss, and the green bars represent the classification results after using label smoothing

Fig. 7
figure 7

Effect of data augmentation on STL-10. We adopt ResNet18 as the backbone and perform the experiments on STL-10. The left figure shows the effect of our method using two data augmentation strategies, and the right figure shows the effect of applying two data augmentation strategies to SCAN

Ablation study

Several ablation studies are conducted in this section to demonstrate the effect of different settings on our proposed approach.

Effect of label smoothing

We separately evaluate the effect of the plain cross-entropy loss and the cross-entropy loss with label smoothing on the classification results of three datasets and report the results in Fig. 6.

As shown in Fig. 6, our model trained with the label-smoothed cross-entropy loss outperforms the model trained with the plain cross-entropy loss. On CIFAR-10, accuracy increases by 1.2%, NMI by 0.5%, and ARI by 0.6%. On CIFAR-100, accuracy increases by 0.7%, NMI by 0.5%, and ARI by 0.6%. On STL-10, accuracy increases by 1.4%, NMI by 1.8%, and ARI by 2.1%. By adding a moderate amount of noise, the label-smoothed cross-entropy loss reduces overfitting and enhances generalization. The experimental results verify the effectiveness of the label smoothing strategy we use.

Effect of data augmentation

As shown in Fig. 7, to explore the impact of different data augmentations on our method, we conduct an ablation study on STL-10 using ResNet18 as the backbone. We adopt the weak augmentation strategy of FixMatch [40] for weak augmentation, and use the same settings as SCAN for strong augmentation. Specifically, an image is strongly augmented by composing Cutout [41] with four randomly selected transformations from RandAugment [42]. Since SCAN employs a similar data augmentation approach, we also test the impact of the different augmentation schemes on SCAN. In Fig. 7, the left panel shows the performance of our approach with the two augmentation schemes, and the right panel shows the corresponding performance of SCAN.

Overall, the effect of different augmentations on the clustering results of the two methods is small, because both methods build on pre-trained models that are already transformation invariant. However, a detailed comparison shows that our method performs slightly better with weak augmentation than with strong augmentation (86.2% vs. 85.7%). In contrast, on SCAN, strong augmentation outperforms weak augmentation (81.2% vs. 78.4%). This is related to the way pseudo-labels are constructed and assigned: SCAN assigns pseudo-labels by finding the nearest neighbors of each instance, so strong augmentation is needed to ensure that the model learns strong consistency. Our stable-semantics design, by contrast, does not rely on strong augmentation. Firstly, the labels assigned with meta-features as centers are more robust. Secondly, assigning pseudo-labels globally around meta-features reduces the reliance on strong consistency compared to assigning labels to each instance's nearest neighbors. Finally, strong augmentation may lead to larger errors in the classifier's predictions.
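For reference, the sketch below shows how the two pipelines might look in torchvision; the exact crop size and operation counts are assumptions on our part, and RandomErasing is used as a stand-in for Cutout, so this is illustrative rather than the authors' configuration.

```python
# Minimal sketch of weak vs. strong augmentation pipelines (illustrative settings only).
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(96),        # STL-10 resolution; crop-and-flip as in FixMatch's weak branch
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.RandAugment(num_ops=4),       # four randomly selected transformations
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0),         # Cutout-like random occlusion
])
```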

Fig. 8
figure 8

Comparison of the clustering performance for different values of the hyperparameter \(\tau \). For both CIFAR-10 and CIFAR-100 datasets, we use the backbone of ResNet18. The primary axis represents CIFAR-100 and the secondary axis represents CIFAR-10

Effect of the hyperparameter \(\tau \)

In this paper, \(\tau \) is an important hyperparameter for selecting high-confidence samples. To choose an appropriate \(\tau \) for building meta-features, we perform experiments on CIFAR-10 and CIFAR-100 with ResNet18 as the backbone. As shown in Fig. 8, we select 10 values of \(\tau \), covering low to high confidence, for each of the two datasets. The performance trends on the two datasets are basically the same. When \(\tau \) is between 0.1 and 0.5, the clustering accuracy increases with \(\tau \): at low confidence the constructed meta-features cannot represent the semantic features of the category, resulting in errors in label assignment, and as \(\tau \) grows the semantics of the meta-features improve, so the clustering improves as well. When \(\tau \) is between 0.5 and 0.8, the samples participating in meta-feature construction both have high confidence and are numerous enough to ensure the semantic representation of the category, and the clustering performance reaches a relatively stable state. When \(\tau \) is greater than 0.8, although the remaining samples have high confidence, their reduced number makes the meta-features semantically unstable, which in turn degrades the clustering performance.

Reliance on backbone network

To examine how much our model relies on the backbone network, we test two ResNets of different depths and report the results in Table 2, where (*) indicates ResNet34 as the backbone and (**) indicates ResNet18 as the backbone. The comparison shows that the representation ability of the backbone network contributes to the clustering performance. On datasets with smaller image sizes, such as CIFAR-10 and CIFAR-100, ResNet18 outperforms ResNet34 as the backbone: ResNet18 already has sufficient capacity to extract discriminative features from small images, and a deeper network is prone to overfitting. On datasets with larger image sizes, such as STL-10, ImageNet-10, and Tiny-ImageNet, ResNet34, with its stronger representation ability, performs better than ResNet18.

Conclusion

Unstable semantic pseudo-label assignment severely limits the performance of image clustering. In this paper, we first propose the concept of “meta-features”, which are features with stable semantic information, and combine discriminative instance features with class semantic features to achieve deep clustering with stable semantics. We further propose a pseudo-supervised clustering algorithm based on meta-features. The framework uses these stable features to assign pseudo-labels while optimizing meta-features and pseudo-labels in a pseudo-supervised manner, achieving a direct mapping from features to stable semantic labels. Experiments demonstrate that our proposed meta-feature-based deep clustering significantly improves classification accuracy. Furthermore, our method offers a path toward stable semantic self-learning, and we plan to extend it to other tasks and applications in future work.