Introduction

To reduce reliance on large amounts of labeled data, clustering has once again attracted attention. Traditional clustering methods, such as K-means [1] and spectral clustering [2], have been widely used in various tasks. However, they are limited to low-dimensional data and cannot effectively handle high-dimensional data such as images and videos. To address this problem, deep clustering [3] was proposed, which maps raw data into a nonlinear embedding space to obtain distinguishable features for clustering. In addition, DAC [4], DCCM [5], and IIC [6] propose different ways to learn discriminative label features under different constraints, and PICA [7] learns the most plausible semantic clustering solution by partition confidence maximization. Although these methods achieve good classification results, the learned features still lack stable semantic information.

To obtain more stable semantics for clustering, CC [8] adopts both instance-level and cluster-level contrastive learning to achieve good semantic clustering. GCC [9] further employs KNN graphs built over the encoded image features and constructs a contrastive loss from the perspective of graph nodes, achieving good semantic clustering with a lightweight network. Although both methods consider cluster-level information, their instance-level and cluster-level features are optimized separately with different loss functions, which makes it difficult to reach the optimal classification result. Different from the above work, as shown in Fig. 1a, SCAN [10] mines the nearest neighbors of each instance in the embedding space, assigns the same pseudo-labels to adjacent samples, and uses them to train the classifier, achieving sound performance.

Fig. 1
figure 1

The motivation of the proposed nearest-neighbor clustering based on meta-features. The clustering in Fig. 1a is based on instance neighbors, where the nearest-neighbor samples of each instance are given the same pseudo-label and used to train the classifier. Although this approach has achieved great success, it struggles to obtain stable semantic features and is prone to error accumulation due to misclassification at the boundaries. As shown in Fig. 1b, to solve these problems, we propose global nearest-neighbor clustering based on meta-features. We construct semantically stable meta-features from confident samples within a batch, assign the same pseudo-labels to the global nearest neighbors of the meta-features, and use them to train the classifier. Notably, we further employ label smoothing to handle the noise introduced by the pseudo-labels and avoid overconfidence in the model

However, samples on the clustering boundary may be assigned to the wrong class owing to the semantic inconsistency of neighboring samples, and repeatedly training on these incorrect labels leads to error accumulation in the classifier. In light of these defects, we propose a pseudo-supervised clustering method based on meta-features. The method not only considers instance-level semantics but also abstracts stable semantic features (i.e., meta-features) and propagates semantic information to global nearest-neighbor samples in the form of pseudo-labels. Specifically, as shown in Fig. 1b, we first employ a pre-trained contrastive learning model to obtain instance features. Next, we abstract meta-features that represent the categories to provide more stable semantic features for clustering. Afterward, we assign the same pseudo-label to the neighboring samples in a batch with each meta-feature as the center, which effectively avoids the possible semantic inconsistency among the neighbors of an instance. Finally, we correct the pseudo-labels with a cross-entropy loss with label smoothing, which effectively alleviates error accumulation. Compared with previous methods, our method not only considers category semantic information but also effectively solves the problem of error accumulation caused by the inconsistency of instance neighbors. Furthermore, compared with CC and GCC, our method accomplishes a direct mapping from features to stable semantic labels, avoiding the suboptimal solutions of multi-level optimization.

The main contributions of this paper are as follows:

  1.

    To ensure the semantic stability of the clustering labels, we propose a method that assigns pseudo-labels to the global nearest neighbors of meta-features. The meta-features consider both instance-level and category-level semantics, and pseudo-labels are then assigned to their global nearest neighbors to improve the semantic stability of clustering.

  2.

    Our method optimizes pseudo-labels directly with a cross-entropy loss with label smoothing, which effectively avoids the sub-optimal solutions that result from optimizing multiple loss functions.

  3.

    Extensive experimental results show that the proposed pseudo-supervised clustering framework based on meta-features achieves excellent performance.

The remainder of this paper is organized as follows. The next section introduces related work and compares it with our method. The following section details each component of our model and how it works. The subsequent section presents the datasets we adopt and the details of our experiments. Finally, we summarize our work.

Related work

Unsupervised representation learning model

In recent years, contrastive learning has made significant progress and can capture discriminative features without any manual annotation. For example, MoCo [11] formulates contrastive learning as dictionary look-up and builds a dynamic dictionary with a queue and a moving-averaged encoder. MoCov2 [12] improves MoCo with an MLP projection head and more data augmentation. SimCLR [13] simplifies recently proposed contrastive self-supervised learning algorithms without requiring a dedicated architecture or a memory bank. SimCLRv2 [14] shows that larger self-supervised models are more label-efficient. SwAV [15] does not directly compare two augmented views but uses one view to predict the code assigned to the other view from a set of learnable prototypes. Although these methods learn outstanding feature representations, the instance-level information is not stable enough for clustering. Mining and applying stable properties for good clustering assignments remains a challenge.

Deep clustering

Deep learning enables large-scale deep feature extraction through its hierarchical nonlinear mapping capability. As a result, deep clustering algorithms have become a research hotspot in unsupervised learning. For example, DEC [16] applies K-means [1] to pre-trained image features to initialize the cluster centers and then fine-tunes the model by learning from confident cluster assignments to sharpen the predicted distribution. JULE [17] jointly optimizes the CNN in a recurrent manner, where merging operations of agglomerative clustering are conducted in the forward pass and representation learning is performed in the backward pass. DAC [4], DDC [18], and DCCM [5] alternately update the cluster assignments and the similarities between samples during training. However, they are susceptible to inconsistent estimations in the neighborhoods and thus suffer from severe error propagation during training.

There are also works that combine deep representation learning [19,20,21,22,23,24] with traditional clustering methods [25,26,27,28,29]. Most recent research is based on clustering assignments that maximize mutual information. For example, IIC [6] learns cluster assignments by maximizing the mutual information between images and their augmentations. DCDC [30] performs contrastive learning over both sample and class views to obtain better representations. However, such clustering methods rely on initialization and are likely to latch onto low-level features. SCAN [10] first proposes a two-stage scheme: in the first stage, it finds the nearest neighbors of each feature via contrastive self-supervised learning; in the second stage, it forces adjacent features to have similar probability distributions. In this process, the same pseudo-labels are assigned to the neighboring embedding features of each instance. While this approach has achieved great success, it still has the following disadvantages. As depicted in Fig. 1a, on the one hand, instance-based clustering lacks stable semantics; on the other hand, it causes error accumulation. From the analysis above, we find that current deep clustering methods have two shortcomings: one is the lack of stable semantic supervision signals, and the other is that previous clustering frameworks perform indirect optimization, which leads to sub-optimal solutions. Unlike these methods, we use pseudo-supervised optimization to construct stable semantic pseudo-labels in the second clustering stage. Our approach effectively reduces error accumulation at classification boundaries and improves clustering accuracy.

Method

We present our approach in the following sections. Firstly, we pre-train a contrastive representation learning model to obtain features. After acquiring the features, we compute meta-features and construct a pseudo-label optimization network to achieve semantic clustering. The detailed procedure of our method is shown in Fig. 2 and Algorithm 1.

Feature extraction network

Considering that some previous end-to-end deep clustering methods are sensitive to network initialization, we utilize MoCo [11] to obtain instance semantic information. MoCo replaces the memory bank with a momentum encoder and a dynamically updated queue q that stores the features of previous iterations; we leverage the pre-trained model to extract instance-level features.
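As a rough illustration of the momentum encoder and queue mentioned above, the following sketch (our simplification of the standard MoCo recipe, not the authors' code) shows how the key encoder can track the query encoder as a moving average and how the queue q can be refreshed with the newest keys; the function and variable names are ours.

```python
# Minimal sketch of MoCo-style momentum update and queue maintenance (not the authors' code).
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder tracks the query encoder as an exponential moving average.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # queue: (D, M) feature buffer; queue_ptr: 1-element long tensor; keys: (B, D) newest keys.
    # Assumes the queue size M is divisible by the batch size B, as in MoCo.
    batch_size = keys.shape[0]
    ptr = int(queue_ptr)
    queue[:, ptr:ptr + batch_size] = keys.T            # overwrite the oldest entries
    queue_ptr[0] = (ptr + batch_size) % queue.shape[1]
```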

Fig. 2
figure 2

Framework of our proposed method. The framework consists of a feature extraction network and a label optimization network based on meta-features. The feature extraction network is used to acquire instance features, and the meta-feature-based label optimization network maps features to labels. Specifically, we obtain the instance features, find the nearest neighbors based on feature similarity, and take the feature-weighted average of confident samples as meta-features. Meanwhile, we find the global nearest neighbors of the meta-features and assign the same pseudo-labels to the meta-features and their global neighboring samples. Finally, we utilize the pseudo-labels to train the classifier. After each iteration, the classifier learns more semantic information than before

We define the samples and their augmented versions as s and \({s}'\), and the corresponding encoded features as u and \({u}'\), respectively. Following MoCo, the contrastive loss takes the form:

$$\begin{aligned} L_\textrm{moco}=-\log \frac{\exp (sim({{u}_{i}},{{u}_{i}}^{'})/\theta )}{\sum \nolimits _{j=1}^{M}{\exp (sim({{u}_{i}},{{k}_{j}})/\theta )}} \end{aligned}$$
(1)

where \(sim(\cdot ,\cdot )\) is a similarity function (e.g., cosine similarity), \(\theta \) is an adjustable temperature parameter, and M is the queue size. The loss involves one positive sample and M negative samples. Since the queue q does not require gradients, M can be set large, which benefits feature extraction.
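For concreteness, a minimal sketch of the loss in Eq. (1) is given below. It assumes L2-normalized features so that the dot product equals cosine similarity and, as in the standard MoCo implementation, places the positive pair at index 0 of the logits (so the denominator also includes the positive term); the names moco_loss, u, u_pos, and queue are ours.

```python
# Minimal sketch of the contrastive loss in Eq. (1); variable names are ours.
import torch
import torch.nn.functional as F

def moco_loss(u, u_pos, queue, theta=0.07):
    """u, u_pos: (B, D) query/positive features; queue: (D, M) negative keys from q."""
    u, u_pos = F.normalize(u, dim=1), F.normalize(u_pos, dim=1)
    l_pos = torch.einsum("bd,bd->b", u, u_pos).unsqueeze(-1)   # similarity to the positive, (B, 1)
    l_neg = torch.einsum("bd,dm->bm", u, queue)                # similarities to M negatives, (B, M)
    logits = torch.cat([l_pos, l_neg], dim=1) / theta          # temperature scaling
    labels = torch.zeros(u.size(0), dtype=torch.long, device=u.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```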

Pseudo-label optimization network

Inspired by “label as representation” [8], we regard image classification as the process of distinguishing samples by assigning corresponding semantic labels. In this work, we adopt a method similar to supervised learning that directly maps features to stable semantic labels. Precisely, the pseudo-label optimization network consists of pseudo-label annotation based on meta-features, classifier optimization, and label smoothing.

Pseudo-label annotation based on meta-features

Our task is to divide the dataset \(\chi \) into S pre-defined classes without any manual labeling. Specifically, the pre-trained contrastive learning network serves as our feature extraction network \(f(\centerdot )\). The input consists of two parts: the raw data x and the augmented data R(x). Feeding x to the network \(f(\centerdot )\) yields the embedded feature f(x); likewise, feeding R(x) yields f(R(x)). Afterwards, f(R(x)) is passed to the classifier, which is composed of two MLP layers and outputs the predicted probability \(P_{k}\). Finally, we select the K features with the highest confidence within each semantic class.

$$\begin{aligned} P_{k}= & {} \Phi c(f(R(x))) \end{aligned}$$
(2)
$$\begin{aligned} K= & {} \tau \times n \end{aligned}$$
(3)
$$\begin{aligned} C_{k}= & {} \textrm{top}K(P_{ki},f(x)) \end{aligned}$$
(4)

where \(k\in [1,2,\ldots ,S]\), \(i\in [1,2,\ldots ,K]\), \(\tau \) is the confidence ratio, and n represents the number of images of a certain class. In the embedding space, we select the K embedding features of a class with the highest probability based on the cosine similarity measure. We regard these K samples as confident samples, and the weighted average of their embedding features is taken as the meta-feature \(f_\textrm{Metak}\) of the class,

$$\begin{aligned} f_\textrm{Metak}=\frac{1}{K}\sum \limits _{i=1}^{K}{C_{i}} \end{aligned}$$
(5)

where \(f_\textrm{Metak}\) preserves the class semantics of the meta-feature. After obtaining the meta-features, we search for neighboring embedding features and further assign the same pseudo-labels to these neighbors.

$$\begin{aligned} l_{k}=N_{t}(f_\textrm{Metak}) \end{aligned}$$
(6)

where \(N_{t}(\centerdot )\) denotes the global nearest neighbors of a sample and \(l_{k}\) is the pseudo-label assigned to the neighbors of the meta-feature. Note that, to reduce computation, the nearest-neighbor samples mentioned here are restricted to samples within a batch. The number of neighbors m in the global neighborhood is set to \(m = N/C\), where N is the number of samples within a batch and C is the number of classes.
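The following sketch summarizes our reading of Eqs. (2)-(6) on a single batch: select the K most confident samples per class, average their features into a meta-feature, and assign that class's pseudo-label to the m = N/C global nearest neighbors. It is a simplified illustration under the assumption that n in Eq. (3) corresponds to the per-class share of the batch; names such as assign_pseudo_labels are ours, not the authors'.

```python
# Minimal sketch of meta-feature construction and pseudo-label assignment (Eqs. 2-6).
import torch
import torch.nn.functional as F

def assign_pseudo_labels(feats, probs, num_classes, tau=0.6):
    """feats: (N, D) embedded features f(x); probs: (N, C) classifier outputs P_k."""
    N = feats.size(0)
    K = max(1, int(tau * N / num_classes))   # confident samples per class (Eq. 3), assuming n ~ N/C per batch
    m = N // num_classes                     # global neighbors per meta-feature, m = N / C
    pseudo = torch.full((N,), -1, dtype=torch.long)
    feats_n = F.normalize(feats, dim=1)      # cosine similarity via normalized dot products
    for k in range(num_classes):
        confident = probs[:, k].topk(K).indices          # K most confident samples of class k (Eq. 4)
        meta_k = feats_n[confident].mean(dim=0)          # meta-feature of class k (Eq. 5)
        sims = feats_n @ F.normalize(meta_k, dim=0)      # similarity of every sample to the meta-feature
        neighbors = sims.topk(m).indices                 # global nearest neighbors in the batch (Eq. 6)
        pseudo[neighbors] = k                            # propagate the class pseudo-label (later classes may overwrite overlaps in this simplification)
    return pseudo
```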

Classifier optimization

The pseudo-label optimization network is designed to train a classifier that maps embedding features to pseudo-labels. This differs from previous work that mines instance-level semantic similarity or focuses directly on clustering. Pseudo-supervised classification based on meta-features preserves instance-level discriminative information while accounting for the semantic inconsistencies caused by distance measures at classification boundaries. Pseudo-label annotation based on meta-features, pseudo-label construction, and classifier optimization form a dynamic optimization process. Specifically, we optimize the classifier by minimizing the cross-entropy loss, and the optimized classifier then participates in selecting confident samples, constructing meta-features, and assigning pseudo-labels in the next iteration. Through multiple iterations, the classifier learns a good mapping from features to pseudo-labels.

Label smoothing

We minimize the cross-entropy loss to optimize the classification network. At the same time, considering the noise inevitably introduced during pseudo-label annotation, we apply label smoothing to keep the model from becoming overconfident in noisy predictions. Label smoothing produces soft labels by adding uniform noise, which improves prediction calibration. Given a sample and its corresponding pseudo-label \((x,l_{k})\in \chi \), we inject noise into all classes as follows:

$$\begin{aligned} \tilde{l_{k}}=(1- \alpha )\times l_{k}+\frac{\alpha }{C-1}(1-l_{k}) \end{aligned}$$
(7)

where \(\alpha \) denotes the label smoothing parameter and \(p_{x}\) is obtained by applying the softmax function to the logit vector z output from the penultimate layer of the model:

$$\begin{aligned} {{p}_{x}}= & {} \frac{\exp ({{z}_{x}})}{\sum \nolimits _{j}^{C}{\exp ({{z}_{j}})}} \end{aligned}$$
(8)
$$\begin{aligned} L= & {} \frac{1}{|\chi |}\sum _{x,{\tilde{l}}_{k}\in \chi }H({\tilde{l}}_{k},p_{x}) \end{aligned}$$
(9)

We use the loss function L to force each image and its neighbors to be classified together; it maximizes their dot product after the softmax, driving the network to produce consistent and discriminative predictions. Without smoothing, the model is encouraged to predict the target class with probability close to 1 and the non-target classes with probability close to 0; the logit of the target class would grow without bound, pushing the model to keep enlarging the gap between the logits of correct and incorrect labels. However, excessive logit differences lead to overconfident, poorly adaptable predictions. Label smoothing reduces the gap between the positive and negative output values predicted by the model and improves its robustness.
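A minimal sketch of the label-smoothed cross-entropy objective in Eqs. (7)-(9) is shown below; it assumes hard pseudo-labels \(l_{k}\) encoded as class indices, and the helper name smoothed_ce_loss is ours.

```python
# Minimal sketch of the label-smoothed cross-entropy loss (Eqs. 7-9).
import torch
import torch.nn.functional as F

def smoothed_ce_loss(logits, pseudo_labels, alpha=0.1):
    """logits: (N, C) classifier outputs z; pseudo_labels: (N,) class indices l_k."""
    C = logits.size(1)
    one_hot = F.one_hot(pseudo_labels, C).float()
    soft = (1.0 - alpha) * one_hot + (alpha / (C - 1)) * (1.0 - one_hot)   # Eq. (7): smoothed targets
    log_p = F.log_softmax(logits, dim=1)                                   # log of Eq. (8)
    return -(soft * log_p).sum(dim=1).mean()                               # Eq. (9): mean cross-entropy
```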

Algorithm 1

Experiments

Datasets

In our experiments, we evaluate the proposed method on six datasets widely used for deep clustering: five small datasets (CIFAR-10, CIFAR-100, STL-10, Tiny-ImageNet, and ImageNet-10) and one large-scale dataset (ImageNet-1K).

CIFAR-10/100: Natural image datasets with 50,000 training samples and 10,000 test samples each, where CIFAR-10 contains 10 classes and CIFAR-100 contains 100 classes.

STL-10: A dataset derived from ImageNet containing 500/800 training/test images per class from 10 classes, plus an additional 100,000 unlabeled samples from several unknown classes.

Tiny-ImageNet: A very challenging dataset with 200 classes, containing 100,000/10,000 training/test images.

ImageNet-10: A subset of ImageNet, this dataset contains 10 randomly selected subclasses with a total of 13,000 images.

ImageNet-1K: A large-scale hand-annotated dataset containing more than 1.2 million images in 1,000 classes.

Table 1 Summary of datasets used for evaluation

Brief statistics of the six datasets are summarized in Table 1. To verify the effectiveness of our method, we use the following clustering setup: for CIFAR-10/100 we train and test on the entire dataset, and for CIFAR-100 we use the 20 superclasses as ground-truth labels. We cluster STL-10 using its 13,000 labeled images. For ImageNet-10, we use the 13,000 images in its training set for training and testing. In addition, we test the clustering performance of our method on Tiny-ImageNet using the 100,000 images in its training set. Finally, we test the performance on the large-scale dataset ImageNet-1K; to compare with existing methods, we utilize 1,281,167 images for training and 50,000 images for testing.

Table 2 The clustering performance on three challenging object image benchmarks after the clustering (\(\star \)) step

Evaluation metrics

We use three widely used clustering performance metrics to evaluate our method: normalized mutual information (NMI), accuracy (ACC), and adjusted Rand index (ARI). All these metrics range from 0 to 1, and higher values indicate better clustering performance.
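As a usage note, NMI and ARI can be computed directly with scikit-learn (a convenience we assume here, not part of the paper); ACC additionally requires the Hungarian matching described in the next subsection.

```python
# Minimal sketch for computing NMI and ARI with scikit-learn.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(y_true, y_pred):
    return {
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```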

Experimental setup

Our method is implemented in PyTorch 1.6.0, and we use 4 NVIDIA GeForce GTX 1080Ti 11G graphics cards for training and testing on Ubuntu 20.04. ResNet18 or ResNet34 serves as the backbone of the feature extraction network. The entropy term weight is set to 5. After the pre-trained model is trained for 1200 epochs, its parameters are frozen and used for feature extraction. The clustering head is trained for 200 epochs. The confidence ratio \( \tau \) is set to 0.6 and the label smoothing parameter \(\alpha \) is set to 0.1. For evaluating the class assignment, the Hungarian method is used to find the best one-to-one mapping between predictions and ground-truth labels.
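For the Hungarian step, a minimal sketch (our own, using scipy's linear_sum_assignment rather than the authors' evaluation code) is given below: it builds the co-occurrence matrix between predicted clusters and ground-truth labels, finds the best one-to-one mapping, and reports ACC.

```python
# Minimal sketch of Hungarian matching for clustering accuracy (ACC).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, num_classes):
    cost = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                   # count how often cluster p carries label t
    row, col = linear_sum_assignment(cost.max() - cost)   # best bijection maximizing total matches
    mapping = dict(zip(row, col))
    remapped = np.array([mapping[p] for p in y_pred])
    return float((remapped == np.array(y_true)).mean())
```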

Comparison with the existing SOTA methods

Comparison on small datasets

We evaluate our method on three challenging image benchmarks and compare it with 21 representative state-of-the-art clustering methods, including traditional methods (such as K-means [1], SC [2], AC [31], and NMF [32]), deep learning methods (such as AE [33], DAE [34], DCGAN [35], DeCNN [36], VAE [37], JULE [17], DEC [16], DAC [4], ADC [38], DDC [18], DCCM [5], CC [8], IIC [6], PICA [7], DCDC [30], and GCC [9]), and the latest method based on pre-trained features (SCAN [10]). For these methods, we directly adopt the best results reported in the corresponding literature. We report the clustering results with ResNet18 or ResNet34 as the backbone and bold the top-2 results among the compared methods. For a fair comparison, we use ResNet18 as the backbone when comparing with other methods.

The experimental results are shown in Table 2. Benefiting from our proposed global pseudo-label assignment strategy based on meta-features and the pseudo-label optimization with label smoothing, our method outperforms most baselines. Compared with traditional methods, our method has an overwhelming advantage. For example, compared to AC [31], ACC improves by 61.9% on CIFAR-10 (84.7% vs. 22.8%). On CIFAR-100, ACC improves by 33.5% (47.3% vs. 13.8%). On STL-10, ACC improves by 53.1% (86.3% vs. 33.2%). On ImageNet-10, ACC improves by 67.6% (91.8% vs. 24.2%). On Tiny-ImageNet, ACC improves by 19.8% (22.5% vs. 2.7%). Our method also has significant advantages over deep learning methods. For example, compared to CC [8], our method improves ACC by 5.7% on CIFAR-10 (84.7% vs. 79%). On CIFAR-100, ACC improves by 4.4% (47.3% vs. 42.9%). On STL-10, ACC improves by 1.3% (86.3% vs. 85%). On ImageNet-10, ACC increases by 2.5% (91.8% vs. 89.3%). On Tiny-ImageNet, ACC improves by 8.5% (22.5% vs. 14%). We also achieve better results than the pre-trained instance-level nearest-neighbor clustering method SCAN. Specifically, compared to SCAN, our method improves ACC by 3.1% on CIFAR-10 (74.1% vs. 71.5%). On CIFAR-100, ACC improves by 3.3% (47.3% vs. 44%). On STL-10, ACC improves by 7.1% (86.3% vs. 79.2%).

It is worth noting that GCC [9] achieves good clustering results by fully considering both instance-level and cluster-level semantics. The experimental results show that our method outperforms GCC on CIFAR-100, STL-10, ImageNet-10, and Tiny-ImageNet. Specifically, on CIFAR-100, our method improves by 0.1% (47.3% vs. 47.2%). On STL-10, it improves by 7.5% (86.3% vs. 78.8%). On ImageNet-10, it improves by 1.1% (91.2% vs. 90.1%). On Tiny-ImageNet, it improves by 8.7% (22.5% vs. 13.8%). On the low-resolution dataset CIFAR-10, our method is close to GCC (84.7% vs. 85.6%) but still better than the other methods. Evidently, our method generally performs better on high-resolution images than on low-resolution ones. In section “Reliance on backbone network”, we further report the effect of using different backbones.

Comparison on large scale dataset ImageNet-1K

To demonstrate the clustering performance of our method on large-scale datasets, we perform experiments on ImageNet-1K. For a fair comparison, the same pre-trained MoCov2 [12] weights as in SCAN are used. We adopt ResNet50 as our backbone and conduct experiments on a Tesla A100 80G graphics card. It should be noted that the compared results are taken from the SCAN report, where SimCLR [13] denotes the result obtained after fine-tuning with 1% of the labeled data. The experimental results are shown in Table 3, where we report the performance of different methods in fully supervised, semi-supervised, and unsupervised settings. As an unsupervised method, ours far outperforms the supervised result reported in the comparison (41.3% vs. 25.4%). In particular, our accuracy improves by 1.4% over SCAN (41.3% vs. 39.9%). This shows the superiority of our method on large-scale datasets.

Table 3 The clustering performance on the large dataset ImageNet-1K

Empirical analysis

In this section, we conduct several qualitative studies to visually analyze class confusion matrices, confidence samples, and heat maps.

Confusion matrix

As shown in Fig. 3, we visualize the confusion matrices of the three datasets. The confusion matrices all have a clear diagonal structure, which shows that our method successfully divides the samples into different categories according to semantics. The confusion of our model mainly occurs between classes that are hard to distinguish in reality, such as cats and dogs. There are two reasons for the weaker classification on CIFAR-10 and CIFAR-100. On the one hand, the images are too small and too blurry to distinguish. On the other hand, the network cannot clarify the nuances between different categories. In particular, on CIFAR-100 it is extremely difficult for the model to obtain accurate semantic discriminative features when the 100 classes are grouped into 20 superclasses.

Fig. 3
figure 3

Confusion matrices of three datasets. From left to right are CIFAR-10, CIFAR-100, and STL-10. The row names are the predicted class labels, and the column names are the ground-truths

Image semantic visualization

As shown in Figs. 4 and 5, after completing the model training, we visualize the confident samples of STL-10 involved in constructing meta-features. At the same time, we use the highlighted areas of the heat map to indicate the locations of the semantic features noticed by the model. After the maximum matching between the clustering result and the ground truth, we find that the confident samples involved in constructing the meta-features match the manual annotations exactly. For example, after maximum matching, the “horse” cluster successfully captures the horse class, with the most discriminative regions in the images concentrated on different parts of the horse. The visualization results show that the confident samples selected by the model are semantically correct and all focus on the semantics of the actual labels, providing a solid basis for meta-features to assign neighbor labels.

Fig. 4
figure 4

Confidence sample visualization of the top-3. We visualize the confidence samples involved in constructing meta-features in STL-10. It is worth noting that the confidence samples attend to the correct semantics, which guarantees that the constructed meta-features have stable semantics

Fig. 5
figure 5

Image semantic visualization. We visualize the top-3 images which are used to calculate meta-features on STL-10 and we use a heat map to visualize the location of attention

Fig. 6
figure 6

Effect of cross-entropy loss with label smoothing on CIFAR-10, CIFAR-100 and STL-10. ResNet34 is our backbone. The blue bars represent the classification results directly using cross-entropy loss, and the green bars represent the classification results after using label smoothing

Fig. 7
figure 7

Effect of data augmentation on STL-10. We adopt ResNet18 as the backbone and perform the experiments on STL-10. The left figure shows the effect of our method using two data augmentation strategies, and the right figure shows the effect of applying two data augmentation strategies to SCAN

Ablation study

Several ablation studies are conducted in this section to demonstrate the effect of different settings on our proposed approach.

Effect of label smoothing

We separately evaluate the effect of the plain cross-entropy loss and the cross-entropy loss with label smoothing on the classification results of three datasets and report the results in Fig. 6.

As shown in Fig. 6, our model trained with the label-smoothed cross-entropy loss outperforms the model trained with the plain cross-entropy loss. On CIFAR-10, accuracy increases by 1.2%, NMI by 0.5%, and ARI by 0.6%. On CIFAR-100, accuracy increases by 0.7%, NMI by 0.5%, and ARI by 0.6%. On STL-10, accuracy increases by 1.4%, NMI by 1.8%, and ARI by 2.1%. By adding a moderate amount of noise, the label-smoothed cross-entropy loss reduces overfitting and enhances generalization. The experimental results verify the effectiveness of the label smoothing strategy we use.

Effect of data augmentation

As shown in Fig. 7, to explore the impact of different data augmentations on our method, we conduct an ablation study on STL-10 using ResNet18 as the backbone. We adopt the weak augmentation strategy of FixMatch [40] for weak augmentation, and use the same settings as SCAN for strong augmentation. Specifically, an image is strongly augmented by composing Cutout [41] with four randomly selected transformations from RandAugment [42]. Since SCAN employs a similar data augmentation approach, we also test the impact of the different augmentation schemes on SCAN. In Fig. 7, the left panel shows the performance of our approach with the two augmentation schemes, and the right panel shows the corresponding performance of SCAN.

Overall, the effect of different augmentations on the clustering results of the two methods is small, because both methods build on pre-trained models that are already transformation invariant. However, a detailed comparison shows that our method performs slightly better with weak augmentation than with strong augmentation (86.2% vs. 85.7%). In contrast, on SCAN, strong augmentation outperforms weak augmentation (81.2% vs. 78.4%). This is related to the way pseudo-labels are constructed and assigned: SCAN assigns pseudo-labels by finding the nearest neighbors of each instance, so strong augmentation is needed to ensure that the model learns strong consistency. Our stable-semantics design, by contrast, does not rely on strong augmentation. Firstly, the labels assigned with meta-features as centers are more robust. Secondly, assigning pseudo-labels globally around meta-features reduces the reliance on strong consistency compared to assigning labels to each instance's nearest neighbors. Finally, strong augmentation may lead to larger errors in the classifier's predictions.
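For reference, the sketch below shows how the two pipelines might look in torchvision; the exact crop size and operation counts are assumptions on our part, and RandomErasing is used as a stand-in for Cutout, so this is illustrative rather than the authors' configuration.

```python
# Minimal sketch of weak vs. strong augmentation pipelines (illustrative settings only).
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(96),        # STL-10 resolution; crop-and-flip as in FixMatch's weak branch
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.RandAugment(num_ops=4),       # four randomly selected transformations
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0),         # Cutout-like random occlusion
])
```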

Fig. 8
figure 8

Comparison of the clustering performance for different values of the hyperparameter \(\tau \). For both CIFAR-10 and CIFAR-100 datasets, we use the backbone of ResNet18. The primary axis represents CIFAR-100 and the secondary axis represents CIFAR-10

Effect of the hyperparameter \(\tau \)

In this paper, \(\tau \) is an important hyperparameter for selecting high-confidence samples. To choose an appropriate \(\tau \) for building meta-features, we perform experiments on CIFAR-10 and CIFAR-100 with ResNet18 as the backbone. As shown in Fig. 8, we select 10 values of \(\tau \), covering low to high confidence, for each of the two datasets. The performance trends on the two datasets are basically the same. When \(\tau \) is between 0.1 and 0.5, the clustering accuracy increases with \(\tau \): at low confidence the constructed meta-features cannot represent the semantic features of the category, resulting in errors in label assignment, and as \(\tau \) grows the semantics of the meta-features improve, so the clustering improves as well. When \(\tau \) is between 0.5 and 0.8, the samples participating in meta-feature construction both have high confidence and are numerous enough to ensure the semantic representation of the category, and the clustering performance reaches a relatively stable state. When \(\tau \) is greater than 0.8, although the remaining samples have high confidence, their reduced number makes the meta-features semantically unstable, which in turn degrades the clustering performance.

Reliance on backbone network

To examine how much our model relies on the backbone network, we test two ResNets of different depths and report the results in Table 2, where (*) indicates ResNet34 as the backbone and (**) indicates ResNet18 as the backbone. The comparison shows that the representation ability of the backbone network contributes to the clustering performance. On datasets with smaller image sizes, such as CIFAR-10 and CIFAR-100, ResNet18 outperforms ResNet34 as the backbone: ResNet18 already has sufficient capacity to extract discriminative features from small images, and a deeper network is prone to overfitting. On datasets with larger image sizes, such as STL-10, ImageNet-10, and Tiny-ImageNet, ResNet34, with its stronger representation ability, performs better than ResNet18.

Conclusion

Unstable semantic pseudo-label assignment severely limits the performance of image clustering. In this paper, we first propose the concept of “meta-features”, which are features with stable semantic information, and combine discriminative instance features with class semantic features to achieve deep clustering with stable semantics. We further propose a pseudo-supervised clustering algorithm based on meta-features. The framework uses these stable features to assign pseudo-labels while optimizing meta-features and pseudo-labels in a pseudo-supervised manner, achieving a direct mapping from features to stable semantic labels. Experiments demonstrate that our proposed meta-feature-based deep clustering significantly improves classification accuracy. Furthermore, our method offers a path toward stable semantic self-learning, and we plan to extend it to other tasks and applications in future work.