Abstract
Deep spectral clustering techniques are considered among the most efficient clustering algorithms in the data mining field. The similarity between instances and the disparity among classes are two critical factors in clustering. However, most current deep spectral clustering approaches do not take both sufficiently into consideration. To tackle this issue, we propose the Semantic Spectral clustering with Contrastive learning and Neighbor mining (SSCN) framework, which performs instance-level pulling and cluster-level pushing cooperatively. Specifically, we obtain the semantic feature embedding using an unsupervised contrastive learning model. Next, we obtain the nearest neighbors locally and globally, and the neighbors, together with data augmentation information, collaboratively enhance effectiveness at both the instance level and the cluster level. The spectral constraint is applied through orthogonal layers to satisfy conventional spectral clustering. Extensive experiments demonstrate the superiority of our proposed spectral clustering framework.
1 Introduction
As one of the most essential unsupervised learning techniques, spectral clustering (SC) finds extensive applications in machine learning and pattern recognition [1,2,3]. Although conventional shallow spectral clustering methods [4,5,6] have an elegant theoretical foundation and achieve promising clustering accuracy, they suffer from limited scalability and generalization. To overcome these issues, two-stage and end-to-end one-stage deep spectral clustering methods have been proposed. In the two-stage framework [7,8,9,10], deep SC methods first learn the spectral embedding with a deep network and then obtain the cluster indicator with a conventional clustering algorithm. Specifically, Yang and Li et al. utilize a dual autoencoder network or fully convolutional auto-encoders in the initial stage for feature learning, and then cluster the features or cluster centers by soft k-means scores [9, 10]. Duan et al. generate the deep embedding by learning a deep autoencoder, then estimate the cluster number with a softmax autoencoder, and combine metric learning to build a more powerful similarity graph [7, 8]. In the end-to-end framework, deep SC methods couple cluster indicator learning with spectral embedding learning, so that the two improve each other iteratively in a unified deep algorithm [11,12,13,14,15]. For example, Ji and Hu et al. propose to maximize the mutual information between an instance and its data augmentations [11, 12], and our previously proposed methods integrate cluster indicator learning with generative adversarial feature learning via a Bayesian framework [13, 14].
By exploiting these deep frameworks, spectral clustering attains scalability and can handle out-of-sample, unseen data points. Despite these advantages, certain limitations remain: (1) because low-level features are overemphasized, autoencoder-based methods still lack a semantically discriminative feature space; (2) in the end-to-end framework, network training is very sensitive to initialization, which leads to biased clustering results; (3) previous approaches leverage only instance-level similarity without considering cluster-wise semantic differences, which help to enhance the disparity among classes.
To handle the above shortcomings, we integrate contrastive learning and spectral clustering into one deep learning framework. To begin with, we pre-train an unsupervised representation learning model to extract a semantic feature embedding. Then, we search for the nearest neighbor instances both globally and locally according to the similarity between these semantically meaningful features. Specifically, we fuse data augmentation and neighbor searching together in our loss. Moreover, besides pulling intra-class instances together, our work also adopts an inter-class pushing strategy. We apply our Instance and Cluster level Nearest Neighbor Comparing (ICNNC) loss to optimize spectral clustering results while enforcing orthogonality. The main contributions of our work are as follows:
- We propose a deep spectral clustering framework that fuses the contrastive learning pattern and neighbor relation mining into spectral clustering, revealing that the core goals of the contrastive loss, neighbor utilization, and the spectral clustering loss are consistent.
- Our proposed model considers not only intra-class cohesion (instance level) but also inter-class separation (cluster level), pulling instances of the same class closer while pushing clusters apart.
- We provide a novel contrastive strategy built into the spectral clustering pattern.
- Augmentation information and neighbors contribute collaboratively to our contrastive strategy.
2 Related Works
Spectral Clustering (SC) [16, 17] has a solid theoretical foundation. Constructing a good similarity graph is crucial for SC, and spectral clustering methods can usually be categorized into two distinct frameworks: traditional methods and deep methods. Traditional ones often use mathematical approaches [18,19,20,21], such as constructing similarity graphs from simple Euclidean distances or imposing low-rank constraints [22] on graphs, which bring limited improvements. Fan et al. [23] focus on how to find reliable affinity matrices through a varied set of affinity-matrix-construction methods. Yet it is still hard for these methods to outperform deep methods, due to the high cost of constructing the affinity matrix and performing eigendecomposition on large-scale real-world datasets. Some work [24] utilizes a matrix completion algorithm that rapidly calculates the similarity matrix to improve computational efficiency. Since spectral clustering does not explicitly compute the mapping function, deep spectral clustering learns a non-linear embedding function using deep neural networks rather than linear eigenvalue decomposition. SpectralNet [25] adopts a siamese network to construct the similarity matrix. Its performance heavily relies on the quality of the similarity graph. To address this issue, Yang et al. [10] use the embedding layer features of an autoencoder as the input to SpectralNet. This allows joint optimization, because the learning of the autoencoder embedding features is integrated with the learning of the spectral embedding features. Huang et al. [26] extend SpectralNet to multi-view scenarios, where each view is processed by an individual SpectralNet and the total loss function is obtained by weighting and summing the individual SpectralNet loss functions, achieving joint training of multiple views. Yang et al. [27] extend the spectral embedding approach by minimizing the posterior probability distribution between instances rather than minimizing the Euclidean distance in the embedding space, and then constrain the hidden variables of the variational autoencoder with a Gaussian mixture model. Zhang et al. [28] use spectral clustering to obtain preliminary clustering results, which are then treated as pseudo-labels and used to train a neural network in a supervised manner. In addition to directly using a spectral clustering loss to train neural networks, some work [8, 29] uses spectral clustering as a post-processing method. Duan et al. [8] first use an autoencoder to learn the embedding features of the instances, and then directly use spectral clustering to map the learned embedding features into the spectral embedding space. Affeldt et al. [29] use multiple autoencoders to learn embedding features separately, then merge the multiple sets of embedding features to calculate the adjacency matrix, and lastly perform spectral clustering on the merged adjacency matrix to obtain clustering results.
Most deep spectral clustering methods utilize autoencoders [8, 10, 29] to obtain feature embeddings, which limits the quality of the features. Thus, we choose a contrastive framework to extract meaningful semantic embeddings for deep spectral clustering.
While spectral clustering has undergone a diverse range of developments that have yielded impressive experimental results, it is still limited to the instance level. From the perspective of the ultimate goal of clustering (sufficient intra-class cohesion and sufficient inter-class separation), the above spectral clustering methods do not consider cluster-level disparity.
Contrastive Learning has emerged as one of the most efficient unsupervised learning paradigms and has witnessed significant progress in representation learning. The central principle of contrastive learning is to find an embedding space in which the similarity between negative pairs is minimized and the similarity between positive pairs is maximized. A breakthrough in this field is SimCLR [30], which leverages instance discrimination as its pretext task. It generates two views of each instance through a diverse range of data augmentations, and then maximizes the similarity between the two augmentations of the same instance while minimizing the similarities with views from other instances. In addition, MoCo [31] builds a dynamic, large-scale dictionary of negative examples by maintaining a queue of past examples. This dictionary provides negative pairs for contrastive loss optimization, and a moving-averaged encoder is used to ensure dictionary consistency.
Research has rarely explored the integration of contrastive paradigms into spectral clustering. Because contrastive learning frameworks have a strong semantic feature learning ability, which gives the learned features high discriminability, we adopt a contrastive learning framework as our pre-training model. Obtaining a semantic feature embedding in deep spectral clustering is as important as constructing a good similarity graph in conventional spectral clustering.
Moreover, Tan et al. [32] provide the theoretical insight that the original InfoNCE loss [33] in contrastive learning is equivalent to spectral clustering on the similarity graph. They use a Markov random field (MRF) to convert a graph into a distribution over subgraphs and, instead of directly comparing two graphs, use cross entropy to compare the adjacency matrices of the subgraphs. In this way, the cross-entropy loss acts as a bridge connecting the InfoNCE loss and spectral clustering. Therefore, we adopt the InfoNCE loss at the clustering phase to imitate the process of spectral clustering.
3 Our Proposed Method
A common scenario is that strong instance similarity can appear not only within a class but also across classes. Traditional spectral clustering relies solely on instance similarity, which may pull together falsely similar pairs. For example, two instances (a white dog and a slim white deer) located at the edges of their respective classes look alike and have a high similarity value, as shown in Fig. 1. Concentrating solely on the instance level would blur the boundary between the two classes and ultimately result in clustering degeneration. Therefore, it is essential to incorporate cluster-level pushing to make the class boundaries clear.
We introduce our proposed method in the following sections. Initially, we train the unsupervised feature embedding learning model with a contrastive loss. Subsequently, utilizing the trained model, we retrieve nearest neighbors globally and locally based on the similarity between the extracted semantically meaningful features. Afterwards, we employ Instance and Cluster-level Nearest Neighbor Comparing (ICNNC) loss to optimize spectral clustering results while enforcing orthogonality. The framework is briefly outlined in Fig. 2, and the key notations and descriptions are summarized in Table 1.
To sum up, we employ contrastive learning as our pre-trained model for feature extraction due to its strong ability in generating semantically meaningful embeddings. Additionally, we employ the InfoNCE loss function during the clustering phase based on the theoretical equivalence between contrastive learning and spectral clustering [32]. The incorporation of neighbors is aimed at enhancing intra-class cohesion. The integration of augmentation and neighbor information is driven by the idea that augmentation strengthens a feature’s representation by introducing additional, self-transformed information, while neighbors contribute to the aggregation of local context information.
SpectralNet [25] approximates spectral clustering by training a mapping function that embeds input data into the eigenspace of the corresponding graph Laplacian matrix and then clusters them. In our method, we enhance the embedding phase as the first step, and the clustering network includes constrained optimization in which output orthogonality is enforced by a linear layer. The weights of this layer are determined through the QR decomposition.
3.1 Unsupervised Feature Learning Model
Contrastive learning has emerged as an outstanding framework for unsupervised learning, exhibiting remarkable performance in representation learning. To overcome the limitations of end-to-end clustering methods, which are often sensitive to network initialization, we choose to train an unsupervised feature embedding learning model first and then cluster over those semantically meaningful features. A contrastive learning framework (i.e. SimCLR) is adopted to fully pre-train the network. Specifically, given a data instance \(x_i\) and a random data augmentation \(T^{\prime }\), we obtain the positive pair \(x_i\) and \(x_i^{\prime }=T^{\prime }\left( x_i\right) \), and therefore the respective features \(h_i\) and \(h_i^{\prime }\). For the SimCLR framework, the loss function is as follows:
Here, B denotes the batch size, \({\textit{sim}}()\) is the similarity function (i.e. cosine similarity), and \(\tau \) denotes the temperature parameter. \(\textbf{1}_{j \ne i}\) is an indicator function that equals 1 when \({j \ne i}\). In this loss, we have one positive pair and treat the remaining \(2(B-1)\) augmented samples within a batch as negatives. After the pre-trained model has finished training, we extract features directly from the backbone and mine the top-K nearest neighbors based on these extracted features.
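As an illustration of this contrastive objective, the following NumPy sketch computes a batch loss in the standard SimCLR (NT-Xent) form: each of the 2B views is scored against its positive with a softmax over all other views. The function name `nt_xent_loss`, the default temperature, and the averaging over all 2B views are assumptions made for the example and are not taken from the authors' implementation.

```python
import numpy as np

def nt_xent_loss(h, h_aug, tau=0.5):
    """Mean NT-Xent loss over the 2B views of one batch (h, h_aug: (B, D))."""
    z = np.concatenate([h, h_aug], axis=0)                 # (2B, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # unit vectors -> cosine similarity
    sim = z @ z.T / tau                                    # (2B, 2B) scaled similarities
    np.fill_diagonal(sim, -np.inf)                         # indicator 1_{j != i}: drop self terms
    b = h.shape[0]
    pos = np.concatenate([np.arange(b, 2 * b), np.arange(b)])  # index of each view's positive
    row_max = sim.max(axis=1, keepdims=True)               # log-sum-exp stabilisation
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * b), pos]))
```

With orthogonal features, the loss is lower when each view is paired with its own augmentation than when the pairing is shuffled, which is the behaviour the pre-training stage relies on.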
In addition, the contrastive loss strategy is not only applied to the pre-trained unsupervised feature extraction model, but also extended to the subsequent stage of cluster-level neighbor pushing.
3.2 Clustering While Matching Nearest Neighbors
After searching the nearest neighbors, from another perspective, those meaningful semantic features can be regarded as supervised information and serve as positive instances for the original instances. In general, we aim to leverage the nearest neighbor information to its fullest extent.
Based on the insight that the rows of the feature matrix can function as the instances’ soft labels [34], while the columns can be regarded as cluster representations, we jointly optimize the instance-level and cluster-level losses. Thus, both goals (pulling intra-class instances together and pushing different classes apart) are achieved in one framework.
3.2.1 Instance-Level
We utilize the weights of trained feature extraction model \(\varPhi _{{\textit{pre}}}\), which is exactly the backbone of the contrastive model. Then in clustering network \(\varPhi _{{\textit{clu}}}\), we feed those instance pairs into the network. By applying the Instance and Cluster level Nearest Neighbor Comparing (ICNNC) loss and orthogonal constraint, we construct the spectral clustering network.
In the InfoNCE loss [33], there is only one definite positive pair, and the remaining samples in a batch are all treated as negative samples. However, this setup introduces a potential issue: instances of the same category that appear in the same batch are used as negative samples, which creates false negative pairs and degrades performance. A straightforward and effective way to pull instances of the same class closer is therefore to maximize the similarity between instances and their neighbors:
The clustering assignments are \(p \in \mathbb {R}^{B \times C}\), where C is the number of classes and B is the batch size; p and \(\mathcal {N}_{{\textit{pre}} }(p)\) are the clustering assignments of the instance’s feature h and its neighbors’ feature \(\mathcal {N}_{{\textit{pre}} }(h)\), respectively. All neighbors are searched on the basis of the feature representation, and the features are subsequently fed into the clustering network. \(\mathcal {N}_{{\textit{pre}} }\) indicates that the neighbors are searched using the pre-trained unsupervised representation model \(\varPhi _{{\textit{pre}}}\). The \(\langle \cdot , \cdot \rangle \) operator denotes the dot product, which is used to evaluate the similarity between two items.
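A minimal sketch of this instance-level pulling term, assuming it takes the SCAN-style form of a negative log dot product between each instance's assignment and its neighbor's assignment, averaged over the batch. The function name, the `eps` stabilizer, and the exact reduction are illustrative assumptions rather than the authors' Eq. (2).

```python
import numpy as np

def neighbor_pull_loss(p, p_neigh, eps=1e-8):
    """p, p_neigh: (B, C) soft cluster assignments of instances and their neighbors."""
    agreement = np.sum(p * p_neigh, axis=1)          # row-wise dot product <p_i, N(p)_i>
    return float(-np.mean(np.log(agreement + eps)))  # small when assignments agree
```

The loss approaches zero when an instance and its neighbor are confidently assigned to the same cluster and grows large when they are assigned to different clusters.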
Local and Global Nearest Instance Pulling Assume that \(h \in \mathbb {R}^{B \times D}\) denotes the embedding features (prior to the clustering network) of p, where D is the feature dimension. Locally, we select the nearest neighbors \(\mathcal {N}_{{\textit{local}}}(h)\) per batch from the feature h. By tracking the batch index of \(\mathcal {N}_{{\textit{local}}}(h)\), we obtain the corresponding clustering assignments \(\mathcal {N}_{{\textit{local}}}(p)\). Therefore, the loss in Eq. (2) can be expressed as follows. It is worth noting that \(\mathcal {N}_{{\textit{eighbors}}}(p)\) can be substituted with \(\mathcal {N}_{{\textit{local}}}(p)\) batch-wise or \(\mathcal {N}_{{\textit{global}}}(p)\) epoch-wise, as described later:
We leverage the aforementioned neighbor \(\mathcal {N}_{{\textit{pre}} }(p)\), which is searched from the pre-trained contrastive model, and concatenate them together along the batch dimension. Furthermore, we take the augmentation version \({\textit{Aug}}(p)\) into consideration.
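The batch-wise neighbor lookup described above can be sketched as follows, assuming cosine similarity on the features h and a single nearest neighbor per instance. The function name `local_neighbors` and the choice of cosine similarity are assumptions for illustration; the text does not fix the metric.

```python
import numpy as np

def local_neighbors(h, p):
    """h: (B, D) features, p: (B, C) assignments. Returns neighbor indices and N_local(p)."""
    z = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = z @ z.T                       # pairwise cosine similarity within the batch
    np.fill_diagonal(sim, -np.inf)      # an instance is not its own neighbor
    idx = sim.argmax(axis=1)            # batch index of each instance's nearest neighbor
    return idx, p[idx]                  # tracking the index yields the neighbor's assignment
```

Because the neighbors are re-selected from the current batch features, they can change from one iteration to the next as the features evolve.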
Note that each augmented view is obtained by applying four operations randomly selected from the following: horizontal shear, vertical shear, horizontal translation, vertical translation, rotation, auto-contrast, color inversion, solarization, posterization, brightness adjustment, sharpness adjustment, and histogram equalization. Then we can extend our local instance loss as follows:
Here, \(\left[ \begin{array}{c} p\\ {\textit{Aug}}(p) \end{array}\right] \in \mathbb {R}^{2B \times C}\). Similarly, we can obtain the nearest neighbors globally to pull intra-class instances closer. At the end of each epoch, we extract the most recent features of the whole dataset \(f \in \mathbb {R}^{N \times D}\) and their clustering assignments \(u \in \mathbb {R}^{N \times C}\), where N denotes the size of the training set. Globally, we search for the nearest neighbors in f, and \(\mathcal {N}_{{\textit{global}}}(u)\) denotes their predictions at each epoch. We then obtain the predictions \(\mathcal {N}_{{\textit{global}}}(u)_{b } \in \mathbb {R}^{B \times C}\) by tracking the batch index. When the searched \(\mathcal {N}_{{\textit{global}}}(u)\) is in a specific batch, the global instance loss can be extended as follows:
where \(I_{{\textit{neigh}}_{{\textit{global}}}} \in \mathbb {R}^{4B \times C}\) is formed by concatenating four \(\mathcal {N}_{ {{\textit{global}}}}(u)_{b}\) together, in order to stay consistent with \(I_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}\). Hence, we integrate the semantic nearest neighbors locally and globally, as well as the augmentation information.
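The augmentation sampling used in this section (four operations drawn at random from the listed set) can be sketched as below. Only the selection step is shown; the operation names are plain labels, `sample_augmentations` is a hypothetical helper, and drawing without replacement is an assumption, since the text does not say whether an operation may repeat.

```python
import random

# Labels for the twelve operations listed in the text (names are illustrative).
AUGMENTATION_OPS = [
    "shear_x", "shear_y", "translate_x", "translate_y", "rotate",
    "auto_contrast", "invert", "solarize", "posterize",
    "brightness", "sharpness", "equalize",
]

def sample_augmentations(num_ops=4, rng=random):
    """Pick `num_ops` distinct operations uniformly at random."""
    return rng.sample(AUGMENTATION_OPS, num_ops)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible across runs.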
3.2.2 Cluster-Level
When the features are projected to a space with dimensionality equal to the number of clusters, the i-th column of the feature matrix represents its likelihood of belonging to the i-th cluster. From a clustering perspective, we intuitively try to push each cluster (each column) away from each other. Our cluster-wise contrastive loss is as below:
\(q \in \mathbb {R}^{C \times B}\) is the transpose of p, and \(\mathcal {N}_{{{\textit{pre}} }}(q) \in \mathbb {R}^{C \times B}\), specifically, \({q}_{i}\) and \(\mathcal {N}_{{{\textit{pre}} }}(q)_{i} \in \mathbb {R}^{B}\). In contrast to the original form of the contrastive loss, we replace the role of the initial positives with the nearest neighbors, thereby incorporating more relevant and meaningful instances. By leveraging both local and global neighbors, we effectively promote the separation of different classes.
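A sketch of the cluster-level idea, assuming an InfoNCE-style loss applied to the rows of q (the columns of p): the positive for cluster i is the same cluster's row in the neighbors' assignments, and every other cluster row acts as a negative. The function name, the cosine normalization, and the temperature are illustrative assumptions; the authors' exact formulation is not reproduced here.

```python
import numpy as np

def cluster_contrastive_loss(p, p_neigh, tau=0.5, eps=1e-8):
    """p, p_neigh: (B, C) assignments of instances and of their neighbors."""
    q = np.concatenate([p.T, p_neigh.T], axis=0)               # (2C, B): one row per cluster view
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)   # unit-length cluster rows
    sim = q @ q.T / tau
    np.fill_diagonal(sim, -np.inf)                             # drop self-similarity
    c = p.shape[1]
    pos = np.concatenate([np.arange(c, 2 * c), np.arange(c)])  # same cluster, other view
    row_max = sim.max(axis=1, keepdims=True)                   # log-sum-exp stabilisation
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * c), pos]))
```

Under these assumptions, well-separated cluster columns give a lower loss than identical columns, so minimizing the term pushes clusters apart. Since every negative is a different cluster by construction, no negative pair can contain two views of the same cluster.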
Local and Global Cluster Pushing Analogously, we have \(\mathcal {N}_{{{\textit{pre}} }}(q) \in \mathbb {R}^{C \times B}\) from the pre-trained contrastive model and \(\mathcal {N}_{{{\textit{local}} }}(q) \in \mathbb {R}^{C \times B}\) selected per batch, where each row of q is the corresponding column of p. Therefore, the local-level class contrastive loss (see Fig. 3) is:
When the global-level nearest neighbors have been searched and located by the batch index, we obtain \(\mathcal {N}_{ {{\textit{global}}}}(u)_{b}\in \mathbb {R}^{B \times C}\), where \(u \in \mathbb {R}^{N \times C}\), and likewise \(\mathcal {N}_{ {{\textit{global}}}}(v)_{b} \in \mathbb {R}^{C \times B}\), where \(v \in \mathbb {R}^{C \times N}\) is the transpose of u. Therefore, the global-level class contrastive loss (see Fig. 4) is:
where \(C_{{\textit{neigh}}_{{\textit{global}}}} \in \mathbb {R}^{C \times 4B}\) is formed by concatenating four \(\mathcal {N}_{ {{\textit{global}}}}(v)_{b}\) together, in order to stay consistent with \(C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}\). Furthermore, this avoids the false negative pair problem and the resulting performance degeneration, since no negative pair contains the same cluster twice. Each column represents a different cluster, and pushing apart those different classes is exactly our target.
Besides, we utilize an entropy term to avoid assigning all samples into a single cluster. This term promotes a more uniform distribution of predictions across the clusters \(\mathcal {C}\), and \(M(p) \in \mathbb {R}^{1 \times C}\) represents the mean of p over the batch dimension.
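The entropy regularizer can be sketched as follows, assuming it is written as the negative entropy of the batch-mean assignment \(M(p)\), so that minimizing it spreads predictions across clusters. The sign convention, the `eps` stabilizer, and the function name are assumptions for this sketch.

```python
import numpy as np

def negative_entropy(p, eps=1e-8):
    """p: (B, C) soft assignments. Returns sum_c M(p)_c * log M(p)_c."""
    mean_assign = p.mean(axis=0)                               # M(p), shape (C,)
    return float(np.sum(mean_assign * np.log(mean_assign + eps)))
```

The value is lowest (equal to \(-\log C\)) when the batch is spread evenly over the C clusters and rises to about zero when every sample falls into a single cluster.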
Therefore, we fully use the semantic nearest neighbors locally and globally at both instance level and cluster level. Consequently, we seek to minimize the overall loss as follows:
\(\lambda \) denotes the hyper-parameter. The collaborative influence of \(\mathcal {N}_{{{\textit{pre}}}}\), \(\mathcal {N}_{{{\textit{local}}}}\), \(\mathcal {N}_{{{\textit{global}} }}\) and the augmentation information allows us to jointly take advantage of two-stage methods (separate feature learning and clustering) and end-to-end methods. Although end-to-end methods can obtain clustering-oriented features, the quality of the extracted features cannot be well ensured and could be limited by the network structure. In our work, we first extract meaningful semantic features, which avoids the above shortcoming; then, during the clustering iterations, \(\mathcal {N}_{{{\textit{local}}}}\) and \(\mathcal {N}_{{{\textit{global}}}}\) can be replaced with the most suitable clustering-oriented semantic features. The training process of SSCN is summarized in Algorithm 1.
3.3 Deep Spectral Clustering
Spectral clustering consists of three steps: first, construct the similarity graph W and the Laplacian matrix \(L=D-W\), where D denotes the degree matrix; then, compute the first k eigenvectors of the Laplacian matrix; and finally, apply k-means to the rows of the matrix composed of these k eigenvectors. The loss function of spectral clustering can be formulated as:
\(y_{i}\) is the output of the spectral clustering network \(F_\theta \), and m is the size of minibatch. \(W_{i, j}=w\left( x_i, x_j\right) \), it captures the similarity between \(x_{i}\) and \(x_{j}\). As for w, our goal is to have similar points \(x, x^{\prime }\) (i.e., those with large value of \(w\left( x, x^{\prime }\right) \)) mapped into an embedding space where they are close to each other.
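For concreteness, the sketch below assumes the SpectralNet [25] form of this objective, \(\frac{1}{m^2}\sum_{i,j} W_{i,j}\Vert y_i-y_j\Vert ^2\), and checks numerically that it equals \(\frac{2}{m^2}\,\mathrm{tr}(Y^T L Y)\) with \(L=D-W\). The function names are illustrative.

```python
import numpy as np

def spectral_loss(w, y):
    """w: (m, m) symmetric affinities, y: (m, k) network outputs."""
    m = y.shape[0]
    sq_dist = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # ||y_i - y_j||^2
    return float((w * sq_dist).sum() / m ** 2)

def spectral_loss_laplacian(w, y):
    """Same quantity written with the unnormalized Laplacian L = D - W."""
    m = y.shape[0]
    lap = np.diag(w.sum(axis=1)) - w
    return float(2.0 * np.trace(y.T @ lap @ y) / m ** 2)
```

The two functions agree for any symmetric W, which is why pulling similar points together in the embedding corresponds to minimizing a Laplacian quadratic form.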
Analogously, we would like instance \(x_{i}\) and its neighbors \(\mathcal {N}_{{{\textit{pre}}}}\), \(\mathcal {N}_{{{\textit{local}}}}\), \(\mathcal {N}_{{{\textit{global}}}}\), as well as its augmentation information, to be embedded close to each other. With the same intention, we have the instance-level loss functions introduced previously in Eqs. (2), (3), (4) and (5), which not only achieve the same goal but also consider it more thoroughly.
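For comparison, the conventional three-step pipeline outlined at the start of this section can be sketched in NumPy as below. The Gaussian kernel, the unnormalized Laplacian, and the small farthest-point-initialized k-means are choices made for this example, not details taken from the paper.

```python
import numpy as np

def spectral_embedding(x, k, sigma=1.0):
    """Steps 1-2: build W and L = D - W, then take the first k eigenvectors."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))       # Gaussian similarity graph
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w           # unnormalized Laplacian
    _, vecs = np.linalg.eigh(lap)              # eigenvalues in ascending order
    return vecs[:, :k]

def kmeans(z, k, iters=20):
    """Step 3: a tiny Lloyd's k-means with deterministic farthest-point init."""
    centers = [z[0]]
    for _ in range(1, k):
        dist = np.min([((z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(z[dist.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((z[:, None, :] - centers[None]) ** 2).sum(axis=-1).argmin(axis=1)
        centers = np.array([z[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```

The dense eigendecomposition in `spectral_embedding` is the step whose cost grows quickly with dataset size, which is the scalability limitation that deep variants address by learning the embedding with a network.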
Furthermore, spectral clustering imposes an orthogonal constraint to prevent trivial solutions where all data is mapped to the same output vector:
Y is an \(m \times k\) output matrix whose i-th row is \(y_{i}^T\). To enforce orthogonality, we use the final layer of the network. This layer receives input from k units and produces k outputs; it functions as a linear layer whose weights are set so as to produce the orthogonalized output Y for each batch. Let the \(m \times k\) matrix \(\tilde{Y}\) denote the inputs to this layer; a linear map that orthogonalizes the columns of \(\tilde{Y}\) is obtained from its QR decomposition. Specifically, the Cholesky decomposition can be employed to obtain the QR decomposition of any matrix A for which \(A^T A\) is full rank:
wherein C is a lower triangular matrix, and Q is then obtained by setting \(Q=A\left( C^{-1}\right) ^T\). Therefore, the last layer performs a right multiplication of the matrix \(\tilde{Y}\) by \(\sqrt{m}(\tilde{L}^{-1})^T\) to orthogonalize it. Here \(\tilde{L}\) is derived from the Cholesky decomposition of \(\tilde{Y}^T \tilde{Y}\), and the \(\sqrt{m}\) factor is introduced to fulfill Eq. (12). Every orthogonalization step involves adjusting the weights of the final layer using the QR decomposition. After the training of the neural network is complete, all weights are frozen, including the weights of the last layer, which then works solely as a linear layer. This layer also helps to cultivate more distinguishable clustering assignments.
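The orthogonalization step can be checked numerically. The sketch below uses the standard Cholesky factorization \(\tilde{Y}^T\tilde{Y} = \tilde{L}\tilde{L}^T\) and assumes the constraint has the SpectralNet form \(\frac{1}{m} Y^T Y = I\); `orthogonalize` is a hypothetical helper name, and the explicit matrix inverse is used only for readability.

```python
import numpy as np

def orthogonalize(y_tilde):
    """Map an (m, k) batch to Y with (1/m) * Y^T Y = I via a Cholesky factor."""
    m = y_tilde.shape[0]
    chol = np.linalg.cholesky(y_tilde.T @ y_tilde)   # lower triangular: Y~^T Y~ = chol @ chol.T
    weights = np.sqrt(m) * np.linalg.inv(chol).T     # weights of the final linear layer
    return y_tilde @ weights, weights
```

The returned `weights` matrix is \(k \times k\), so once it is frozen the layer is an ordinary linear map that can be applied to unseen batches.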
Moreover, contrastive learning generally feeds a vast number of negative pairs into the network in order to capture substantially distinguishing features, and data augmentation is needed to construct positive pairs. In our work, we collaboratively use data augmentation and nearest neighbor mining [Eqs. (4), (5), (7), (8)], which helps reduce the cost of using a large number of negative pairs. From the perspective of optimization objectives, spectral clustering and contrastive learning can be integrated into a unified optimization objective.
4 Experiments
4.1 Datasets
In the experiments, we evaluate our method on three widely used image datasets:
4.1.1 CIFAR-10
This image dataset contains 50,000/10,000 RGB images for training and testing, each measuring \(32\times 32\) pixels. The dataset is divided into 10 classes, with each class containing 6000 images: 5000 training samples and 1000 testing samples.
4.1.2 CIFAR-100
Another image dataset extends CIFAR-10 by including 100 classes. It has 50,000/10,000 RGB images with size \(32\times 32\) for training and testing. Each class has 500 training samples and 100 testing samples, and there is no overlap between the classes. The 20 superclasses on CIFAR-100 are regarded as ground truth.
4.1.3 STL-10
This ImageNet-sourced dataset contains 13,000 samples of size \(96\times 96\) from 10 classes. Each class contains 500 training images and 800 test images. Unlike CIFAR-10 and CIFAR-100, the images are not cropped or rescaled. The 10 classes are airplane, bird, car, cat, deer, dog, horse, monkey, ship, and truck.
4.2 Evaluation Metrics
We adopt three widely-used clustering performance metrics in our experiments, including Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). The dominant class label determines the assigned predicted label. Values for these metrics range from 0 to 1, with better performance indicated by higher values.
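A minimal sketch of the accuracy computation under the dominant-label rule stated above: each predicted cluster is mapped to the ground-truth class that occurs most often in it. The function name is hypothetical, and many clustering papers instead use an optimal one-to-one (Hungarian) matching, which this sketch does not implement.

```python
from collections import Counter, defaultdict

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples whose cluster's dominant class equals their true class."""
    members = defaultdict(list)
    for truth, cluster in zip(y_true, y_pred):
        members[cluster].append(truth)
    # Dominant ground-truth class inside each predicted cluster.
    dominant = {c: Counter(t).most_common(1)[0][0] for c, t in members.items()}
    correct = sum(dominant[c] == t for t, c in zip(y_true, y_pred))
    return correct / len(y_true)
```

Note that the cluster indices themselves carry no meaning here: relabeling the clusters leaves the score unchanged.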
4.3 Experimental Setup
We implement our work in PyTorch 1.4.0 and apply the Adam optimizer with a learning rate of \(10^{-4}\) and a decay of \(10^{-4}\). For the small datasets CIFAR-10/100 and STL-10, we choose SimCLR as the pre-trained model with a standard ResNet18 backbone, and the 20 nearest neighbors are searched based on the contrastive feature learning framework. The Faiss library [50] is used to mine the neighbors. For the clustering network, we train all models for 100 epochs, with a batch size of 100 on STL-10, 200 on CIFAR-10, and 250 on CIFAR-100. To accelerate the training process, we set K to 1 for both local and global K-nearest-neighbor searching. The weight \(\lambda \) of the entropy loss is set to 5. Our strong data augmentations are obtained by applying four randomly selected RandAugment [51] transformations.
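The settings listed in this section can be collected into a small configuration object, as sketched below. The class and field names are hypothetical and chosen for readability; only the numeric values come from the text, and the meaning of the "decay" value (for example, weight decay) is not specified there.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusteringConfig:
    dataset: str
    batch_size: int
    epochs: int = 100
    learning_rate: float = 1e-4
    decay: float = 1e-4            # stated only as "decay" in the text
    pretrain_neighbors: int = 20   # neighbors mined from the pre-trained features
    k_local_global: int = 1        # K for local and global neighbor search
    entropy_weight: float = 5.0    # lambda
    num_augment_ops: int = 4       # randomly selected augmentation operations

CONFIGS = {
    "stl10": ClusteringConfig("stl10", batch_size=100),
    "cifar10": ClusteringConfig("cifar10", batch_size=200),
    "cifar100": ClusteringConfig("cifar100", batch_size=250),
}
```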
4.4 Clustering Performance Comparison
We compare the clustering results of our SSCN method with the following baselines: conventional methods (K-Means [4], SC [16], AC [35], NMF [5]), deep network methods (AE [37], DAE [36], DCGAN [38], DeCNN [39], JULE [40], DEC [41], DAC [42], ADC [43], DDC [44], DCCM [45], IIC [11], PICA [46]), a recent pre-trained feature-based method (SCAN [49]), IDFD [47], which combines instance discrimination and feature decorrelation, and MiCE [48], which mixes separate contrastive experts in a probabilistic way. According to the results in Table 2, benefiting from the ICNNC loss, our approach shows better results than the others on all three evaluation metrics across the three datasets (CIFAR-10, CIFAR-100 and STL-10). In particular, SSCN surpasses MiCE by 1.0% on CIFAR-10 and 4.8% on CIFAR-100 in terms of ACC, and achieves improvements of 1.4% on CIFAR-10 and 4% on CIFAR-100 in terms of NMI. On the STL-10 dataset, our method even surpasses the supervised results.
One significant factor contributing to our method’s superiority over the supervised approach on the STL-10 dataset is the pre-trained self-supervised semantic embedding space. If we remove the pre-training stage, the clustering performance on STL-10 drops by 25%, leaving only 56.1% ACC. If we use the pre-trained model followed simply by K-Means clustering, it gains 9.7% in ACC. Moreover, SCAN [49] also achieves a close performance on STL-10; building on SCAN, our method incorporates instance-level cohesion and cluster-level repulsion, considers neighbors mined batch-wise and epoch-wise, and is constrained by an orthogonal layer, and it also exceeds the supervised method. To be clear, the supervised model in the last row of Table 2 is not built upon pre-trained weights.
4.5 Ablation Study
Ablation studies are conducted in this section to explore the impact of different choices in our approach.
4.5.1 Effectiveness of Proposed Loss
We individually evaluate the impact of each proposed loss function on STL-10 and show the results in Table 3. Both the instance-level loss \(\mathcal {L}_{{\textit{instan}}}\) and the class-level loss \(\mathcal {L}_{{\textit{class}}}\) are essential, while \(\mathcal {L}_{{\textit{instan}}}\) has the more significant impact: with \(\mathcal {L}_{{\textit{instan}}}\), we obtain a 7.4% improvement in ACC. Moreover, the entropy loss \(\mathcal {L}_{{\textit{entropy}}}\) effectively prevents the network from falling into the trivial solution.
4.5.2 Effectiveness of Clustering Heads
To obtain robust predictions, we implement a multiple clustering heads strategy. The corresponding results are given in Table 4. As the number of clustering heads increases from 1 to 5, the performance improves to a certain degree; however, when the number continues to grow, the performance may either remain steady or degenerate (see the CIFAR-10 and CIFAR-100 rows). Since all clustering heads share the same global features, excessive heads may produce more unstable results. Empirically, we set the number of clustering heads to 5.
4.6 Qualitative Study
This section comprises several studies to investigate our work directly and visually, including class confusion matrices and top-3 most confident instances.
4.6.1 Confusion Matrices
We visualize the confusion matrices on three different datasets in Fig. 5. All three confusion matrices exhibit a noticeable block-diagonal structure, indicating that our method effectively clusters instances into their corresponding semantic classes. The most commonly confused categories in CIFAR-10 (Fig. 5a) and STL-10 (Fig. 5c) are ‘cat’ and ‘dog’, while in CIFAR-100 (Fig. 5b) ‘household furniture’ and ‘household electrical devices’ are two categories that may be mis-clustered. Our work tries to tackle this challenge from a theoretical perspective by introducing cluster-level constraints into the framework. As the ACC/NMI/ARI values increase, the occurrences of mis-clustered cases diminish. Completely eliminating mis-clustering would require 100% clustering accuracy. Although some instances of this phenomenon are still observable in our evaluations, it is mitigated, as the ACC/NMI/ARI values indicate. The blurriness of the images in CIFAR-10/100 may contribute to the remaining mis-clustering, as the network mainly focuses on semantic features and might ignore fine-grained differences.
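As a small illustration of how such a confusion matrix and its most-confused class pair can be computed, consider the NumPy sketch below. It assumes the predicted cluster indices have already been matched to ground-truth class indices; the class names, labels, and helper names are toy values chosen for illustration, not data from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i assigned to cluster j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def most_confused_pair(cm):
    """Return the pair of classes with the most mutual confusions."""
    off = cm.copy()
    np.fill_diagonal(off, 0)          # ignore correct assignments
    sym = off + off.T                 # count confusions in both directions
    i, j = np.unravel_index(sym.argmax(), sym.shape)
    return int(min(i, j)), int(max(i, j))

# Toy example (hypothetical labels, already matched to class indices).
classes = ["bird", "cat", "dog"]
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])

cm = confusion_matrix(y_true, y_pred, len(classes))
acc = np.trace(cm) / cm.sum()         # fraction on the diagonal
pair = most_confused_pair(cm)
confused_names = (classes[pair[0]], classes[pair[1]])
```

In this toy case the diagonal dominates, and the largest off-diagonal mass lies between ‘cat’ and ‘dog’, mirroring the kind of pattern described for CIFAR-10 and STL-10.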
4.6.2 Confident Images
To illustrate the results more directly, we show the three most confident instances per class on CIFAR-10 and STL-10 in Fig. 6. Because the performance on CIFAR-100 is comparatively weak, we do not report its confident instances. Each column corresponds to one class, and the top-3 confident instances are presented in three rows. For CIFAR-10 (Fig. 6a), the three most confident instances clearly represent each cluster. However, for STL-10 (Fig. 6b), a ‘cat’ instance is misclassified as a ‘dog’ in the last row of the ‘dog’ column. In the ‘deer’ column, the second-row instance, which actually belongs to ‘cat’, has a body shape that roughly resembles a deer. This phenomenon further indicates that our method lacks the ability to distinguish fine-grained details between images.
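Selecting the most confident instances per cluster can be sketched as follows. The sketch assumes that confidence is the maximum softmax probability of an instance and that an instance belongs to the cluster with that maximum; the function name and the toy probabilities are hypothetical.

```python
import numpy as np

def top_k_confident(probs, k=3):
    """For each cluster, return indices of its k most confident members.

    probs: array of shape (n_samples, n_clusters) of soft assignments.
    Confidence is taken as the max probability (an assumption here).
    """
    labels = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    result = {}
    for c in range(probs.shape[1]):
        members = np.flatnonzero(labels == c)
        # Sort this cluster's members by descending confidence.
        order = members[np.argsort(-conf[members], kind="stable")]
        result[c] = order[:k].tolist()
    return result

# Toy soft assignments for 6 samples and 2 clusters.
probs = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.99, 0.01],
    [0.75, 0.25],
    [0.20, 0.80],
    [0.45, 0.55],
])

top = top_k_confident(probs, k=3)
```

Note that a high confidence score does not guarantee a correct assignment, which is why a confidently mis-clustered ‘cat’ can still appear among the top instances of another class.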
5 Conclusion
We have proposed a semantic spectral clustering with contrastive learning and neighbor mining (SSCN) framework, which performs instance-level pulling and cluster-level pushing cooperatively. Unlike previous methods, our proposed method fuses a contrastive strategy, neighbor mining, and the spectral constraint in a unified framework. Experimental results on three real-world datasets show the effectiveness of our proposed method. Additionally, we conduct ablation studies on the number of clustering heads and on the effectiveness of the individual loss functions.
Data availability
All datasets used in this paper are publicly available.
References
An L, Gao X, Li X, Tao D, Deng C, Li J (2012) Robust reversible watermarking via clustering and enhanced pixel-wise masking. IEEE Trans Image Process 21(8):3598–3611. https://doi.org/10.1109/TIP.2012.2191564
Min R, Garnier C, Septier F, Klein J (2022) State space partitioning based on constrained spectral clustering for block particle filtering. Signal Process 201:108727. https://doi.org/10.1016/j.sigpro.2022.108727
Alshammari MA, Takatsuka M (2019) Approximate spectral clustering with eigenvector selection and self-tuned k. Pattern Recognit Lett 122:31–37. https://doi.org/10.1016/j.patrec.2019.02.006
MacQueen J (1965) Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium on Math., Stat., and Prob, p 281
Cai D, He X, Wang X, Bao H, Han J (2009) Locality preserving nonnegative matrix factorization. In: Twenty-first international joint conference on artificial intelligence
Zhao Y, Li X (2023) Spectral clustering with adaptive neighbors for deep learning. IEEE Trans Neural Netw Learn Syst 34(4):2068–2078. https://doi.org/10.1109/TNNLS.2021.3105822
Duan L, Ma S, Aggarwal C, Sathe S (2021) Improving spectral clustering with deep embedding, cluster estimation and metric learning. Knowl Inf Syst 63:675–694
Duan L, Aggarwal C, Ma S, Sathe S (2019) Improving spectral clustering with deep embedding and cluster estimation. In: 2019 IEEE International conference on data mining (ICDM). IEEE, pp 170–179
Li F, Qiao H, Zhang B (2018) Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recogn 83:161–173
Yang X, Deng C, Zheng F, Yan J, Liu W (2019) Deep spectral clustering using dual autoencoder network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4066–4075
Ji X, Henriques JF, Vedaldi A (2019) Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9865–9874
Hu W, Miyato T, Tokui S, Matsumoto E, Sugiyama M (2017) Learning discrete representations via information maximizing self-augmented training. In: International conference on machine learning. PMLR, pp 1558–1567
Ye X, Zhao J, Chen Y, Guo L (2020) Bayesian adversarial spectral clustering with unknown cluster number. IEEE Trans Image Process 29:8506–8518. https://doi.org/10.1109/TIP.2020.3016491
Zhang F, Zhao J, Ye X, Chen H (2022) One-step adaptive spectral clustering networks. IEEE Signal Process Lett 29:2263–2267. https://doi.org/10.1109/LSP.2022.3217441
Ye X, Wang C, Imakura A, Sakurai T (2021) Spectral clustering joint deep embedding learning by autoencoder. In: 2021 International joint conference on neural networks (IJCNN). IEEE, pp 1–7
Ng A, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, vol 14
Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416
Yang Y, Shen F, Huang Z, Shen HT, Li X (2017) Discrete nonnegative spectral clustering. IEEE Trans Knowl Data Eng 29(9):1834–1845. https://doi.org/10.1109/TKDE.2017.2701825
Huang J, Nie F, Huang H (2013) Spectral rotation versus k-means in spectral clustering. In: desJardins M, Littman ML (eds) Proceedings of the twenty-seventh AAAI conference on artificial intelligence, July 14–18, 2013, Bellevue, Washington, USA. AAAI Press
Zhan K, Nie F, Wang J, Yang Y (2019) Multiview consensus graph clustering. IEEE Trans Image Process 28(3):1261–1270. https://doi.org/10.1109/TIP.2018.2877335
Li X, Hu W, Shen C, Dick AR, Zhang ZM (2014) Context-aware hypergraph construction for robust spectral clustering. IEEE Trans Knowl Data Eng 26(10):2588–2597. https://doi.org/10.1109/TKDE.2013.126
Nie F, Chang W, Wang R, Li X (2021) Learning an optimal bipartite graph for subspace clustering via constrained Laplacian rank. IEEE Trans Cybern
Fan J, Tu Y, Zhang Z, Zhao M, Zhang H (2022) A simple approach to automated spectral clustering. In: NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/407fb8c5f3fda374c57d1bb18313ea5d-Abstract-Conference.html
Ma X, Zhang S, Pena-Pena K, Arce GR (2021) Fast spectral clustering method based on graph similarity matrix completion. Signal Process 189:108301. https://doi.org/10.1016/j.sigpro.2021.108301
Shaham U, Stanton KP, Li H, Basri R, Nadler B, Kluger Y (2018) Spectralnet: spectral clustering using deep neural networks. In: 6th International conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, conference track proceedings
Huang S, Ota K, Dong M, Li F (2019) Multispectralnet: spectral clustering using deep neural network for multi-view data. IEEE Trans Comput Soc Syst 6(4):749–760
Yang L, Cheung N-M, Li J, Fang J (2019) Deep clustering by gaussian mixture variational autoencoders with graph embedding. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6440–6449
Zhang J, Li C-G, You C, Qi X, Zhang H, Guo J, Lin Z (2019) Self-supervised convolutional subspace clustering network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5473–5482
Affeldt S, Labiod L, Nadif M (2020) Spectral clustering via ensemble deep autoencoder learning (SC-EDAE). Pattern Recogn 108:107522
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR, pp 1597–1607
He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738
Tan Z, Zhang Y, Yang J, Yuan Y (2023) Contrastive learning is spectral clustering on similarity graph. CoRR arXiv:2303.15103
Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Li Y, Hu P, Liu Z, Peng D, Zhou JT, Peng X (2021) Contrastive clustering. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 8547–8555
Franti P, Virmajoki O, Hautamaki V (2006) Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans Pattern Anal Mach Intell 28(11):1875–1881
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A, Bottou L (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(12)
Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy layer-wise training of deep networks. In: Advances in neural information processing systems, vol 19
Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: Bengio Y, LeCun Y (eds) 4th International conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, conference track proceedings. arXiv:1511.06434
Zeiler MD, Krishnan D, Taylor GW, Fergus R (2010) Deconvolutional networks. In: 2010 IEEE Computer society conference on computer vision and pattern recognition. IEEE, pp 2528–2535
Yang J, Parikh D, Batra D (2016) Joint unsupervised learning of deep representations and image clusters. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5147–5156
Xie J, Girshick RB, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: Balcan M, Weinberger KQ (eds) Proceedings of the 33rd international conference on machine learning, ICML 2016, New York City, NY, USA, June 19–24. JMLR Workshop and conference proceedings, vol 48, pp 478–487
Chang J, Wang L, Meng G, Xiang S, Pan C (2017) Deep adaptive image clustering. In: Proceedings of the IEEE international conference on computer vision, pp 5879–5887
Haeusser P, Plapp J, Golkov V, Aljalbout E, Cremers D (2019) Associative deep clustering: training a classification network with no labels. In: Pattern recognition: 40th German conference, GCPR 2018, Stuttgart, Germany, October 9–12, 2018, proceedings, vol 40. Springer, Berlin, pp 18–32
Chang J, Guo Y, Wang L, Meng G, Xiang S, Pan C (2019) Deep discriminative clustering analysis. arXiv preprint arXiv:1905.01681
Wu J, Long K, Wang F, Qian C, Li C, Lin Z, Zha H (2019) Deep comprehensive correlation mining for image clustering. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8150–8159
Huang J, Gong S, Zhu X (2020) Deep semantic clustering by partition confidence maximisation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8849–8858
Tao Y, Takagi K, Nakata K (2021) Clustering-friendly representation learning via instance discrimination and feature decorrelation. In: 9th International conference on learning representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021
Tsai TW, Li C, Zhu J (2021) Mice: mixture of contrastive experts for unsupervised image clustering. In: International conference on learning representations
Van Gansbeke W, Vandenhende S, Georgoulis S, Proesmans M, Van Gool L (2020) Scan: learning to classify images without labels. In: Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part X. Springer, Berlin, pp 268–285
Johnson J, Douze M, Jégou H (2019) Billion-scale similarity search with GPUs. IEEE Trans Big Data 7(3):535–547
Cubuk ED, Zoph B, Shlens J, Le QV (2020) Randaugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 702–703
Funding
This work is supported by the National Natural Science Foundation of China under Grants 62006131 and 62071260, and by the National Natural Science Foundation of Zhejiang Province under Grants LQ21F020009 and LQ22F020020.
Author information
Contributions
Conceptualization, N.W., X.Y., J.Z. and Q.W.; methodology, N.W., X.Y. and J.Z.; writing—original draft preparation, N.W.; writing—review and editing, N.W. and X.Y.; supervision, X.Y. and J.Z. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate
Informed consent to participate was obtained from all individual participants included in the study.
Consent for publication
Informed consent for publication was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, N., Ye, X., Zhao, J. et al. Semantic Spectral Clustering with Contrastive Learning and Neighbor Mining. Neural Process Lett 56, 141 (2024). https://doi.org/10.1007/s11063-024-11597-x
DOI: https://doi.org/10.1007/s11063-024-11597-x