1 Introduction

As one of the most essential unsupervised learning techniques, spectral clustering (SC) finds extensive applications in machine learning and pattern recognition [1,2,3]. Although conventional shallow spectral clustering methods [4,5,6] have an elegant theoretical foundation and achieve promising clustering accuracy, they suffer from problems of scalability and generalization. To overcome these issues, two-stage and end-to-end one-stage deep spectral clustering methods have been proposed. In the two-stage framework [7,8,9,10], deep SC methods first learn the spectral embedding with a deep network and then obtain the cluster indicator with a conventional clustering algorithm. Specifically, Yang and Li et al. utilize a dual autoencoder network or fully convolutional autoencoders in the initial stage for feature learning, and then cluster the features or cluster centers by soft k-means scores [9, 10]. Duan et al. generate the deep embedding by learning a deep autoencoder, estimate the cluster number with a softmax autoencoder, and combine metric learning to build a more powerful similarity graph [7, 8]. In the end-to-end framework, deep SC methods couple cluster indicator learning with spectral embedding learning so that the two improve each other iteratively in a unified deep algorithm [11,12,13,14,15]. For example, Ji and Hu et al. propose to maximize the mutual information between an instance and its data augmentations [11, 12], and our previously proposed methods integrate cluster indicator learning with generative adversarial feature learning via a Bayesian framework [13, 14]. By exploiting these deep frameworks, spectral clustering attains scalability and gains access to out-of-sample unseen data points; nevertheless, certain limitations still exist: (1) due to the overemphasized importance of low-level features, autoencoder-based methods still lack a semantically discriminative feature space; (2) in the end-to-end framework, network training is very sensitive to initialization, which leads to biased clustering results; (3) previous approaches only leverage instance-level similarity without considering cluster-wise semantic differences, which help to enhance the disparity among classes.

To handle the above shortcomings, we integrate contrastive learning and spectral clustering into one deep learning framework. To begin with, we pre-train an unsupervised representation learning model to extract semantic feature embeddings. Then, we search the nearest neighbor instances both globally and locally according to the similarity between these semantically meaningful features. Specifically, we fuse data augmentation and neighbor searching together in our loss. Moreover, besides intra-class instance pulling, our work also adopts an inter-class pushing strategy. We apply our Instance and Cluster level Nearest Neighbor Comparing (ICNNC) loss to optimize spectral clustering results while enforcing orthogonality. The main contributions of our work are listed as follows:

  • We propose a deep spectral clustering framework by fusing the contrastive learning pattern and neighbor relation mining into spectral clustering, revealing that the core goals of the contrastive loss, neighbor utilization, and the spectral clustering loss are consistent.

  • Our proposed model considers not only intra-class cohesion (instance-level) but also inter-class separation (cluster-level), pulling instances of the same class closer while pushing different clusters apart.

  • We provide a novel contrastive strategy built-in spectral clustering pattern.

  • Augmentation information and neighbors collaboratively contribute to our contrastive strategy.

2 Related Works

Spectral Clustering (SC) [16, 17] has a solid theoretical foundation. Constructing a good similarity graph is crucial for SC, and spectral clustering methods can usually be categorized into two distinct frameworks: traditional methods and deep methods. Traditional methods often rely on mathematical approaches [18,19,20,21], such as constructing similarity graphs from simple Euclidean distances or imposing low-rank constraints [22] on graphs, which bring limited improvements. Fan et al. [23] focus on finding reliable affinity matrices through a varied set of affinity-matrix-construction methods. Yet it is still hard for traditional methods to outperform deep methods because of the high cost of constructing the affinity matrix and performing eigendecomposition on large-scale real-world datasets. Some work [24] utilizes a matrix completion algorithm that rapidly calculates the similarity matrix to improve computational efficiency. Since spectral clustering does not explicitly compute the mapping function, deep spectral clustering learns a non-linear embedding function using deep neural networks rather than linear eigenvalue decomposition. SpectralNet [25] adopts a siamese network to construct the similarity matrix, so its performance heavily relies on the quality of the similarity graph. To address this issue, Yang et al. [10] use the embedding-layer features of an autoencoder as the input to SpectralNet, which enables joint optimization because the learning of autoencoder embedding features and of spectral embedding features is integrated. Huang et al. [26] extend SpectralNet to multi-view scenarios, where each view is processed by an individual SpectralNet and the total loss function is obtained by weighting and summing the individual SpectralNet losses, achieving joint training of multiple views. Yang et al. [27] extend the spectral embedding approach by minimizing the posterior probability distribution between instances rather than the Euclidean distance in the embedding space, and then constrain the hidden variables of the variational autoencoder with a Gaussian mixture model. Zhang et al. [28] use spectral clustering to obtain preliminary clustering results, which are then treated as pseudo-labels to train a neural network in a supervised manner. In addition to directly using the spectral clustering loss to train neural networks, some work [8, 29] uses spectral clustering as a post-processing method. Duan et al. [8] first use an autoencoder to learn the embedding features of the instances, and then directly use spectral clustering to map the learned embedding features onto the spectral embedding space. Affeldt et al. [29] use multiple autoencoders to learn embedding features separately, merge the multiple sets of embedding features to calculate the adjacency matrix, and finally perform spectral clustering on the merged adjacency matrix to obtain clustering results.

Most deep spectral clustering methods utilize autoencoders [8, 10, 29] to obtain feature embeddings, which limits the quality of the features. Thus, we choose a contrastive framework to extract meaningful semantic embeddings for deep spectral clustering.

While spectral clustering has undergone a diverse range of developments that have yielded impressive experimental results, it is still limited to the instance level. From the perspective of the ultimate goal of clustering (sufficient intra-class cohesion and sufficient inter-class separation), the above spectral clustering methods lack consideration of cluster-level disparity.

Contrastive Learning has emerged as one of the most efficient unsupervised learning paradigms and has witnessed significant progress in representation learning. Its central principle is to find an embedding space in which the similarity between positive pairs is maximized and the similarity between negative pairs is minimized. A breakthrough in this field is SimCLR [30], which leverages instance discrimination as its pretext task. It generates two views of each instance through a diverse range of data augmentations, then maximizes the similarity between the two augmentations of the same instance while minimizing the similarities with views from other instances. Besides, MoCo [31] builds a dynamic, large-scale dictionary of negative examples by maintaining a queue of past examples; this dictionary provides negative pairs for contrastive loss optimization, and a moving-averaged encoder ensures dictionary consistency.

Rarely has research explored the integration of contrastive paradigms into spectral clustering. Due to the strong semantic feature learning ability of contrastive learning frameworks, which enable the learned features to have high discriminability, we adopt a contrastive learning framework as our pre-training model. Getting a semantic feature embedding in deep spectral clustering is as important as constructing a good similarity graph in conventional spectral clustering.

Moreover, Tan et al. [32] provide a theoretical insight that the original InfoNCE loss [33] in contrastive learning is equivalent to spectral clustering on the similarity graph. They use a Markov random field (MRF) method to convert a graph into a distribution of subgraphs and, instead of directly comparing two graphs, use cross entropy to compare the adjacency matrices of the subgraphs. By doing so, the cross-entropy loss acts as a bridge connecting the InfoNCE loss and spectral clustering. Therefore, we adopt the InfoNCE loss in the clustering phase to imitate the process of spectral clustering.

Fig. 1 a Simple illustration of the occurrence of similar instances across classes. b In real-world images, strong instance similarity can be observed within and across classes. As demonstrated, a slim deer can resemble a Whippet, and a regular deer with its head down eating grass can be similar to a caracal, which belongs to the cat family. Solely relying on instance-level pulling without cluster-level pushing can exacerbate the challenge of separating different classes

Fig. 2 Illustration of the framework of our proposed Semantic Spectral Clustering with Contrastive Learning and Neighbor Mining (SSCN) method. The input data and its augmentation data are fed into the unsupervised learning model's backbone to extract semantic features. Based on the semantic embeddings, neighbors are searched through the dataset locally and globally. The ICNNC loss is applied from both instance-wise and cluster-wise perspectives

3 Our Proposed Method

A common scenario is that strong instance similarity can appear not only within a class but also across classes. Traditional spectral clustering relies solely on instance similarity, which may pull together falsely similar pairs. For example, two instances (a white dog and a white slim deer) located at the edges of their corresponding classes look alike and have a high similarity value, as shown in Fig. 1. Concentrating solely on the instance level would blur the boundary between the two classes and ultimately result in clustering degeneration. Therefore, it is essential to incorporate cluster-level pushing to make the class boundaries clear.

We introduce our proposed method in the following sections. Initially, we train the unsupervised feature embedding learning model with a contrastive loss. Subsequently, utilizing the trained model, we retrieve nearest neighbors globally and locally based on the similarity between the extracted semantically meaningful features. Afterwards, we employ Instance and Cluster-level Nearest Neighbor Comparing (ICNNC) loss to optimize spectral clustering results while enforcing orthogonality. The framework is briefly outlined in Fig. 2, and the key notations and descriptions are summarized in Table 1.

To sum up, we employ contrastive learning as our pre-trained model for feature extraction due to its strong ability in generating semantically meaningful embeddings. Additionally, we employ the InfoNCE loss function during the clustering phase based on the theoretical equivalence between contrastive learning and spectral clustering [32]. The incorporation of neighbors is aimed at enhancing intra-class cohesion. The integration of augmentation and neighbor information is driven by the idea that augmentation strengthens a feature’s representation by introducing additional, self-transformed information, while neighbors contribute to the aggregation of local context information.

Table 1 Main notations and descriptions

SpectralNet [25] approximates spectral clustering by training a mapping function that embeds input data into the eigenspace of the corresponding graph Laplacian matrix and then clusters them. In our method, we enhance the embedding phase as the first step, and the clustering network performs constrained optimization in which output orthogonality is enforced by a final linear layer whose weights are determined through QR decomposition.

3.1 Unsupervised Feature Learning Model

Contrastive learning has emerged as an outstanding framework for unsupervised learning, exhibiting remarkable performance in representation learning. To overcome the limitation of end-to-end clustering methods, which are often sensitive to network initialization, we first train an unsupervised feature embedding learning model and then cluster over the resulting semantically meaningful features. A contrastive learning framework (i.e., SimCLR) is adopted to fully pre-train the network. Specifically, given a data instance \(x_i\) and a random data augmentation \(T^{\prime }\), we obtain the positive pair \(x_i\) and \(x_i^{\prime }=T^{\prime }\left( x_i\right) \) and the respective features \(h_i\) and \(h_i^{\prime }\). For the SimCLR framework, the loss function is as follows:

$$\begin{aligned} \mathcal {L}_{ {{\textit{simclr}} }}=-\log \frac{\exp \left( {{\textit{sim}}}\left( h_i, h_i^{\prime }\right) / \tau \right) }{\sum _{j=1}^{2 B} \textbf{1}_{j \ne i} \exp \left( {{\textit{sim}}}\left( h_i, h_j^{\prime }\right) / \tau \right) } \end{aligned}$$
(1)

B denotes the batch size, \({\textit{sim}}()\) is the similarity function (i.e., cosine similarity), and \(\tau \) denotes the temperature parameter. \(\textbf{1}_{j \ne i}\) is an indicator function that equals 1 when \({j \ne i}\). In this loss, we have one positive pair and treat the remaining \((2B-1)\) augmented sample pairs within a batch as negative pairs. After training the pre-trained model, we extract features directly from the backbone and mine the top-K nearest neighbors based on these features.
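As a concrete illustration of Eq. (1), the following PyTorch sketch computes the NT-Xent loss over a batch of backbone features and their augmented views; the function name and the unit-normalization step are our assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def simclr_loss(h, h_aug, tau=0.5):
    """Minimal NT-Xent sketch of Eq. (1); h and h_aug are (B, D) features
    of a batch and of its augmented views from the shared backbone."""
    z = F.normalize(torch.cat([h, h_aug], dim=0), dim=1)   # (2B, D), cosine-ready
    sim = z @ z.t() / tau                                  # pairwise similarities
    sim.fill_diagonal_(float('-inf'))                      # the 1_{j != i} indicator: drop self-pairs
    B = h.size(0)
    # the positive of sample i is its other view, located B positions away
    pos = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(sim.device)
    return F.cross_entropy(sim, pos)                       # -log softmax of the positive logit
```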

In addition, the contrastive loss strategy is not only applied to the pre-trained unsupervised feature extraction model, but also extended to the subsequent stage of cluster-level neighbor pushing.

3.2 Clustering While Matching Nearest Neighbors

After the nearest neighbors are searched, those meaningful semantic features can, from another perspective, be regarded as supervised information and serve as positive instances for the original instances. In general, we aim to leverage the nearest neighbor information to its fullest extent.

Based on the insight that the rows of the feature matrix can function as instances' soft labels [34] and the columns can be considered cluster representations, we jointly optimize the instance-level and cluster-level losses. Thus, both goals (pulling intra-class instances together and pushing different classes apart) are achieved in one framework.

3.2.1 Instance-Level

We utilize the weights of the trained feature extraction model \(\varPhi _{{\textit{pre}}}\), which is exactly the backbone of the contrastive model. Then, in the clustering network \(\varPhi _{{\textit{clu}}}\), we feed the instance pairs into the network. By applying the Instance and Cluster level Nearest Neighbor Comparing (ICNNC) loss and the orthogonal constraint, we construct the spectral clustering network.

In the InfoNCE [33] loss, there is only one definite positive pair, and the remaining samples in a batch are all treated as negative samples. However, this setup introduces a potential issue: instances of the same category may appear in the same batch and be used as negative samples, creating false negative pairs and degrading performance. A straightforward and effective way to pull instances of the same class closer is therefore to simply maximize the similarity between instances and their neighbors:

$$\begin{aligned} \mathcal {L}_{{\textit{instan}}}=-\log \langle p, \mathcal {N}_{{\textit{pre}} }(p)\rangle , \end{aligned}$$
(2)

The clustering assignments \(p \in \mathbb {R}^{B \times C}\), where C is the number of classes and B is the batch size; p and \(\mathcal {N}_{{\textit{pre}} }(p)\) are the clustering assignments of the instance's feature h and of its neighbors' features \(\mathcal {N}_{{\textit{pre}} }(h)\). All neighbors are searched on the basis of the feature representations, and the features are subsequently fed into the clustering network. \(\mathcal {N}_{{\textit{pre}} }\) denotes that the neighbors are searched from the pre-trained unsupervised representation model \(\varPhi _{{\textit{pre}}}\). The \(\langle \cdot , \cdot \rangle \) operator denotes a dot product used to evaluate the similarity between two items.
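A minimal sketch of Eq. (2) is given below, assuming the negative log dot product is averaged over the batch and stabilized with a small epsilon (both our assumptions); `p` and `p_neighbor` are the (B, C) assignment matrices described above.

```python
import torch

def instance_pull_loss(p, p_neighbor, eps=1e-8):
    """Sketch of Eq. (2): maximize <p_i, N_pre(p)_i> for every instance i."""
    dot = (p * p_neighbor).sum(dim=1)        # per-instance dot product <p, N(p)>
    return -torch.log(dot + eps).mean()      # averaged over the batch
```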

Local and Global Nearest Instance Pulling Assume that \(h \in \mathbb {R}^{B \times D}\) is the embedding feature (prior to the clustering network) of p, and D is the feature dimension. Locally, we select the nearest neighbors \(\mathcal {N}_{{\textit{local}}}(h)\) per batch from the features h. By tracking the batch index of \(\mathcal {N}_{{\textit{local}}}(h)\), we obtain the corresponding clustering assignments \(\mathcal {N}_{{\textit{local}}}(p)\). Therefore, the loss in Eq. (2) can be expressed as follows. It is worth noting that \(\mathcal {N}_{{\textit{eighbors}}}(p)\) can be substituted with \(\mathcal {N}_{{\textit{local}}}(p)\) batch-wise or \(\mathcal {N}_{{\textit{global}}}(p)\) epoch-wise, as described later:

$$\begin{aligned} \mathcal {L}_{{\textit{local}}}=-\log \langle p, \mathcal {N}_{{\textit{eighbors}}}(p)\rangle . \end{aligned}$$
(3)

We leverage the aforementioned neighbors \(\mathcal {N}_{{\textit{pre}} }(p)\), which are searched from the pre-trained contrastive model, and concatenate them along the batch dimension. Furthermore, we take the augmentation version \({\textit{Aug}}(p)\) into consideration.

Note that our augmentation data is obtained by applying four operations randomly selected from the following: horizontal shear, vertical shear, horizontal translation, vertical translation, rotation, auto-contrast, color inversion, solarization, posterization, brightness adjustment, sharpness adjustment, and histogram equalization. We can then extend our local instance loss as follows:

$$\begin{aligned} \mathcal {L}_{{\textit{instan}}_{{\textit{local}}}}&=-\log \langle I_{{\textit{orign}}},I_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}\rangle \nonumber \\ I_{{\textit{orign}}}&=\left[ \begin{array}{c} p\\ {\textit{Aug}}(p) \end{array}\right] , I_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}=\left[ \begin{array}{c} \mathcal {N}_{ {{\textit{local}} }}(p) \\ \mathcal {N}_{{{\textit{pre}} }}(p) \end{array}\right] \end{aligned}$$
(4)

Here, \(\left[ \begin{array}{c} p\\ {\textit{Aug}}(p) \end{array}\right] \in \mathbb {R}^{2B \times C}\). Similarly, we can mine nearest neighbors globally to reduce intra-class distances. At the end of each epoch, we extract the most recent features of the whole training set, \(f \in \mathbb {R}^{N \times D}\), and the corresponding clustering assignments \(u \in \mathbb {R}^{N \times C}\), where N denotes the size of the training data. Globally, we search the nearest neighbors from f, and \(\mathcal {N}_{{\textit{global}}}(u)\) indicates their predictions in each epoch. We then obtain the predictions \(\mathcal {N}_{{\textit{global}}}(u)_{b } \in \mathbb {R}^{B \times C}\) by tracking the batch index. When the searched \(\mathcal {N}_{{\textit{global}}}(u)\) is located in a specific batch, the global instance loss can be extended as follows:

$$\begin{aligned} \mathcal {L}_{{\textit{instan}}_{{\textit{global}}}}&=-\log \langle I_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}},I_{{\textit{neigh}}_{{\textit{global}}}}\rangle ,\nonumber \\ I_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}&=\left[ \begin{array}{c} p\\ {\textit{Aug}}(p)\\ \mathcal {N}_{ {{\textit{local}} }}(p)\\ \mathcal {N}_{pre}(p) \end{array}\right] , \end{aligned}$$
(5)

where \(I_{{\textit{neigh}}_{{\textit{global}}}} \in \mathbb {R}^{4B \times C}\) is four copies of \(\mathcal {N}_{ {{\textit{global}}}}(u)_{b}\) concatenated together to remain consistent with \(I_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}\). Hence, we integrate the semantic nearest neighbors locally and globally as well as the augmentation information.
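The two instance-level terms in Eqs. (4) and (5) amount to concatenating the assignment matrices and reusing the dot-product objective of Eq. (2). The sketch below follows that reading; the helper name, the batch averaging, and the use of `repeat` to form the four global copies are our assumptions.

```python
import torch

def instance_level_losses(p, p_aug, p_local, p_pre, p_global_b, eps=1e-8):
    """Sketch of Eqs. (4)-(5); all inputs are (B, C) cluster assignments:
    p, Aug(p), N_local(p), N_pre(p), and N_global(u)_b located by batch index."""
    I_orign = torch.cat([p, p_aug], dim=0)                        # (2B, C)
    I_neigh_local_pre = torch.cat([p_local, p_pre], dim=0)        # (2B, C)
    loss_local = -torch.log((I_orign * I_neigh_local_pre).sum(1) + eps).mean()

    I_neigh_local_four = torch.cat([p, p_aug, p_local, p_pre], dim=0)   # (4B, C)
    I_neigh_global = p_global_b.repeat(4, 1)                            # (4B, C): four copies
    loss_global = -torch.log((I_neigh_local_four * I_neigh_global).sum(1) + eps).mean()
    return loss_local, loss_global
```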

3.2.2 Cluster-Level

When the features are projected into a space with dimensionality equal to the number of clusters, the i-th column of the feature matrix represents the likelihood of each instance belonging to the i-th cluster. From a clustering perspective, we intuitively try to push each cluster (each column) away from the others. Our cluster-wise contrastive loss is as below:

$$\begin{aligned} \mathcal {L}_{ {{\textit{class}}}}=-\log \frac{\exp \left( {{\textit{sim}}}\left( q_i, \mathcal {N}_{{{\textit{pre}} }}(q)_i\right) / \tau \right) }{\sum _{j=1}^{C} \textbf{1}_{j \ne i} \exp \left( {{\textit{sim}}}\left( q_i, \mathcal {N}_{{{\textit{pre}} }}(q)_j\right) / \tau \right) } \end{aligned}$$
(6)

\(q \in \mathbb {R}^{C \times B}\) is the transpose of p, and \(\mathcal {N}_{{{\textit{pre}} }}(q) \in \mathbb {R}^{C \times B}\); specifically, \({q}_{i}\) and \(\mathcal {N}_{{{\textit{pre}} }}(q)_{i} \in \mathbb {R}^{B}\). In contrast to the original form of the contrastive loss, we replace the original positives with the nearest neighbors, thereby incorporating more relevant and meaningful instances. By leveraging both local and global neighbors, we effectively promote the separation of different classes.
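The cluster-level loss of Eq. (6) treats each column of the assignment matrix as a cluster representation and contrasts it against the matching column of the neighbor view. A sketch is given below; the explicit exclusion of the i = j term in the denominator mirrors the indicator in Eq. (6), while the function name and cosine normalization are our assumptions.

```python
import torch
import torch.nn.functional as F

def cluster_push_loss(p, p_neighbor, tau=1.0):
    """Sketch of Eq. (6): q = p^T and N_pre(q) = p_neighbor^T are (C, B)."""
    q = F.normalize(p.t(), dim=1)                 # (C, B) cluster representations
    q_nb = F.normalize(p_neighbor.t(), dim=1)
    logits = (q @ q_nb.t()) / tau                 # (C, C) cosine similarities
    exp_logits = torch.exp(logits)
    pos = exp_logits.diag()                       # matching cluster column: positive pair
    neg = exp_logits.sum(dim=1) - pos             # 1_{j != i}: all other clusters are negatives
    return -torch.log(pos / neg).mean()
```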

Fig. 3 Illustration of local cluster-wise loss

Local and Global Cluster Pushing Analogously, we have \(\mathcal {N}_{{{\textit{pre}} }}(q) \in \mathbb {R}^{C \times B}\) from the pre-trained contrastive model and \(\mathcal {N}_{{{\textit{local}} }}(q) \in \mathbb {R}^{C \times B}\) selected per batch, where each row of q is a column of the original p. Therefore, the local-level class contrastive loss (see Fig. 3) is:

$$\begin{aligned}&\mathcal {L}_{{\textit{class}}_{{\textit{local}}}}= -\log \frac{\exp ({{\textit{sim}}}([C_{{\textit{orign}}}]_i,\left[ C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}\right] _i) / \tau )}{\sum \limits _{j=1}^C \textbf{1}_{j \ne i} \exp ({{\textit{sim}}}([C_{{\textit{orign}}}]_i,\left[ C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}\right] _j) / \tau )},\nonumber \\&{\left\{ \begin{array}{ll} \text {positive:} ([C_{{{\textit{orign}}}}]_i,[C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}]_j), i=j \\ \text {negative:} ([C_{{{\textit{orign}}}}]_i,[C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}]_j), i \ne j \end{array}\right. }\nonumber \\&C_{{\textit{orign}}}=\left[ \begin{array}{c} q \\ {\textit{Aug}}(q) \end{array}\right] , C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{pre}}}}=\left[ \begin{array}{c} \mathcal {N}_{{{\textit{local}} }}(q) \\ \mathcal {N}_{{{\textit{pre}} }}(q) \end{array}\right] \end{aligned}$$
(7)
Fig. 4 Illustration of global cluster-wise loss

When the global-level nearest neighbors are searched and located by the batch index, we obtain \(\mathcal {N}_{ {{\textit{global}}}}(u)_{b}\in \mathbb {R}^{B \times C}\), where \(u \in \mathbb {R}^{N \times C}\), and correspondingly \(\mathcal {N}_{ {{\textit{global}}}}(v)_{b} \in \mathbb {R}^{C \times B}\), where \(v \in \mathbb {R}^{C \times N}\) is the transpose of u. Therefore, the global-level class contrastive loss (see Fig. 4) is:

$$\begin{aligned}&\mathcal {L}_{{\textit{class}}_{{\textit{global}}}}= -\log \frac{\exp ( {{\textit{sim}}}(\left[ C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}\right] _i,\left[ C_{{\textit{neigh}}_{{\textit{global}}}}\right] _i) / \tau )}{\sum \limits _{j=1}^C \textbf{1}_{j \ne i} \exp ({{\textit{sim}}}(\left[ C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}\right] _i,\left[ C_{{\textit{neigh}}_{{\textit{global}}}}\right] _j) / \tau )},\nonumber \\&{\left\{ \begin{array}{ll} \text {positive:} ([C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}]_i,[C_{{\textit{neigh}}_{{\textit{global}}}}]_j), i=j \\ \text {negative:} ([C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}]_i,[C_{{\textit{neigh}}_{{\textit{global}}}}]_j), i \ne j \end{array}\right. }\nonumber \\&C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}=\left[ \begin{array}{c} q \\ {\textit{Aug}}(q)\\ \mathcal {N}_{{{\textit{local}}}}(q) \\ \mathcal {N}_{{{\textit{pre}}}}(q) \end{array}\right] \end{aligned}$$
(8)

where \(C_{{\textit{neigh}}_{{\textit{global}}}} \in \mathbb {R}^{C \times 4B}\) is four copies of \(\mathcal {N}_{ {{\textit{global}}}}(v)_{b}\) concatenated together to remain consistent with \(C_{{\textit{neigh}}_{{\textit{local}}\_{\textit{four}}}}\). Furthermore, by doing so we avoid the false-negative-pair problem and the resulting performance degeneration, since no negative pair contains two identical clusters. Each column of p represents a different cluster, and pushing apart these different classes is exactly our target.

Besides, we utilize an entropy term to avoid assigning all samples to a single cluster. This term promotes a more uniform distribution of predictions across the C clusters, where \(M(p) \in \mathbb {R}^{1 \times C}\) denotes the mean of p over the batch dimension:

$$\begin{aligned} \mathcal {L}_{{\textit{entropy}}}=-\sum _{c=1}^{C} M(p)_c\log {M}(p)_c. \end{aligned}$$
(9)

Therefore, we fully use the semantic nearest neighbors locally and globally at both instance level and cluster level. Consequently, we seek to minimize the overall loss as follows:

$$\begin{aligned} \mathcal {L}_{{\textit{total}} }&=\mathcal {L}_{{\textit{instan}}_{{\textit{local}}}}+\mathcal {L}_{{\textit{instan}}_{{\textit{global}}}} \nonumber \\&\quad +\mathcal {L}_{{\textit{class}}_{{\textit{local}}}}+\mathcal {L}_{{\textit{class}}_{{\textit{global}}}}+\lambda \mathcal {L}_{{\textit{entropy}}}. \end{aligned}$$
(10)

\(\lambda \) denotes the balancing hyper-parameter. The collaborative influence of \(\mathcal {N}_{{{\textit{pre}}}}\), \(\mathcal {N}_{{{\textit{local}}}}\), \(\mathcal {N}_{{{\textit{global}} }}\) and the augmentation information allows us to jointly take advantage of two-stage methods (separate feature learning and clustering) and end-to-end methods. Although end-to-end methods can obtain clustering-oriented features, the quality of the extracted features cannot be well ensured and may be limited by the network structure. In our work, we first extract meaningful semantic features, which avoids this shortcoming, and then, during the clustering iterations, \(\mathcal {N}_{{{\textit{local}}}}\) and \(\mathcal {N}_{{{\textit{global}}}}\) are refreshed with the most suitable clustering-oriented semantic features. The training process of SSCN is summarized in Algorithm 1.
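Under our reading of Eqs. (9) and (10), the entropy term is computed from the batch-mean assignment and added to the four ICNNC terms with weight λ. The sketch below assumes the instance- and cluster-level terms have already been computed by helpers such as those sketched earlier; the sign of the entropy regularizer is chosen here so that minimizing it spreads predictions uniformly, which is our interpretation of the balance objective.

```python
import torch

def entropy_balance(p, eps=1e-8):
    """Sketch of the balance term of Eq. (9): M(p) is the mean assignment over
    the batch; minimizing sum_c M(p)_c log M(p)_c pushes M(p) toward uniform."""
    m = p.mean(dim=0)                              # (C,) batch-mean assignment M(p)
    return (m * torch.log(m + eps)).sum()

def total_loss(losses, p, lam=5.0):
    """Sketch of Eq. (10): `losses` holds the four ICNNC terms of Eqs. (4)-(8)."""
    return (losses['inst_local'] + losses['inst_global']
            + losses['cls_local'] + losses['cls_global']
            + lam * entropy_balance(p))
```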

Algorithm 1 Contrastive Deep Spectral Clustering.

Table 2 Evaluation on three image benchmarks

3.3 Deep Spectral Clustering

Spectral clustering consists of three steps: first, construct the similarity graph W and the Laplacian matrix \(L=D-W\), where D denotes the degree matrix; then compute the first k eigenvectors of the Laplacian matrix; finally, apply k-means on the matrix composed of the k eigenvectors. The spectral clustering loss function can be formulated as:

$$\begin{aligned} \mathcal {L}_{{c }}(\theta )=\frac{1}{m^2} \sum _{i, j=1}^m W_{i, j}\left\| y_i-y_j\right\| ^2, \end{aligned}$$
(11)

\(y_{i}\) is the output of the spectral clustering network \(F_\theta \), and m is the size of the minibatch. \(W_{i, j}=w\left( x_i, x_j\right) \) captures the similarity between \(x_{i}\) and \(x_{j}\). As for w, our goal is to have similar points \(x, x^{\prime }\) (i.e., those with a large value of \(w\left( x, x^{\prime }\right) \)) mapped into an embedding space where they are close to each other.
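For reference, Eq. (11) can be evaluated on a minibatch as in the sketch below; `W` would be the batch affinity matrix (e.g., from a Gaussian kernel or a siamese network, per [25]) and `Y` the stacked network outputs, while the function name is ours.

```python
import torch

def spectral_loss(W, Y):
    """Sketch of Eq. (11): (1/m^2) * sum_ij W_ij * ||y_i - y_j||^2."""
    m = Y.size(0)
    sq_dist = torch.cdist(Y, Y, p=2) ** 2     # (m, m) pairwise squared distances
    return (W * sq_dist).sum() / (m ** 2)
```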

Analogously, we would like instance \(x_{i}\) and its neighbors \(\mathcal {N}_{{{\textit{pre}}}}\), \(\mathcal {N}_{{{\textit{local}}}}\), \(\mathcal {N}_{{{\textit{global}}}}\), as well as its augmentation information, to be embedded close to each other. Derived from the same intention, the instance-level loss functions in Eqs. (2), (3), (4) and (5) not only achieve the same goal but also consider more information.

Furthermore, spectral clustering imposes an orthogonal constraint to prevent trivial solutions where all data is mapped to the same output vector:

$$\begin{aligned} \frac{1}{m} Y^T Y=I_{k \times k} \end{aligned}$$
(12)

Y is an \(m \times k\) output matrix whose i-th row is \(y_{i}^T\). To enforce orthogonality, we utilize the final layer of the network, which functions as a linear layer with k inputs and k outputs whose weights are set to produce the orthogonalized result Y per batch. Let the \(m \times k\) matrix \(\tilde{Y}\) denote the inputs to this layer; we apply a linear map computed via QR decomposition to orthogonalize the columns of \(\tilde{Y}\). Specifically, the Cholesky decomposition can be employed to obtain the QR decomposition of any matrix A for which \(A^T A\) is full rank:

$$\begin{aligned} A^T A=C C^T, \end{aligned}$$
(13)

wherein C is a lower triangular matrix, and Q is obtained by setting \(Q=A\left( C^{-1}\right) ^T\). Therefore, the last layer right-multiplies \(\tilde{Y}\) by \(\sqrt{m}(\tilde{L}^{-1})^T\) to orthogonalize it, where \(\tilde{L}\) is derived from the Cholesky decomposition of \(\tilde{Y}^T\tilde{Y}\) and the \(\sqrt{m}\) factor is introduced to fulfill Eq. (12). Every orthogonalization step adjusts the weights of the final layer via this QR decomposition. After training the neural network, all weights are frozen, including those of the last layer, which then works solely as a linear layer. This layer also helps cultivate more distinguishable clustering assignments.
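A minimal sketch of this orthogonalization step follows, assuming a recent PyTorch with `torch.linalg.cholesky` (older versions expose `torch.cholesky`); in practice the computed matrix would be assigned as the weights of the final linear layer rather than recomputed at inference.

```python
import torch

def orthogonalize(Y_tilde):
    """Given the (m, k) inputs to the last layer, return
    Y = sqrt(m) * Y_tilde @ (L^{-1})^T with L the Cholesky factor of
    Y_tilde^T Y_tilde, so that (1/m) * Y^T Y = I, as required by Eq. (12)."""
    m = Y_tilde.size(0)
    gram = Y_tilde.t() @ Y_tilde                    # (k, k), assumed full rank
    L = torch.linalg.cholesky(gram)                 # lower-triangular factor
    W_ortho = (m ** 0.5) * torch.inverse(L).t()     # weights of the final linear layer
    return Y_tilde @ W_ortho
```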

Moreover, in contrastive learning a vast number of negative pairs are generally fed into the network to capture substantially distinguishing features, and data augmentation is needed to construct positive pairs. In our work, we collaboratively use data augmentation and nearest neighbor mining (Eqs. (4), (5), (7), (8)), which helps reduce the cost of using a large number of negative pairs. From the perspective of optimization objectives, spectral clustering and contrastive learning can thus be integrated into a unified optimization objective.

4 Experiments

4.1 Datasets

In this section, we evaluate our method on three widely utilized image datasets:

4.1.1 CIFAR-10

An image dataset comprising 50,000 training and 10,000 testing RGB images, each measuring \(32\times 32\) pixels. The dataset is divided into 10 classes, each containing 6000 images: 5000 training samples and 1000 testing samples.

4.1.2 CIFAR-100

An image dataset that extends CIFAR-10 to 100 classes. It has 50,000 training and 10,000 testing RGB images of size \(32\times 32\). Each class has 500 training samples and 100 testing samples, and there is no overlap between classes. The 20 superclasses of CIFAR-100 are regarded as the ground truth.

4.1.3 STL-10

An ImageNet-sourced dataset containing 13,000 labeled samples of size \(96\times 96\) from 10 classes. Each class contains 500 training images and 800 test images. Unlike CIFAR-10 and CIFAR-100, the images are not cropped or rescaled. The 10 classes are airplane, bird, car, cat, deer, dog, horse, monkey, ship, and truck.

4.2 Evaluation Metrics

We adopt three widely used clustering performance metrics in our experiments: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). For ACC, the dominant ground-truth class within each cluster determines the assigned predicted label. Values for these metrics range from 0 to 1, with higher values indicating better performance.
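As an illustration of this protocol, the sketch below maps each predicted cluster to its dominant ground-truth class before computing ACC and relies on scikit-learn for NMI and ARI; the function name is ours, and Hungarian matching is a common alternative to the dominant-label mapping.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_scores(y_true, y_pred):
    """Map each predicted cluster to its dominant ground-truth class, then
    report ACC together with NMI and ARI."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mapped = np.empty_like(y_true)
    for c in np.unique(y_pred):
        mask = y_pred == c
        mapped[mask] = np.bincount(y_true[mask]).argmax()   # dominant class in cluster c
    acc = float((mapped == y_true).mean())
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ari = adjusted_rand_score(y_true, y_pred)
    return acc, nmi, ari
```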

4.3 Experimental Setup

We implement our work in PyTorch 1.4.0 and apply the Adam optimizer with a learning rate of \(10^{-4}\) and a weight decay of \(10^{-4}\). For the small datasets CIFAR-10/100 and STL-10, we choose SimCLR as the pre-trained model, the network backbone is a standard ResNet18, and the 20 nearest neighbors are searched based on the contrastive feature learning framework. The Faiss library [50] is used to mine the neighbors. In the clustering network, we train all models for 100 epochs with a batch size of 100 on STL-10, 200 on CIFAR-10, and 250 on CIFAR-100. To accelerate the training process, we set K to 1 for both local and global K-nearest-neighbor searching, and the weight \(\lambda \) of the entropy loss is set to 5. Our strong data augmentation is achieved by applying four randomly selected RandAugment [51] transformations.
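The neighbor mining step can be sketched with Faiss as below, assuming L2-normalized float32 embeddings so that inner-product search corresponds to cosine similarity; the function name and the exact index type are our choices, not necessarily those used in the paper.

```python
import numpy as np
import faiss

def mine_neighbors(features, k=20):
    """Return the indices of the top-k nearest neighbors of each embedding
    (the query itself, returned first by the search, is excluded)."""
    feats = np.ascontiguousarray(features.astype(np.float32))
    index = faiss.IndexFlatIP(feats.shape[1])   # inner product == cosine on normalized vectors
    index.add(feats)
    _, idx = index.search(feats, k + 1)         # first hit is the query itself
    return idx[:, 1:]
```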

4.4 Clustering Performance Comparison

We present a comparison of the clustering results of our SSCN method and the following baselines: conventional methods (K-Means [4], SC [16], AC [35], NMF [5]), deep network methods (AE [37], DAE [36], DCGAN [38], DeCNN [39], JULE [40], DEC [41], DAC [42], ADC [43], DDC [44], DCCM [45], IIC [11], PICA [46]), and recent pre-trained feature-based methods (SCAN [49]), as well as IDFD [47], which combines instance discrimination and feature decorrelation, and MiCE [48], which mixes separate contrastive experts in a probabilistic way. According to the results in Table 2, benefiting from the ICNNC loss, our approach outperforms the others on all three evaluation metrics across the three datasets (CIFAR-10, CIFAR-100, and STL-10). In particular, SSCN surpasses MiCE by 1.0% on CIFAR-10 and 4.8% on CIFAR-100 in terms of ACC, and achieves improvements of 1.4% on CIFAR-10 and 4% on CIFAR-100 in terms of NMI. On the STL-10 dataset, our method even surpasses the supervised result.

One significant factor contributing to our method's superiority over the supervised approach on STL-10 is the pre-trained self-supervised semantic embedding space: if we remove the pre-training stage, the clustering performance on STL-10 drops by 25%, resulting in only 56.1% ACC. Conversely, using the pre-trained model followed simply by K-Means yields a 9.7% gain in ACC. Moreover, SCAN [49] also obtains close performance on STL-10; our method, built upon SCAN, incorporates instance-level cohesion and cluster-level repulsion, considers neighbors mined batch-wise and epoch-wise, and constrains the outputs with the orthogonal layer, and can likewise exceed the supervised method. To be clear, the supervised model in the last row of Table 2 is not built upon pre-trained weights.

4.5 Ablation Study

Ablation studies are conducted in this section to explore the impact of different choices in our approach.

Table 3 Effectiveness of the proposed loss on STL-10 dataset

4.5.1 Effectiveness of Proposed Loss

We individually evaluate the impact of each proposed loss function on the STL-10 dataset and show the results in Table 3. We find that the instance-level loss \(\mathcal {L}_{{\textit{instan}}}\) and the class-level loss \(\mathcal {L}_{{\textit{class}}}\) are both essential, while \(\mathcal {L}_{{\textit{instan}}}\) has the more significant impact: with \(\mathcal {L}_{{\textit{instan}}}\), we obtain a 7.4% improvement in ACC. Moreover, the entropy loss \(\mathcal {L}_{{\textit{entropy}}}\) effectively prevents the network from falling into a trivial solution.

Table 4 Effect of number of clustering heads on three datasets

4.5.2 Effectiveness of Clustering Heads

To obtain robust predictions, we adopt a multiple clustering heads strategy. The corresponding results are reported in Table 4. As the number of clustering heads increases from 1 to 5, the performance improves to a certain degree; however, as the number continues to grow, it may either remain steady or degenerate (see the CIFAR-10 and CIFAR-100 rows). All clustering heads share the same global features, so excessive heads may produce more unstable results. Empirically, we set the number of clustering heads to 5.

Fig. 5 Confusion matrices of three datasets

Fig. 6 Three most confident instances on CIFAR-10 and STL-10

4.6 Qualitative Study

This section comprises several studies to investigate our work directly and visually, including class confusion matrices and top-3 most confident instances.

4.6.1 Confusion Matrices

We present the confusion matrices on the three datasets in Fig. 5. All three confusion matrices exhibit a clear block-diagonal structure, indicating that our method effectively clusters instances into their corresponding semantic classes. The most commonly mis-grouped categories are 'cat' and 'dog' in CIFAR-10 (Fig. 5a) and STL-10 (Fig. 5c), and 'household furniture' and 'household electrical devices' in CIFAR-100 (Fig. 5b). Our work tackles this challenge by introducing cluster-level constraints into the framework, and as the ACC/NMI/ARI values increase, the occurrences of mis-clustered cases diminish. Completely eliminating mis-clustering would require 100% clustering accuracy; although some instances of this phenomenon are still observable in our evaluations, it is clearly mitigated according to the ACC/NMI/ARI values. The blurriness of the CIFAR-10/100 images may also contribute to the remaining mis-clustering, as the network mainly focuses on semantic features and might ignore fine-grained differences.

4.6.2 Confident Images

To demonstrate the results more directly, we show the three most confident instances on CIFAR-10 and STL-10 in Fig. 6. As the results on CIFAR-100 are less satisfactory, we do not report its confident instances. Each column represents a class, and the top-3 confident instances are presented in three rows. For CIFAR-10 (Fig. 6a), the three most confident instances clearly represent each cluster. For STL-10 (Fig. 6b), however, a 'cat' instance is misclassified as a 'dog' in the last row of the 'dog' column, and in the 'deer' column the second-row instance, which belongs to 'cat', has a body shape that roughly resembles a deer. This phenomenon further indicates that our work lacks the ability to distinguish fine-grained details between images.

5 Conclusion

We have proposed a semantic spectral clustering with contrastive learning and neighbor mining (SSCN) framework, which performs instance-level pulling and cluster-level pushing cooperatively. Different from previous methods, our proposed method fuses the contrastive strategy, neighbor mining, and the spectral clustering pattern in a single framework. Experimental results on three real datasets show the effectiveness of our proposed method. Additionally, we conduct ablation studies on the number of clustering heads and on the effectiveness of the separate loss functions.