1 Introduction

As a fundamental problem in machine learning, clustering aims to group data instances into several clusters, such that instances from the same cluster share similar semantics while instances from different clusters are dissimilar. Clustering can reveal the inherent semantic structure underlying the data, which benefits downstream analyses such as anomaly detection [84], person re-identification [113], community detection [94], and domain adaptation [109].

In the early stage, various classic clustering methods were developed, such as centroid-based clustering [62], density-based clustering [19], and hierarchical clustering [69]. These shallow methods are grounded in theory and enjoy high interpretability. Later on, some works extended shallow clustering methods to diverse data types, such as multi-view [73, 74, 96, 115] and graph data [71, 85]. Other efforts have been made to improve the scalability [116] of shallow clustering methods.

However, shallow clustering methods partition instances based on the similarity [62] or density [19] of the raw or linearly transformed data. Due to their limited feature extraction ability, shallow clustering methods tend to achieve sub-optimal results when confronted with the complex, high-dimensional, and non-linear data common in the real world. To tackle this challenge, deep clustering techniques incorporate neural networks into clustering methods. In other words, deep clustering simultaneously learns discriminative representations and performs clustering on the learned features, with the two steps progressively benefiting each other.

Over the past few years, many efforts have been devoted to improving clustering performance from various aspects, such as network architectures [8, 72], training strategies [67], and loss functions [39, 122]. However, we would like to highlight that the fundamental challenge of deep clustering is the absence of data annotations. Consequently, the key to deep clustering lies in introducing proper prior knowledge to construct supervision signals. From the early data structure assumptions to the recent data augmentation invariance, the development of deep clustering methods intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods from the perspective of prior knowledge.

Inspired by traditional clustering and dimensionality reduction approaches [4, 83], the early deep clustering methods [32, 77, 89] build upon the structure prior of data. Based on the assumption that the inherent data structure reflects the semantic relations, these methods incorporate classic manifold [83] or subspace learning [99] objectives to optimize the neural network for feature extraction and clustering. The second type of prior knowledge is the distribution prior, which assumes that instances from different clusters follow distinct distributions. Based on such a prior, several generative deep clustering methods [39, 67] learn the latent distribution of samples for data partitioning. In the past few years, the success of contrastive learning spawned a new category of prior knowledge, namely, augmentation invariance. Instead of mining data priors, researchers turned to constructing additional priors with data augmentation techniques. Leveraging the invariance across different augmented samples at both the instance representation and cluster assignment levels, numerous contrastive clustering methods [38, 51] significantly improve feature discriminability and clustering performance. Further, researchers find that instances of the same semantics are likely to be mapped to nearby points in the latent space, and accordingly propose the neighborhood consistency prior. Specifically, by encouraging neighboring samples to have similar cluster assignments, several works [95, 122] alleviate the false-negative problem in the contrastive clustering paradigm, thus advancing the clustering results. Another branch of progress is made based on the pseudo-label prior, namely, that cluster assignments with high confidence are likely to be correct. By selecting confident predictions as pseudo labels, several studies further boost the clustering performance through pseudo-labeling [52, 79] and semi-supervised learning [75]. Very recently, instead of pursuing internal priors from the data itself, some works [7, 53] attempt to introduce abundant external knowledge such as textual descriptions to guide clustering.

In summary, the essence of deep clustering lies in how to find and leverage effective prior knowledge for both feature extraction and cluster assignment. To provide an overview of the development of deep clustering, in this paper we categorize a series of state-of-the-art approaches according to a taxonomy of prior knowledge. We hope such a new perspective on deep clustering will inspire future research in the community. The rest of this paper is organized as follows: First, Section 2 introduces the preliminaries of deep clustering. Section 3 reviews existing deep clustering methods from the prior knowledge perspective. Then, Section 4 provides experimental analyses of deep clustering methods. After that, Section 5 briefly introduces some applications of deep clustering in Vicinagearth security. Lastly, Section 6 summarizes notable trends and challenges for deep clustering.

1.1 Related surveys

We notice that several surveys on deep clustering have been proposed in recent years. Briefly, Min et al. [64] categorize deep clustering methods according to the network architecture. Dong et al. [18] focus on applications of deep clustering. Ren et al. [82] summarize existing methods from the view of data types, such as single- and multi-view data. Zhou et al. [123] discuss various interactions between representation learning and clustering. Distinct from existing surveys, this work systematically provides a new perspective based on prior knowledge, which plays a more intrinsic and essential role in deep clustering.

2 Problem definition

In this section, we introduce the pipeline of deep clustering, including the notation and problem definition. Unless otherwise specified, in this paper we use bold uppercase and lowercase letters to denote matrices and vectors, respectively. The commonly used notations are summarized in Table 1.

Table 1 Commonly used mathematical notations

The deep clustering problem is formally defined as follows: given a set of instances \(\mathcal {D}=\left\{ \textbf{x}_i\right\} _{i=1}^{N}\in \mathcal {X}\) belonging to C classes, deep clustering aims to learn discriminative features and group the instances into C clusters according to their semantics. Specifically, deep clustering methods first learn a deep neural network \(f:\mathcal {X}\rightarrow \mathcal {Z}\) for feature extraction, i.e., \(\textbf{z}_i=f(\textbf{x}_i)\). Given instance features in the latent space, clustering results can be obtained in two ways. The most straightforward way is to apply classic algorithms such as K-means [62] and DBSCAN [19] on the learned features. The other solution is to train an additional cluster head \(h:\mathcal {Z}\rightarrow \mathbb {R}^{C}\) to produce the soft cluster assignment \(\textbf{p}_i=\text {softmax}(h(\textbf{z}_i))\), which satisfies \(\sum \nolimits _{j=1}^{C}\textbf{p}_{ij}=1\). The hard cluster assignment for the i-th instance is then obtained by the \(\arg \max\) operation, namely,

$$\begin{aligned} \tilde{y}_i=\arg \max _j\ \textbf{p}_{ij}, 1\le j\le C. \end{aligned}$$
(1)

The cluster assignments provide the inherent semantic structure underlying the data, which could be utilized in various downstream analyses.
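To make the pipeline concrete, the following is a minimal PyTorch sketch of the two-stage recipe described above; the backbone and cluster head architectures are illustrative placeholders rather than those of any specific method.

```python
import torch
import torch.nn as nn

# A minimal sketch of the generic deep clustering pipeline: a feature
# extractor f and a cluster head h producing soft and hard assignments.
class DeepClusterer(nn.Module):
    def __init__(self, input_dim: int, feature_dim: int, num_clusters: int):
        super().__init__()
        self.f = nn.Sequential(                        # feature extractor f: X -> Z
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, feature_dim),
        )
        self.h = nn.Linear(feature_dim, num_clusters)  # cluster head h: Z -> R^C

    def forward(self, x: torch.Tensor):
        z = self.f(x)                                  # latent features z_i = f(x_i)
        p = torch.softmax(self.h(z), dim=1)            # soft assignments, rows sum to 1
        y_hat = p.argmax(dim=1)                        # hard assignments (Eq. 1)
        return z, p, y_hat

# usage on a random batch
model = DeepClusterer(input_dim=784, feature_dim=128, num_clusters=10)
z, p, y_hat = model(torch.randn(32, 784))
```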

3 Priors for deep clustering

In this section, we review existing deep clustering methods from the perspective of prior knowledge. The priors are illustrated in Fig. 1 and the method categorization is summarized in Table 2.

Fig. 1
figure 1

Six categories of prior knowledge for deep clustering. a Structure Prior: the data structure could reflect the semantic relation between instances. b Distribution Prior: instances from different clusters follow distinct data distributions. c Augmentation Invariance: samples augmented from the same instance have similar features. d Neighborhood Consistency: neighboring samples have consistent cluster assignments. e Pseudo Label: cluster assignments with high confidence are likely to be correct. f External Knowledge: abundant knowledge favorable to clustering exists in open-world data and models

3.1 Structure prior

The structure prior is mostly inspired by traditional clustering methods. Traditional clustering is mainly rooted in assumptions about the structural characteristics of clusters in the data space. For example, K-means [62] aims to learn k cluster centroids, assuming that instances in each cluster form a spherical structure around its centroid. DBSCAN [19] is based on the assumption that a cluster in the data space is a contiguous region of high point density, separated from other such clusters by regions of low point density. Spectral clustering [4] assumes that data lie on a locally linear manifold, so that the local neighborhood relations should be preserved in the latent space; such methods partition instances according to the graph Laplacian. Agglomerative clustering [24] considers the hierarchical structure of data and performs clustering by merging and splitting. Motivated by the success of classic clustering methods, the early exploration of deep clustering mainly focuses on adapting these mature structure priors as objective functions to optimize neural networks.

Motivated by K-means, ABDC [93] iteratively optimizes the data representations and cluster centers, assuming well-structured data in the latent space. As deep extensions of classic spectral clustering, DEN [32], SpectralNet [89], and MvLNet [33, 34] compute the graph Laplacian in the latent space learned by an auto-encoder [5] or SiameseNets [27, 88]. Likewise, DCC [87] extends the core idea of RCC [86] by performing relation matching based on the similarity between latent features; the auto-encoder is then optimized by minimizing the distance between paired instances in the latent space. PARTY [77] is the first deep subspace clustering method, which introduces the sparsity prior and the self-representation property of subspace learning to optimize neural networks. Motivated by the hierarchical structure of clusters, JULE [108] achieves agglomerative deep clustering by progressively merging clusters and optimizing the features.
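To make the structure prior concrete, the following is a minimal Python sketch (NumPy and scikit-learn) of the spectral-style recipe shared by methods such as DEN and SpectralNet: build a k-NN affinity on latent features, form the normalized graph Laplacian, and cluster the spectral embedding. The latent features are assumed to come from a pre-trained auto-encoder, and all hyper-parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

# Spectral clustering on latent features z: k-NN affinity -> normalized
# Laplacian -> smallest eigenvectors -> K-means on the spectral embedding.
def spectral_cluster_latent(z: np.ndarray, n_clusters: int, k: int = 10):
    A = kneighbors_graph(z, k, mode="connectivity", include_self=False).toarray()
    A = np.maximum(A, A.T)                              # symmetrize the k-NN graph
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(z)) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_clusters]                         # smallest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```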

3.2 Distribution prior

The distribution prior assumes that instances with different semantics follow distinct data distributions. Such a prior gives rise to the generative deep clustering paradigm, which employs the variational auto-encoder (VAE) [42] and the generative adversarial network (GAN) [23] to learn the underlying distribution. Instances generated from similar distributions are then grouped together to achieve clustering.

VaDE [39] is the first deep generative clustering method, which models different data distributions by fitting a Gaussian mixture model (GMM) in the latent space. To generate an instance, VaDE first samples a cluster from the prior \(p\left( c\right)\), then draws a latent vector from \(p\left( z\mid c\right)\), and finally reconstructs the instance in the input space via \(p\left( x \mid z\right)\). The cluster assignment and the neural network are jointly optimized by maximizing the log-likelihood of each instance, i.e.,

$$\begin{aligned} \log p({x})=\log \int _{{z}} \sum \limits _c p({x} \mid {z}) p({z} \mid c) p(c) \,\textrm{d}{z}. \end{aligned}$$
(2)

Since directly computing Eq. 2 is intractable, the optimization is approximated by maximizing the evidence lower bound (ELBO) of the variational inference objective, namely,

$$\begin{aligned} \mathcal {L}=\mathbb {E}_{q({z}, c \mid {x})}\left[ \log \frac{p({x, z, c})}{q({z}, c \mid {x})}\right] , \end{aligned}$$
(3)

where \(q({z}, c\mid {x})\) is the variational posterior, which approximates the true posterior. The reparameterization trick introduced in VAE [42] is adopted to make the sampling process differentiable.
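For illustration, the following is a minimal PyTorch sketch of the cluster posterior \(q(c\mid z)\) used in VaDE-style generative clustering, computed from latent codes and GMM parameters; the parameter names and shapes are assumptions for this sketch, not VaDE's actual implementation.

```python
import math
import torch

# Responsibilities q(c|z) ∝ p(c) N(z | mu_c, diag(exp(log_var_c)))
# computed for a batch of latent codes under a diagonal-covariance GMM.
def cluster_posterior(z, log_pi, mu, log_var):
    # z: (N, D), log_pi: (C,), mu: (C, D), log_var: (C, D)
    z = z.unsqueeze(1)                                            # (N, 1, D)
    log_gauss = -0.5 * ((z - mu) ** 2 / log_var.exp()
                        + log_var + math.log(2 * math.pi)).sum(dim=2)  # (N, C)
    return torch.softmax(log_pi + log_gauss, dim=1)               # q(c|z), rows sum to 1
```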

Though the GMM could effectively distinguish distributions, its Gaussian components have been proved to be redundant, which harms the discriminability between different clusters [26]. As an improvement, ClusterGAN [67] and DCGAN [80] adopt a GAN to implicitly learn the latent distributions. Specifically, as shown in Fig. 2, in addition to the continuous latent variable \(\textbf{z}_n\), ClusterGAN introduces a one-hot encoding \(\textbf{z}_c\) to capture the cluster distribution during generation. The objective function of ClusterGAN is formulated as follows:

$$\begin{aligned} \mathcal {L}= & {} \underset{\textbf{x} \sim p_X(\textbf{x})}{\mathbb {E}} q(\mathcal {D}(\textbf{x}))+\underset{\textbf{z} \sim \mathbb {P}_\textbf{z}}{\mathbb {E}} q(1-\mathcal {D}(\mathcal {G}(\textbf{z})))\nonumber \\{} & {} \quad +\beta _n \underset{p_{\mathcal {Z}}(\textbf{z})}{\mathbb {E}}\left\| \textbf{z}_n-\mathcal {E}\left( \mathcal {G}\left( \textbf{z}_n\right) \right) \right\| _2^2\nonumber \\{} & {} \quad +\beta _c \underset{p_{\mathcal {Z}}(\textbf{z})}{\mathbb {E}} \mathcal {H}\left( \textbf{z}_c, \mathcal {E}\left( \mathcal {G}\left( \textbf{z}_c\right) \right) \right) , \end{aligned}$$
(4)

where \(\textbf{z}=(\textbf{z}_n,\textbf{z}_c)\) is the mixed latent variable, \(\mathcal {E}\) is the inverse network which maps data from the raw space to the latent space, \(\mathcal {H}\left( \cdot ,\cdot \right)\) denotes the cross-entropy, and \(\beta _n\), \(\beta _c\) are the weight parameters. The first two terms are consistent with the standard GAN. The last two clustering-specific terms encourage a more distinct cluster distribution and map inputs to the latent space to achieve clustering.

Fig. 2
figure 2

The framework of distribution prior based methods. In addition to the standard continuous latent variable \(\textbf{z}_n\), generative deep clustering methods further introduce a discrete variable \(\textbf{z}_c\) to capture the cluster information

3.3 Augmentation invariance

In recent years, image augmentation methods [91] have gained widespread attention, grounded in the prior that augmentations of the same instance preserve consistent semantic information. This augmentation-invariance property inspires the exploration of how to leverage positive pairs (i.e., different augmentations of the same image) with similar semantic information, as shown in Fig. 3. Notably, mutual-information-based and contrastive-learning-based methods have emerged as pioneers in this realm of deep clustering. In this section, we delve into the fundamental concepts and related works of both families.

Fig. 3
figure 3

The framework of augmentation invariance based methods. Diverse transformations are first applied to augment the input data x, after which the shared deep neural network is utilized to extract features. The augmented samples of the same instance are encouraged to have similar features and cluster assignments

Firstly, mutual information measures the dependence between two continuous random variables X and Y, formally,

$$\begin{aligned} I(X ; Y)=\int _Y \int _X p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)}\right) {\textrm d} x {\textrm d} y, \end{aligned}$$
(5)

where p(x, y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively. In the context of information theory, leveraging the mutual information between variables of positive instances can enhance the optimization of clustering-related information.

IMSAT [30] stands as a typical information-theoretic approach to deep clustering. Its fundamental concept includes enforcing invariance on pair-wise augmented instances and achieving unambiguous and uniform cluster assignments. Specifically, IMSAT encourages the predictions on augmented instances to closely match those on the original instances, i.e.,

$$\begin{aligned} \mathcal {L}=-\sum \limits _{i, k} \textbf{p}_{i k} \log \textbf{p}^{\prime }_{i k} \end{aligned}$$
(6)

where \(\textbf{p}^{\prime }\) is the prediction on the augmented instances. This can be viewed as maximizing the mutual information between the data and its augmentations. Besides, IMSAT implements regularized information maximization, inspired by RIM [43], to keep the cluster assignments unambiguous and uniform. Specifically, IMSAT maximizes the mutual information between instances and their cluster assignments, expressed as:

$$\begin{aligned} I(X ; Y)= & {} H(Y)-H(Y \mid X)\nonumber \\= & {} -\sum \limits _k \textbf{p}_{\cdot k} \log \textbf{p}_{\cdot k}+\frac{1}{N} \sum \limits _{i, k} \textbf{p}_{i k} \log \textbf{p}_{i k}, \end{aligned}$$
(7)

where \(H(\cdot )\) and \(H(\cdot |\cdot )\) denote the entropy and conditional entropy, respectively, and \(\textbf{p}_{\cdot k}=\frac{1}{N} \sum \nolimits _{i} \textbf{p}_{i k}\). Increasing the first term (the marginal entropy H(Y)) encourages uniform cluster assignments, i.e., the number of instances in each cluster tends to be the same. Conversely, decreasing the second term (the conditional entropy \(H(Y\mid X)\)) encourages each instance to be unambiguously assigned to a certain cluster.
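As a reference, the following is a minimal PyTorch sketch of an IMSAT-style objective combining the consistency term of Eq. 6 with the RIM mutual information of Eq. 7, estimated on a batch of soft assignments; the trade-off weight lam is an illustrative hyper-parameter.

```python
import torch

# Consistency between original and augmented predictions (Eq. 6) plus the
# mutual information I(X;Y) = H(Y) - H(Y|X) of cluster assignments (Eq. 7).
def imsat_loss(p, p_aug, lam=0.1, eps=1e-12):
    # p, p_aug: (B, C) soft cluster assignments of original / augmented instances
    consistency = -(p * torch.log(p_aug + eps)).sum(dim=1).mean()      # Eq. 6
    p_mean = p.mean(dim=0)                                             # marginal p(y)
    marginal_entropy = -(p_mean * torch.log(p_mean + eps)).sum()       # H(Y)
    conditional_entropy = -(p * torch.log(p + eps)).sum(dim=1).mean()  # H(Y|X)
    mutual_info = marginal_entropy - conditional_entropy               # Eq. 7
    return consistency - lam * mutual_info   # maximize MI, minimize inconsistency
```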

IIC [38] and Completer [56, 57] take a further step in exploring the mutual information between instances and their augmentations. The fundamental concept is to maximize the mutual information between the cluster assignments of pair-wise augmented instances. Specifically, IIC achieves semantically meaningful clustering and avoids trivial solutions by maximizing the mutual information between the cluster assignments,

$$\begin{aligned} \mathcal {L}= & {} I\left( Z, Z^{\prime }\right) =\sum \limits _i^N I\left( \textbf{z}_i, \textbf{z}_i^{\prime }\right) =I(\textbf{P}),\nonumber \\= & {} \sum \limits _{c=1}^C \sum \limits _{c^{\prime }=1}^C \textbf{P}_{c c^{\prime }} \cdot \ln \frac{\textbf{P}_{c c^{\prime }}}{\textbf{P}_c \cdot \textbf{P}_{c^{\prime }}}, \end{aligned}$$
(8)

where \(\textbf{z}\) and \(\textbf{z}^{\prime }\) are the soft cluster predictions of the original instance x and its augmentation \(\textbf{x}^{\prime }\), respectively. The joint distribution of the cluster variables of \(\textbf{z}\) and \(\textbf{z}^{\prime }\) is given by the matrix \(\textbf{P} \in \mathbb {R}^{C \times C}\), which is computed as

$$\begin{aligned} \textbf{P}=\frac{1}{n} \sum \limits _{i=1}^n \textbf{z}_i \cdot \left( \textbf{z}_i^{\prime }\right) ^{\top }, \end{aligned}$$
(9)

where \(\textbf{P}_{c c^{\prime }}=P\left( z=c, z^{\prime }=c^{\prime }\right)\) denotes the element in the c-th row and \(c^{\prime }\)-th column. The marginals \(\textbf{P}_c=P(z=c)\) and \(\textbf{P}_{c^{\prime }}=P\left( z^{\prime }=c^{\prime }\right)\) can be obtained by summing over the rows and columns of this matrix. Notably, IIC stands out as one of the earliest deep clustering methods designed entirely within the framework of information theory, distinguishing itself from IMSAT.
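For illustration, a minimal PyTorch sketch of the IIC objective (Eqs. 8 and 9) is given below; the symmetrization and clamping are common numerical safeguards rather than part of the formulation itself.

```python
import torch

# Estimate the joint assignment matrix P from two augmented views and
# maximize the mutual information between their cluster variables.
def iic_loss(z, z_prime, eps=1e-12):
    # z, z_prime: (N, C) softmax cluster assignments of the two views
    P = z.t() @ z_prime / z.shape[0]              # joint distribution, Eq. 9
    P = ((P + P.t()) / 2).clamp(min=eps)          # symmetrize for numerical stability
    Pc = P.sum(dim=1, keepdim=True)               # row marginal P(z = c)
    Pc_prime = P.sum(dim=0, keepdim=True)         # column marginal P(z' = c')
    mi = (P * (P.log() - Pc.log() - Pc_prime.log())).sum()   # Eq. 8
    return -mi                                    # minimize negative mutual information
```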

Similar to mutual-information-based methods, contrastive-learning-based methods treat samples augmented from the same instance as positive pairs and the rest as negatives. Let \(\textbf{z}_{2i}\) and \(\textbf{z}_{2i-1}\) represent the two augmented representations of the i-th instance; the contrastive loss is formulated as:

$$\begin{aligned} \mathcal {L}= & {} \sum \limits _i^N\left( \ell \left( 2i,2i-1\right) +\ell \left( 2i-1,2i\right) \right) ,\nonumber \\ \ell \left( i,j\right)= & {} -\log \frac{\exp \left( {s} \left( \textbf{z}_{i}, \textbf{z}_{j}\right) / \tau \right) }{\sum \nolimits _{k=1}^{2 N} \textbf{1}_{[k \ne i]} \exp \left( {s} \left( \textbf{z}_{i}, \textbf{z}_{k}\right) / \tau \right) }, \end{aligned}$$
(10)

where \(\ell \left( i, j\right)\) denotes the pairwise contrastive loss and \(\tau\) controls the temperature of the softmax. The function \(\text {s} \left( \textbf{z}_{i}, \textbf{z}_{j}\right)\) measures the similarity between representations \(\textbf{z}_{i}\) and \(\textbf{z}_{j}\). This loss pulls the representations of positive pairs closer while pushing them away from negative instances, encouraging meaningful clustering patterns.
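As a reference, the following is a minimal PyTorch sketch of this instance-level contrastive loss (Eq. 10) in the NT-Xent style; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

# The two augmented views of each instance form the positive pair; all other
# samples in the batch act as negatives.
def instance_contrastive_loss(z1, z2, tau=0.5):
    # z1, z2: (N, D) features of two augmented views of the same N instances
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D)
    sim = z @ z.t() / tau                                       # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                       # exclude self pairs
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, pos)   # softmax over all others, positive as target
```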

Notably, some theoretical works [58, 66, 76] have demonstrated that contrastive learning is equivalent to maximizing mutual information at the instance level. Motivated by this observation, researchers have further explored the application of the contrastive loss at the cluster level, which proves beneficial for deep clustering. PICA [31] is one of the pioneering works in this domain. Its fundamental concept is to maximize the similarity between the cluster assignments of the original and augmented data, which can be viewed as conducting contrastive learning [59] at the cluster level. Motivated by PICA, CC [51] and DRC [121] conduct contrastive learning at both the instance level and the cluster level. In particular, the cluster-level contrastive loss helps learn discriminative cluster assignments, which is the key to the clustering task. Formally, the cluster-level contrastive loss is

$$\begin{aligned} \mathcal {L}= & {} \frac{1}{2 C} \sum \limits _{i=1}^C\left( \ell \left( 2i-1,2i\right) \! +\! \ell \left( 2i,2i-1\right) \right) \! -\! H(\textbf{Y}),\nonumber \\ \ell \left( i,j\right)= & {} -\log \frac{\exp \left( s\left( \textbf{y}_i, \textbf{y}_j\right) / \tau \right) }{\sum \nolimits _{k=1}^{2 C} \textbf{1}_{[k \ne i]}\left[ \exp \left( s\left( \textbf{y}_{i}, \textbf{y}_k\right) / \tau \right) \right] }, \end{aligned}$$
(11)

where \(\textbf{y}_i \in \mathbb {R}^{1\times N}\) is the cluster-level representation (i.e., the i-th column of the assignment matrix over a batch of N samples) and \(\tau\) is the cluster-level temperature parameter. \(H(\textbf{Y}) = H(\textbf{Y}^1)+H(\textbf{Y}^2)\) is the entropy of the cluster assignment probabilities of the two augmented views. The inclusion of \(H(\textbf{Y})\) helps avoid the trivial solution where most instances are assigned to the same cluster. Notably, the utilization of contrastive learning at the cluster level in CC and DRC has inspired subsequent works in the field.
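For illustration, a minimal PyTorch sketch of the CC-style cluster-level contrastive loss with the entropy term (Eq. 11) is given below; hyper-parameters are illustrative.

```python
import torch
import torch.nn.functional as F

# Columns of the assignment matrices act as cluster representations; an
# entropy term discourages the trivial solution of one dominant cluster.
def cluster_contrastive_loss(p1, p2, tau=1.0, eps=1e-12):
    # p1, p2: (N, C) soft assignments of two augmented views; columns y_i in R^N
    y = F.normalize(torch.cat([p1.t(), p2.t()], dim=0), dim=1)     # (2C, N)
    sim = y @ y.t() / tau
    c = p1.shape[1]
    sim.masked_fill_(torch.eye(2 * c, dtype=torch.bool, device=y.device), float("-inf"))
    pos = torch.cat([torch.arange(c, 2 * c), torch.arange(0, c)]).to(y.device)
    contrast = F.cross_entropy(sim, pos)
    entropy = 0.0                                                  # H(Y) = H(Y^1) + H(Y^2)
    for p in (p1, p2):
        freq = p.mean(dim=0)                                       # cluster usage
        entropy = entropy - (freq * torch.log(freq + eps)).sum()
    return contrast - entropy
```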

TCC [90] takes a step further in exploring the interaction between instance-level and cluster-level representations. The core idea is to build a unified representation that combines the cluster semantics and the instance features, enhancing the representation with cluster information to facilitate the clustering task. Formally, for an instance representation \(\textbf{z}_i\), the enhanced representation is given by:

$$\begin{aligned} \hat{\textbf{z}}_i=\left( \textbf{z}_i+ \text {NN}_{\theta }\left( \textbf{c}_i \right) \right) /\Vert \textbf{z}_i+ \text {NN}_{\theta }\left( \textbf{c}_i \right) \Vert _{2}, \end{aligned}$$
(12)

where \(\textbf{c}_i\) represents the cluster assignment of the i-th instance after the Gumbel-Softmax, and \(\text {NN}_{\theta }\left( \cdot \right)\) denotes a single fully connected network that serves as the learnable cluster representation. Different from CC, which applies the contrastive loss to cluster assignments, TCC applies the contrastive loss to the unified representation to better capture cluster semantics. Inspired by TCC, some works [49, 106] explore the fusion of instance-level and cluster-level representations in various domains and likewise apply the contrastive loss to the unified representation, further verifying its effectiveness.

3.4 Neighborhood consistency

Thanks to the advancements in self-supervised representation learning, the features acquired through discriminative pretext tasks can unveil high-level semantics in the latent space. This provides a crucial prior for clustering: instances and their neighbors in the latent space are likely to belong to the same semantic cluster. Leveraging such neighborhood-consistent semantics can further enhance clustering, as shown in Fig. 4.

Fig. 4
figure 4

The framework of neighborhood consistency-based methods. Such a paradigm encourages neighboring samples \(z_{i}\) and \(z_{p}\) in the latent space to have consistent features and cluster assignments, which improves the compactness of clusters

SCAN [95] first observes that similar instances are mapped close to each other in the latent space by self-supervised pretext tasks. Motivated by this observation, SCAN trains a cluster head based on the consistency of cluster assignments within neighborhoods. Specifically, SCAN first obtains an encoder f via a pretext task [22, 29, 102, 117]. It then optimizes a cluster head h by requiring it to make consistent predictions between instances and their nearest neighbors:

$$\begin{aligned} \mathcal {L}=-\frac{1}{B}\sum \limits _{i=1}^{B}\sum \limits _{j\in \mathcal {N}^k_i}\log \langle \textbf{p}_{i}, \textbf{p}_j\rangle - \lambda H(Y). \end{aligned}$$
(13)

Here \(\mathcal {N}^{k}_{i}\) denotes the k-nearest neighbors of the i-th instance. The second term in Eq. 13, which also appears in Eq. 11, prevents h from assigning all instances to a single cluster.
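As a reference, the following is a minimal PyTorch sketch of the SCAN objective in Eq. 13, assuming one pre-computed neighbor is sampled per anchor; the weight lam is an illustrative hyper-parameter.

```python
import torch

# Encourage an instance and its nearest neighbor to share cluster assignments,
# with an entropy term that avoids collapsing into a single cluster.
def scan_loss(p, p_neighbors, lam=5.0, eps=1e-12):
    # p: (B, C) assignments of anchors; p_neighbors: (B, C) assignments of
    # one sampled nearest neighbor per anchor
    consistency = -torch.log((p * p_neighbors).sum(dim=1) + eps).mean()
    freq = p.mean(dim=0)                                   # cluster usage
    entropy = -(freq * torch.log(freq + eps)).sum()        # H(Y)
    return consistency - lam * entropy
```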

NNM [15] and GCC [122] incorporate neighborhood information into the contrastive learning framework to group instances within neighborhoods. In particular, NNM aligns the cluster assignment of an instance with those of its neighbors through cluster-level contrastive learning:

$$\begin{aligned} \mathcal {L}=-\frac{1}{C}\sum \limits _{i=1}^C\log \frac{\exp ({s}(\textbf{q}_i,\textbf{q}_{\mathcal {N}i}))}{\sum \nolimits _{j=1}^{C}\exp ({s}(\textbf{q}_i,\textbf{q}_j))}, \end{aligned}$$
(14)

where \(\textbf{q},~\textbf{q}_{\mathcal {N}}\in \mathbb {R}^{C\times B}\) denote the transposes of the assignment matrices \(\textbf{p}\) and \(\textbf{p}_{\mathcal {N}}\), respectively. In contrast, GCC introduces the graph structure of the latent space to modify the vanilla instance-level contrastive loss. It constructs a normalized symmetric graph Laplacian \(\textbf{L}\) based on the k-NN graph:

$$\begin{aligned} \textbf{L}= & {} \textbf{I}-\textbf{D}^{-\frac{1}{2}}\textbf{A}\textbf{D}^{-\frac{1}{2}},\nonumber \\ \text {with}\ \textbf{A}_{ij}= & {} {\left\{ \begin{array}{ll} 1, &{}\text {if}\ j\in \mathcal {N}^{k}_{i}\text { or } i\in \mathcal {N}^{k}_{j},\\ 0, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$
(15)

Then, the loss function is given by the following form:

$$\begin{aligned} \mathcal {L}=-\frac{1}{N}\sum \limits _{i=1}^N\log \frac{\sum \nolimits _{\textbf{L}_{ij}<0}-\textbf{L}_{ij}\exp ({s} (\textbf{z}_i,\textbf{z}_j)/\tau )}{\sum \nolimits _{\textbf{L}_{ij}=0}\exp ({s}(\textbf{z}_i,\textbf{z}_j)/\tau )}, \end{aligned}$$
(16)

where \(\tau\) is the temperature. The graph Laplacian guides the model to attract instances within neighborhoods rather than only augmentations of themselves, so that the influence of potential false-negative samples [110, 112] is mitigated. As a result, GCC better minimizes the intra-cluster variance and maximizes the inter-cluster variance. The success of this approach has inspired numerous contrastive learning methods [37, 61] in various domains to leverage neighborhood relationships to address the false-negative challenge.
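For illustration, a minimal PyTorch sketch of the graph-weighted contrastive loss of Eqs. 15 and 16 is given below; the k-NN indices are assumed to be pre-computed in the latent space, and the sketch is a simplification rather than GCC's actual implementation.

```python
import torch
import torch.nn.functional as F

# Off-diagonal negative entries of the normalized Laplacian (k-NN neighbors)
# weight the positives; non-neighbors (L_ij = 0) serve as negatives.
def graph_contrastive_loss(z, knn_indices, tau=0.5, eps=1e-12):
    # z: (N, D) latent features; knn_indices: (N, k) long tensor of k nearest
    # neighbors of each instance (excluding the instance itself)
    n = z.shape[0]
    A = torch.zeros(n, n, device=z.device)
    A.scatter_(1, knn_indices, 1.0)
    A = torch.maximum(A, A.t())                             # symmetric k-NN graph
    d = A.sum(dim=1)
    D_inv_sqrt = torch.diag(1.0 / torch.sqrt(d.clamp(min=eps)))
    L = torch.eye(n, device=z.device) - D_inv_sqrt @ A @ D_inv_sqrt   # Eq. 15
    zn = F.normalize(z, dim=1)
    sim = torch.exp(zn @ zn.t() / tau)                      # exp(s(z_i, z_j)/tau)
    pos = ((-L).clamp(min=0) * sim).sum(dim=1)              # weighted neighbors
    neg = ((L == 0).float() * sim).sum(dim=1)               # non-neighbors
    return -torch.log(pos / (neg + eps) + eps).mean()       # Eq. 16
```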

3.5 Pseudo-labeling

As a prevalent paradigm in semi-supervised classification [6, 47, 92], pseudo-labeling has been extended to deep clustering in recent years. Its fundamental assumption is that predictions on unlabeled data, especially the confident ones, can provide reliable supervision to guide model training. Motivated by this, recent deep clustering works leverage confident predictions to boost clustering performance.

DEC [104] is a pioneering work that utilizes labels generated by itself to simultaneously enhance feature representations and optimize clustering assignments. DEC initializes with a pre-trained auto-encoder and C learnable cluster centroids. The soft assignment is calculated using the Student’s t-distribution, based on the distance between the representation \(\textbf{z}_i\) and centroid \(\textbf{c}_j\):

$$\begin{aligned} \textbf{q}_{ij}=\frac{(1+\Vert \textbf{z}_i-\textbf{c}_j\Vert ^2 / \alpha )^{-\frac{\alpha +1}{2}}}{\sum \nolimits _{k}(1+\Vert \textbf{z}_i-\textbf{c}_{k}\Vert ^2/ \alpha )^{-\frac{\alpha +1}{2}}}, \end{aligned}$$
(17)

where \(\alpha\) is a hyper-parameter and \(\textbf{q}_{ij}\) denotes the probability of assigning instance i to cluster j. DEC refines the clusters by emphasizing high-confidence assignments and making predictions more confident. Specifically, DEC uses the square of \(\textbf{q}_{ij}\) as a sharpened target to guide the training, i.e.,

$$\begin{aligned} \textbf{p}_{ij}=\frac{\textbf{q}_{ij}^2 / \text {freq}_j}{\sum \nolimits _{k}\textbf{q}_{ik}^2/\text {freq}_{k}}, \end{aligned}$$
(18)

where \(\text {freq}_j=\sum \nolimits _i \textbf{q}_{ij}\) is the soft cluster frequency, by which the sharpened assignment is normalized to prevent feature collapse. Finally, a KL divergence loss between \(\textbf{p}\) and \(\textbf{q}\) minimizes the distance between the two distributions, i.e., \(\mathcal {L}=\text {KL}(\textbf{p}\,\Vert \,\textbf{q})\).
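As a reference, the following is a minimal PyTorch sketch of DEC's self-training signal: the Student's t soft assignment (Eq. 17), the sharpened target (Eq. 18), and the KL objective.

```python
import torch
import torch.nn.functional as F

# Soft assignments q from distances to learnable centroids, sharpened target p,
# and the KL(p || q) objective used for self-training.
def dec_loss(z, centroids, alpha=1.0):
    # z: (N, D) latent features; centroids: (C, D) learnable cluster centers
    dist_sq = ((z.unsqueeze(1) - centroids) ** 2).sum(dim=2)          # (N, C)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    q = q / q.sum(dim=1, keepdim=True)                                # Eq. 17
    freq = q.sum(dim=0)                                               # soft cluster frequency
    p = q ** 2 / freq
    p = (p / p.sum(dim=1, keepdim=True)).detach()                     # Eq. 18, fixed target
    return F.kl_div(q.log(), p, reduction="batchmean")                # KL(p || q)
```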

Another notable pseudo-labeling method is DeepCluster [8]. As illustrated in Fig. 5, this approach applies K-means clustering to the learned representations to obtain cluster assignments as pseudo-labels. DeepCluster iteratively performs representation learning and clustering so that the two steps bootstrap each other. However, DeepCluster faces limitations in achieving outstanding performance, primarily due to the restricted semantics of the initial representation. Similar to DeepCluster, ProPos [35] proposes an EM framework of pseudo-labeling, iteratively performing K-means to obtain pseudo labels (E step) and updating the representation (M step). Notably, ProPos significantly outperforms DeepCluster and other methods because it performs K-means on features learned by the state-of-the-art self-supervised paradigm BYOL [25]. This observation demonstrates that the semantic quality of the representation is vital to pseudo-label generation and clustering: low-quality features introduce noise into the pseudo-labels, affect subsequent pseudo-label generation, and mislead representation learning, accumulating errors throughout the process.

Fig. 5
figure 5

The framework of pseudo-labeling based methods. Given features in the latent space, clustering algorithms such as K-means are performed to get pseudo labels. The pseudo labels, usually filtered by confidence, are then used as supervision signals to guide clustering

In addition to the progress of self-supervised paradigms, researchers are actively investigating strategies to alleviate the error accumulation issue in pseudo-labeling. To be specific, the challenges of pseudo-labeling in deep clustering remain two-fold: enhancing the accuracy of the generated pseudo-labels and maximizing the utility of these pseudo-labels for effective clustering. On the one hand, inaccurate pseudo-labels pose a risk of degrading the clustering performance. On the other hand, determining how to effectively leverage the pseudo-labels for clustering is a critical consideration. These two challenges underscore the ongoing efforts in pseudo-labeling for deep clustering.

The first challenge has been addressed by many works through carefully designed selection methods. For instance, SCAN [95] empirically observes that instances with highly confident predictions (i.e., \(\max (\textbf{p}_{i})\approx 1\)) tend to be correctly clustered by the cluster head. Building on this insight, SCAN chooses the instances with the most confident predictions as labeled data to fine-tune the model using the cross-entropy loss,

$$\begin{aligned} \mathcal {L}= & {} \frac{1}{|Y|} \sum \limits _{i\in Y}-\tilde{y}_i\log (\textbf{p}_i),\nonumber \\ Y= & {} \left\{ i\mid \text {conf}_i\ge \eta \right\} , \text {with conf}_i=\max (\textbf{p}_{i}) \end{aligned}$$
(19)

where \(\eta\) is a threshold hyper-parameter that filters out uncertain instances. TCL [52] and SPICE [75] devise more effective selection strategies to enhance the accuracy of pseudo-labeling. Specifically, TCL selects the most confident predictions as pseudo labels from each cluster c:

$$\begin{aligned} Y^{c}= & {} \left\{ \text {topK}(\text {conf}_i)\mid \tilde{y}_i=c\right\} \nonumber \\ Y= & {} \bigcup _{c=1}^{C}Y^{c} \end{aligned}$$
(20)

where \(\text {topK}(\cdot )\) returns the indices of the top-K most confident instances and \(\bigcup\) denotes the union of the pseudo labels from all clusters. Here \(K=\gamma N/C\), where \(\gamma\) is the selection ratio. Compared with threshold-based criteria, the cluster-wise selection leads to more class-balanced pseudo labels, which improves the clustering performance, especially for challenging classes.
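For illustration, a minimal PyTorch sketch of this cluster-wise confident selection (Eq. 20) is given below; the selection ratio gamma is an illustrative hyper-parameter.

```python
import torch

# From each predicted cluster, keep the top-K most confident instances as
# pseudo-labels, with K = gamma * N / C.
def select_pseudo_labels(p, gamma=0.5):
    # p: (N, C) soft cluster assignments
    conf, y_tilde = p.max(dim=1)                      # confidence and hard labels
    n, c = p.shape
    k = max(1, int(gamma * n / c))
    selected = []
    for cluster in range(c):
        idx = (y_tilde == cluster).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue                                  # empty cluster, nothing to select
        top = conf[idx].topk(min(k, idx.numel())).indices
        selected.append(idx[top])
    selected = torch.cat(selected)
    return selected, y_tilde[selected]                # indices and their pseudo labels
```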

SPICE introduces a prototype-based pseudo-labeling approach. Specifically, it first re-computes the centroid of each cluster using only the instances with confident predictions, and then re-assigns each instance a new pseudo label according to its similarity to the new centroids, formally:

$$\begin{aligned} \textbf{c}_c^{\prime }= & {} \frac{1}{|Y^{c}|}\sum \limits _{i\in Y^{c}}\textbf{z}_i,\nonumber \\ \tilde{y}_i^{\prime }= & {} \arg \max _{j}\text {s}(\textbf{z}_i,\textbf{c}^{\prime }_{j}). \end{aligned}$$
(21)

This operation helps mitigate the influence of potentially incorrect pseudo labels on the centroid computation, which would otherwise accumulate errors during the iterative self-training process.
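As a reference, a minimal PyTorch sketch of the prototype-based re-labeling in Eq. 21 is given below; it assumes every cluster has at least one confidently labeled member.

```python
import torch
import torch.nn.functional as F

# Recompute each centroid from its confidently-labeled members, then re-assign
# every instance to its most similar new centroid.
def prototype_relabel(z, selected_idx, selected_labels, num_clusters):
    # z: (N, D) features; selected_idx / selected_labels: confident pseudo-labels
    centroids = torch.stack([
        z[selected_idx[selected_labels == c]].mean(dim=0)   # c'_c in Eq. 21
        for c in range(num_clusters)
    ])
    sim = F.normalize(z, dim=1) @ F.normalize(centroids, dim=1).t()
    return sim.argmax(dim=1)                                 # new pseudo labels for all instances
```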

To address the second challenge, i.e., better utilizing the confident labels, TCL removes negative pairs that share the same pseudo label in the contrastive loss, preventing intra-class instances from being pushed apart (the false-negative issue). Meanwhile, SPICE and TCL adopt semi-supervised classification techniques such as FixMatch [92], which impose pseudo-label consistency on strong augmentations of the same instance. The impressive results achieved by these works show the effectiveness of combining reliable pseudo-labeling methods with semi-supervised paradigms in clustering.

3.6 External knowledge

Most clustering approaches focus on grouping data based on inherent characteristics such as the structure prior, the distribution prior, and the augmentation invariance prior. As shown in Fig. 6, instead of pursuing internal priors from the data itself, some recent works [7, 53] attempt to introduce abundant external knowledge, such as textual descriptions, to guide clustering. These methods prove effective because the semantic information from natural language offers valuable supervision signals that enhance the quality of clustering.

Fig. 6
figure 6

The framework of external knowledge based methods. Instead of mining internal priors from the samples themselves, such a paradigm seeks external information like textual semantics to help distinguish the given samples

SIC [7] is one of the first works to incorporate external knowledge guidance into clustering. Its fundamental concept revolves around generating image pseudo-labels from a textual space pre-trained by CLIP [81]. The process involves three main steps: i) Construction of the semantic space: SIC selects meaningful texts resembling category names to build a semantic space. ii) Pseudo-labeling: pseudo-labels are generated using the text semantic centers \(\textbf{h}\) and image representations \(\textbf{z}_i\), formally,

$$\begin{aligned} \textbf{q}_i=\text {one-hot}\left( c, \arg \max _l \frac{\exp \left( \textbf{z}_i^T \textbf{h}_l\right) }{\sum \nolimits _{l^{\prime }=1}^{c} \exp \left( \textbf{z}_i^T \textbf{h}_{l^{\prime }}\right) }\right) , \end{aligned}$$
(22)

where c is the number of semantic centers, \(\textbf{h}_l\) is the l-th semantic center, and the one-hot operator generates a c-bit one-hot vector. The pseudo-labels are utilized to guide the clustering, similar to SCAN [95],

$$\begin{aligned} \mathcal {L}=\frac{1}{n} \sum \limits _{i=1}^n C E\left( \textbf{q}_i, \textbf{p}_i\right) , \end{aligned}$$
(23)

where \(CE\left( \cdot \right)\) is the cross-entropy function. iii) Consistency learning: the clustering is further enhanced by enforcing consistency between images and their neighbors in the image space,

$$\begin{aligned} \mathcal {L}=-\frac{1}{n} \sum \limits _{i=1}^n \log \textbf{p}_i^T \textbf{p}_j, \end{aligned}$$
(24)

where j is an instance index randomly selected from the nearest neighbors \(\mathcal {N}_k\left( \textbf{z}_i\right)\) of the i-th instance. Note that SIC essentially pulls image embeddings closer to embeddings in the semantic space, while leaving the text semantic embeddings unimproved.
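For illustration, the following is a minimal PyTorch sketch of the semantic-space pseudo-labeling in Eq. 22, assuming the image features and text semantic centers already live in a shared (e.g., CLIP-style) embedding space.

```python
import torch
import torch.nn.functional as F

# Assign each image to its nearest text semantic center and return one-hot
# pseudo-labels, as in Eq. 22.
def semantic_pseudo_labels(z, h):
    # z: (N, D) image representations; h: (c, D) text semantic centers
    logits = z @ h.t()                                # z_i^T h_l
    labels = torch.softmax(logits, dim=1).argmax(dim=1)
    return F.one_hot(labels, num_classes=h.shape[0]).float()   # c-bit one-hot vectors
```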

TAC [53] focuses on leveraging textual semantics to enhance the feature discriminability. Specifically, it retrieves a text counterpart among representative nouns for each image, which improves K-means performance without any additional training. Besides, TAC proposes a mutual distillation paradigm to incorporate the image and text modalities, which further improves the clustering performance. The cross-modal mutual distillation strategy is formulated as follows:

$$\begin{aligned} \mathcal {L}= & {} \sum \limits _{i=1}^C \left( \mathcal {L}_i^{v\rightarrow t}+\mathcal {L}_i^{t\rightarrow v}\right) ,\nonumber \\ \mathcal {L}_i^{v \rightarrow t}= & {} -\log \frac{\exp \left( \text {sim}\left( \hat{\textbf{q}}_i, \hat{\textbf{p}}_i^{\mathcal {N}}\right) / \tau \right) }{\sum \nolimits _{k=1}^K \exp \left( \text {sim}\left( \hat{\textbf{q}}_i, \hat{\textbf{p}}_k^{\mathcal {N}}\right) / \tau \right) },\nonumber \\ \mathcal {L}_i^{t \rightarrow v}= & {} -\log \frac{\exp \left( \text {sim}\left( \hat{\textbf{p}}_i, \hat{\textbf{q}}_i^{\mathcal {N}}\right) / \tau \right) }{\sum \nolimits _{k=1}^K \exp \left( \text {sim}\left( \hat{\textbf{p}}_i, \hat{\textbf{q}}_k^{\mathcal {N}}\right) / \tau \right) }, \end{aligned}$$
(25)

where \(\tau\) is the softmax temperature parameter, \(\hat{\textbf{p}}_i,\hat{\textbf{q}}_i\in \mathbb {R}^{1 \times N}\) are the i-th columns of the image and text assignment matrices, and \(\hat{\textbf{p}}_i^{\mathcal {N}}, \hat{\textbf{q}}_i^{\mathcal {N}}\in \mathbb {R}^{1 \times N}\) are the i-th columns of the image and text random nearest-neighbor matrices. The mutual distillation strategy has two advantages. On the one hand, it produces more discriminative clusters through the cluster-level contrastive loss. On the other hand, it encourages consistent cluster assignments between each sample and its cross-modal neighbors, which bootstraps the clustering performance in both modalities.

Table 2 The summary of deep clustering methods from the perspective of prior knowledge

4 Experiment

In this section, we introduce the evaluation of deep clustering. Briefly, we first present the evaluation metrics and common benchmarks. Then we analyze the results of the existing deep clustering methods.

4.1 Evaluation metrics

For clustering evaluation, three metrics are commonly used to measure how well the predicted cluster assignments \(\tilde{y}\) match the ground truth labels y: accuracy (ACC), normalized mutual information (NMI), and adjusted rand index (ARI). Higher values correspond to better clustering performance. The three metrics are defined as follows (a minimal computation sketch is given after the list):

  • ACC [1] measures the proportion of correctly clustered instances:

    $$\begin{aligned} \text {ACC}=\frac{1}{N}\sum \limits _{i=1}^N \textbf{1}\{y_i=\tilde{y}_i\}, \end{aligned}$$
    (26)

    where the Hungarian matching [45] is first applied to align the predictions and labels.

  • NMI [63] quantifies the mutual information between the predicted labels \(\tilde{\textbf{Y}}\) and ground truth labels \(\textbf{Y}\):

    $$\begin{aligned} \text {NMI}=\frac{I(\tilde{\textbf{Y}}; \textbf{Y})}{\frac{1}{2}[H(\tilde{\textbf{Y}})+H(\textbf{Y})]}, \end{aligned}$$
    (27)

    where \(H(\textbf{Y})\) denotes the entropy of Y and \(I(\tilde{\textbf{Y}}; \textbf{Y})\) denotes the mutual information between \(\tilde{\textbf{Y}}\) and \(\textbf{Y}\).

  • ARI [36] is the normalization of the rand index (RI), which counts the numbers of instance pairs assigned to the same cluster and to different clusters:

    $$\begin{aligned} \text {RI}=\frac{\text {TP}+{\text {TN}}}{\text {C}^2_N}, \end{aligned}$$
    (28)

    where \(\text {TP}\) and \(\text {TN}\) refer to the numbers of true positive and true negative pairs, and \(\text {C}^2_N\) is the number of possible instance pairs. ARI is computed with the following normalization:

    $$\begin{aligned} \text {ARI}=\frac{\text {RI}-\mathbb {E}(\text {RI})}{\text {max}(\text {RI})-\mathbb {E}\left( \text {RI}\right) }, \end{aligned}$$
    (29)

    where \(\mathbb {E}(\text {RI})\) denotes the expectation of RI.
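As a reference, the three metrics can be computed with common Python libraries as in the minimal sketch below; ACC uses the Hungarian matching of Eq. 26, while NMI and ARI are taken from scikit-learn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Clustering accuracy with the optimal cluster-to-class mapping (Eq. 26).
def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    c = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((c, c), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                               # co-occurrence counts
    row, col = linear_sum_assignment(-cost)           # Hungarian matching
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {
        "ACC": clustering_accuracy(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```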

4.2 Datasets

In the early stage, deep clustering methods were evaluated on relatively small and low-dimensional datasets (e.g., COIL-20 [70], YaleB [21]). Recently, with the rapid development of deep clustering methods, it has become more common to evaluate clustering performance on more complex and challenging datasets. Five benchmark datasets are widely used:

  • CIFAR-10 [44] consists of 60,000 colored images from 10 different classes including airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

  • CIFAR-100 [44] contains 100 classes grouped into 20 superclasses. Each image comes with a “fine” class label and a “coarse” superclass label.

  • STL-10 [13] contains 13,000 labeled images from 10 object classes. Besides, it provides 100,000 unlabeled images for self-supervised learning to enhance the clustering performance.

  • ImageNet-10 [9] is a subset of the ImageNet dataset [17]. It contains 10 classes, each with 1,300 high-resolution images.

  • ImageNet-Dog [9] is another subset of ImageNet. It consists of images belonging to 15 dog breeds, which is suitable for fine-grained clustering tasks.

Apart from them, some recent works employ two more challenging large-scale datasets, Tiny-ImageNet [48] and ImageNet-1K [17], to evaluate the effectiveness and efficiency. A brief description of these datasets is summarized in Table 3.

Table 3 A summary of datasets commonly used for deep clustering

4.3 Performance comparisons

The clustering performance on five widely used datasets is shown in Table 4. Thanks to the feature extraction ability of deep neural networks, early deep clustering methods based on the structure and distribution priors achieve much better performance than the classic K-means. A series of contrastive clustering methods then significantly improve the performance by introducing additional priors through data augmentation. After that, more advanced methods boost the performance by further exploiting neighborhood consistency (GCC compared with CC) and pseudo labels (SCAN compared with SCAN\(^*\)). Notably, the performance gains of different priors are complementary. For example, ProPos remarkably outperforms DEC and CC by additionally utilizing the augmentation invariance and pseudo-labeling priors, respectively. Very recently, external-knowledge-based methods achieved state-of-the-art performance, which proves the promising prospect of this new deep clustering paradigm. In addition, clustering becomes more challenging when the number of categories grows (from CIFAR-10 to CIFAR-100) or the semantics become more complex (from CIFAR-10 to ImageNet-Dogs). Such results suggest that more challenging datasets, such as the full ImageNet-1K, should serve as benchmarks in future works.

Table 4 Clustering performance on five widely-used image clustering datasets. SCAN\(^*\) denotes the clustering results using only neighborhood consistency loss without the self-labeling step. \(\dagger\) denotes using the train and test split for training and testing respectively, instead of using both splits for training and testing. Horizontal lines separate methods with different priors. From top to bottom are structure prior, distribution prior, augmentation invariance, neighborhood consistency, pseudo-labeling, and external knowledge

5 Application in Vicinagearth

In this section, we explore some typical applications of deep clustering within the domain of Vicinagearth, a term crafted from the fusion of "Vicinage" and "Earth." Vicinagearth denotes the critical spatial expanse ranging from 1,000 meters below sea level (the depth at which sunlight ceases to penetrate) to 10,000 meters above sea level (the typical cruising altitude of commercial aircraft). This zone is of great importance as it encompasses the core regions of human activity, including areas of habitation and production. Recently, deep clustering has emerged as an indispensable analytical tool within Vicinagearth, instrumental in unveiling complex patterns and structures in data from the vicinal space. The diverse applications of deep clustering in this zone include anomaly detection, environmental monitoring, community detection, person re-identification, and more.

Anomaly Detection, also known as Outlier Detection [14] or Novelty Detection [19], attempts to identify abnormal instances or patterns. In the context of Vicinagearth, deep clustering proves valuable for analyzing sensor data obtained from diverse sources such as underwater monitoring systems, aerial sensors, or ground-based sensors [10]. By analyzing the patterns and typical behaviors in the sensor data, such systems become adept at detecting anomalies, which may signal security threats or irregular activities.

Environmental Monitoring involves the analysis of data collected from environmental sensors [103], such as monitoring air quality, water conditions, and geological factors. The primary goal is to ensure the health of ecosystems [101] and detect potential environmental threats, such as pollution events or natural disasters. Deep clustering techniques play a crucial role in grouping similar environmental patterns, facilitating the identification of abnormalities. This application contributes to real-time environmental monitoring [46], enhancing the ability to respond promptly to environmental challenges.

Community Detection [20, 40] involves evaluating how groups of nodes are clustered or partitioned and their tendency to strengthen or break apart within a network. In the context of Vicinagearth, this technique is applied to identify groups of species [68] that interact closely or share similar ecological niches. Deep clustering plays a pivotal role in the analysis of complex ecological networks [65], contributing to a deeper understanding of ecological communities and their dynamics.

Person Re-identification [100, 113] is a crucial task that involves recognizing and matching individuals across different camera views [111]. This technology plays a significant role in public safety and law enforcement, as it helps monitor densely populated areas for potential threats or subjects on a watchlist. The integration of deep clustering algorithms has remarkably improved the scalability and efficiency [107] of person re-identification systems, enabling them to manage the complexities of large and dynamically changing crowds. Furthermore, the adaptability of deep clustering techniques broadens their use to the monitoring of natural habitats and the tracking of wildlife in diverse and uncontrolled settings.

6 Future challenges

Although existing works achieve remarkable performance, some practical challenges and emerging requirements have yet to be fully addressed. In this section, we delve into some future directions of modern deep clustering.

6.1 Fine-grained clustering

The objective of fine-grained clustering is to discern subtle and intricate variations within data, which is particularly advantageous in research such as the identification of biological subspecies [54, 55]. The primary challenge is that fine-grained classes exhibit a high degree of similarity, where distinctions often lie in coloration, markings, shape, or other subtle characteristics. In such scenarios, traditional coarse-grained clustering priors frequently prove inadequate; for instance, color and shape augmentations under the augmentation invariance prior become ineffective. Recently, C3-GAN [41] employs contrastive learning within adversarial training to generate lifelike images, enabling the nuanced capture of fine-grained details and ensuring the separability between clusters.

6.2 Non-parametric clustering

Many clustering methods require a predefined and fixed number of clusters. However, real-world datasets often come with an unknown number of clusters. Only a few works [11, 87, 98, 120] have been devoted to this problem, and these methods often rely on calculating global similarities, introducing huge computational costs, especially on large-scale datasets. Therefore, efficiently determining the optimal cluster number C remains an open challenge, often involving the incorporation of human priors. Among existing works, DeepDPM introduces the Dirichlet Process Gaussian Mixture Model (DPGMM) [3], which uses the Dirichlet Process as the prior distribution over mixture components. DeepDPM dynamically adjusts the number of clusters C through split and merge operations guided by the Metropolis-Hastings framework [28].

6.3 Fair clustering

Collecting real-world datasets from diverse sources with various acquisition methods can enhance the generalization of machine learning models. However, these datasets frequently manifest inherent biases, notably in sensitive attributes such as gender, race, and ethnicity. These biases introduce disparities among individuals and minority groups, leading to cluster partitions that deviate from the underlying objective characteristics of the data. The pursuit of fairness is particularly pertinent in applications where unbiased and equitable analyses are crucial, such as employment, healthcare, and education. To tackle this challenge, fair clustering seeks to mitigate the influence of these biases given the biased attributes of each sample.

To address this daunting task, [12] first introduces a data pre-processing method known as fairlet decomposition. Recent advancements address this issue on large-scale data through adversarial training [50] and mutual information maximization [114]. Notably, [114] designs a novel metric that assesses both clustering quality and fairness from the perspective of information theory. Despite these developments, there is still room for improvement, and the establishment of better evaluation metrics remains a continuing direction of this research.

6.4 Multi-view clustering

Multi-view data [60, 105] is common in real-world situations where information is captured by a variety of sensors or observed from multiple angles. Such data is inherently rich, offering diverse yet consistent information. For example, an RGB view provides color details while the depth view reveals spatial information, representing the complementary aspects of the views. Simultaneously, there exists view consistency, as the same object possesses common attributes across different views. To deal with multi-view data, multi-view clustering [16, 60] is proposed to exploit both the complementary and consistent characteristics, with the goal of integrating information from all views to produce a unified and insightful clustering result.

Over recent years, several deep learning approaches [2, 78, 97, 119] have been developed to address this challenge. Binary multi-view clustering [118] simultaneously refines binary cluster structures alongside discrete data representations, ensuring cohesive clustering. In pursuit of view consistency, Lin et al. [56, 57] maximize the mutual information across views, thus aligning common properties. SURE [112] strengthens the consistency of shared features between views by utilizing a robust contrastive loss. Recently, Li et al. [49] employ a bound contrastive loss to preserve the view complementarity at the cluster level. These methodologies demonstrate the significant strides made in multi-view analysis, where clustering continues to play a pivotal role in the synergistic exploitation of multi-view data.

7 Conclusion

The key to deep clustering, and to unsupervised learning in general, is to seek effective supervision to guide representation learning. Different from traditional taxonomies based on network structure or data type, this survey offers a comprehensive review from the perspective of prior knowledge. With the evolution of clustering technologies, there is a discernible trend shifting from exploring priors within the data itself to leveraging external knowledge such as natural language guidance. The exploration of external pre-trained models such as ChatGPT or GPT-4V(ision) might emerge as a promising avenue. We hope this survey provides valuable insights and inspires further exploration and advancement in deep clustering.