A Survey on Deep Clustering: From the Prior Perspective

Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariance, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely used datasets and analyze the performance of methods with diverse priors. By offering this prior knowledge perspective, we hope this survey could provide novel insights and inspire future research in the deep clustering community.


Introduction
As a fundamental problem in machine learning, clustering aims at grouping data instances into several clusters, where instances from the same cluster share similar semantics and instances from different clusters are dissimilar. Clustering could reveal the inherent semantic structure underlying the data, which benefits downstream analyses such as anomaly detection (Saeedi Emadi and Mazinani, 2018), person re-identification (Ye et al, 2021), community detection (Su et al, 2022), and domain adaptation (Yang et al, 2023).
In the early stage, various classic clustering methods were developed, such as centroid-based clustering (MacQueen et al, 1967), density-based clustering (Ester et al, 1996a), and hierarchical clustering (Murtagh and Contreras, 2012). These shallow methods are grounded in theory and enjoy high interpretability. Later on, some works extended shallow clustering methods to diverse data types, such as multi-view (Zhang et al, 2015; Nie et al, 2016, 2017; Wang et al, 2018) and graph data (Newman and Girvan, 2004; Schaeffer, 2007). Other efforts have been made to improve the scalability (Zhang et al, 2023) of shallow clustering methods.
However, shallow clustering methods partition instances based on the similarity (MacQueen et al, 1967) or density (Ester et al, 1996a) of the given raw or linearly transformed data. Due to their limited feature extraction ability, shallow clustering methods achieve sub-optimal results when confronted with the complex, high-dimensional, and non-linear data found in the real world. To tackle this challenge, deep clustering techniques are proposed to incorporate neural networks into clustering methods. In other words, deep clustering simultaneously learns discriminative representations and performs clustering on the learned features, with the two processes progressively benefiting each other.
Over the past few years, many efforts have been devoted to improving the clustering performance from various aspects, such as network architectures (Caron et al, 2018; Nguyen et al, 2021), training strategies (Mukherjee et al, 2019), and loss functions (Jiang et al, 2016; Zhong et al, 2021). However, we would like to highlight that the fundamental challenge of deep clustering is the absence of data annotations. Consequently, the key to deep clustering lies in introducing proper prior knowledge to construct supervision signals. From the early data structure assumptions to the recent data augmentation invariance, the development of deep clustering methods intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods from the perspective of prior knowledge.
Inspired by traditional clustering and dimensionality reduction approaches (Roweis and Saul, 2000; Belkin and Niyogi, 2001), the early deep clustering methods (Huang et al, 2014; Peng et al, 2016; Shaham et al, 2018) build upon the structure prior of data. Based on the assumption that the inherent data structure could reflect the semantic relations, these methods incorporate classic manifold (Roweis and Saul, 2000) or subspace learning (Wright et al, 2010) objectives to optimize the neural network for feature extraction and clustering. The second type of prior knowledge is the distribution prior, which assumes that instances from different clusters follow distinct distributions. Based on such a prior, several generative deep clustering methods (Jiang et al, 2016; Mukherjee et al, 2019) propose to learn the latent distribution of samples for data partition. In the past few years, the success of contrastive learning spawned a new category of prior knowledge, namely, augmentation invariance. Instead of mining data priors, researchers turn to constructing additional priors with the data augmentation technique. Leveraging the invariance across differently augmented samples at both the instance representation and cluster assignment levels, numerous contrastive clustering methods (Ji et al, 2019; Li et al, 2021) significantly improve the feature discriminability and clustering performance. Further, researchers find that instances of the same semantics are likely to be mapped to nearby points in the latent space, and accordingly propose the neighborhood consistency prior. Specifically, by encouraging neighboring samples to have similar cluster assignments, several works (Van Gansbeke et al, 2020; Zhong et al, 2021) alleviate the false-negative problem in the contrastive clustering paradigm, thus advancing the clustering results. Another branch of progress is made based on the pseudo-label prior, namely, that cluster assignments with high confidence are likely to be correct. By selecting confident predictions as pseudo labels, several studies further boost the clustering performance through pseudo-labeling (Li et al, 2022; Qian, 2023) and semi-supervised learning (Niu et al, 2022). Very recently, instead of pursuing internal priors from the data itself, some works (Cai et al, 2023; Li et al, 2023b) attempt to introduce abundant external knowledge such as textual descriptions to guide clustering.
In summary, the essence of deep clustering lies in how to find and leverage effective prior knowledge for both feature extraction and cluster assignment. To provide an overview of the development of deep clustering, in this paper, we categorize a series of state-of-the-art approaches according to the taxonomy of prior knowledge. We hope such a new perspective on deep clustering could inspire future research in the community. The rest of this paper is organized as follows: First, Section 2 introduces the preliminaries of deep clustering. Section 3 reviews the existing deep clustering methods from the prior knowledge perspective. Then, Section 4 provides experimental analyses of deep clustering methods. After that, Section 5 briefly introduces some applications of deep clustering in vicinagearth security. Lastly, Section 6 summarizes some notable trends and challenges for deep clustering.

Related Surveys
We notice that several surveys on deep clustering have been proposed in recent years. Briefly, Min et al (2018) categorizes deep clustering methods according to the network architecture. Dong et al (2021) focuses on applications of deep clustering. Ren et al (2022) summarizes existing methods from the view of data types, such as single- and multi-view data. Zhou et al (2022) discusses various interactions between representation learning and clustering. Distinct from existing surveys, this work systematically provides a new perspective from the prior knowledge, which plays a more intrinsic and essential role in deep clustering.

Problem Definition
In this section, we introduce the pipeline of deep clustering, including the notation and problem definition. Unless otherwise specified, in this paper, we use bold uppercase and lowercase letters to denote matrices and vectors, respectively. The commonly used notations are summarized in Table 1.
The deep clustering problem is formally defined as follows: given a set of instances D = {x_i}_{i=1}^N ⊂ X that belong to C classes, deep clustering aims to learn discriminative features and group the instances into C clusters according to their semantics. Specifically, deep clustering methods first learn a deep neural network f : X → Z for feature extraction, i.e., z_i = f(x_i). Given instance features in the latent space, clustering results could be obtained in two ways. The most straightforward way is to apply classic algorithms such as K-means (MacQueen et al, 1967) and DBSCAN (Ester et al, 1996a) on the learned features. The other solution is to train an additional cluster head h : Z → R^C to produce the soft cluster assignment p_i = softmax(h(z_i)), which satisfies Σ_{j=1}^{C} p_ij = 1. The hard cluster assignment for the i-th instance could be computed by the arg max operation, namely,

ỹ_i = arg max_j p_ij.   (1)

The cluster assignments reveal the inherent semantic structure underlying the data, which could be utilized in various downstream analyses.
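The two-step pipeline above (soft assignment through a cluster head, then hard assignment through arg max) can be sketched as follows. This is only an illustration: the random logits stand in for the output of a learned cluster head h(z_i), which in practice comes from a trained network.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical setup: N = 4 instances, C = 3 clusters; the logits replace
# the output of the learned cluster head h for illustration only.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))

p = softmax(logits)        # soft assignments p_i, each row sums to 1
y_hard = p.argmax(axis=1)  # hard assignments via the arg max operation
```

Each row of `p` is a valid probability distribution over the C clusters, and `y_hard` gives the cluster index used for evaluation.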

Priors for Deep Clustering
In this section, we review existing deep clustering methods from the perspective of prior knowledge.
The priors are illustrated in Figure 1 and the method categorization is summarized in Table 2.

Structure Prior
Structure prior is mostly inspired by traditional clustering methods. Traditional clustering is mainly rooted in assumptions about the structural characteristics of clusters in the data space. For example, K-means (MacQueen et al, 1967) aims to learn k cluster centroids, which assumes that instances in each cluster form a spherical structure around its centroid. DBSCAN (Ester et al, 1996a) is based on the assumption that a cluster in the data space is a contiguous region of high point density, separated from other such clusters by regions of low point density. Spectral clustering (Belkin and Niyogi, 2001) assumes data lies on a locally linear manifold so that the local neighborhood relations should be preserved in the latent space; such methods partition instances according to the graph Laplacian. Agglomerative clustering (Gowda and Krishna, 1978) considers the hierarchical structure of data and performs clustering by progressively merging the closest clusters.

Given well-structured data in the latent space, ABDC (Song et al, 2013) iteratively optimizes the data representation and clustering centers, motivated by K-means. As deep extensions of classic spectral clustering, DEN (Huang et al, 2014), SpectralNet (Shaham et al, 2018), and MvLNet (Huang et al, 2019, 2021) compute the graph Laplacian in the latent space learned by an auto-encoder (Bengio et al, 2013) and SiameseNets (Hadsell et al, 2006; Shaham and Lederman, 2018), respectively. Likewise, DCC (Shah and Koltun, 2018) extends the core idea of RCC (Shah and Koltun, 2017) by performing relation matching based on the similarity between latent features; the auto-encoder is then optimized by minimizing the distance of paired instances in the latent space. PARTY (Peng et al, 2016) is the first deep subspace clustering method, which introduces the sparsity prior and the self-representation property of subspace learning to optimize neural networks. Motivated by the hierarchical structure of clusters, JULE (Yang et al, 2016) achieves agglomerative deep clustering by progressively merging clusters and optimizing the features.

Distribution Prior
Distribution prior refers to the assumption that instances of different semantics follow distinct data distributions. Such a prior gives rise to the generative deep clustering paradigm, which employs the variational auto-encoder (VAE) (Kingma and Welling, 2013) and the generative adversarial network (GAN) (Goodfellow et al, 2014) to learn the underlying distribution. Instances generated from similar distributions are then grouped together to achieve clustering.
VaDE (Jiang et al, 2016) is the first deep generative clustering method, which models different data distributions by fitting a Gaussian mixture model (GMM) in the latent space. To generate an instance, VaDE first samples a cluster from p(c), draws a latent vector from p(z | c), and then reconstructs the instance in the input space via p(x | z). The cluster assignment and neural network are jointly optimized by maximizing the log-likelihood of each instance, i.e.,

log p(x) = log ∫_z Σ_c p(x | z) p(z | c) p(c) dz.   (2)

Fig. 2: The framework of distribution-prior-based methods. In addition to the standard continuous latent variable z_n, generative deep clustering methods further introduce a discrete variable z_c to capture the cluster information.
Since directly computing Eq. 2 is intractable, the optimization is approximated by maximizing the evidence lower bound (ELBO) of the variational inference objective, namely,

L_ELBO(x) = E_{q(z,c|x)} [ log ( p(x, z, c) / q(z, c | x) ) ],   (3)

where q(z, c | x) is the variational posterior, which approximates the real posterior. The reparameterization trick introduced in VAE (Kingma and Welling, 2013) is adopted to make the sampling process differentiable.
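To illustrate the distribution prior in isolation (VaDE learns the mixture jointly with a deep encoder, which this sketch omits), the following numpy-only example runs EM for a two-component spherical GMM on synthetic "latent" features and reads hard cluster assignments from the responsibilities q(c | z). All data and parameters here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "latent" features: two well-separated blobs of 50 points each.
z = np.vstack([rng.normal(-3, 0.5, size=(50, 2)),
               rng.normal(+3, 0.5, size=(50, 2))])

# A few EM iterations for a 2-component spherical GMM.
mu = z[[0, -1]].copy()   # initial means, one point from each blob
var = np.ones(2)         # spherical variances
pi = np.full(2, 0.5)     # mixing weights p(c)
for _ in range(20):
    # E-step: responsibilities q(c | z), computed in log space for stability.
    d2 = ((z[:, None, :] - mu[None]) ** 2).sum(-1)
    logp = np.log(pi) - d2 / (2 * var) - np.log(var)
    q = np.exp(logp - logp.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    # M-step: re-estimate mixture parameters from the responsibilities.
    nk = q.sum(0)
    pi = nk / len(z)
    mu = (q.T @ z) / nk[:, None]
    var = (q * d2).sum(0) / (2 * nk)

labels = q.argmax(1)  # hard cluster assignments from q(c | z)
```

Since the blobs are well separated, the responsibilities become near one-hot and each blob ends up in its own component.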
Though GMM could effectively distinguish distributions, its Gaussian components are proved to be redundant, which harms the discriminability between different clusters (Gurumurthy et al, 2017). As an improvement, ClusterGAN (Mukherjee et al, 2019), built upon DCGAN (Radford et al, 2015), proposes to adopt GAN to implicitly learn the latent distributions. Specifically, in addition to the continuous latent variable z_n, it introduces a one-hot encoding z_c to capture the cluster distribution during generation. The objective function of ClusterGAN is formulated as follows:

min_{G,E} max_D E_{x∼p_x}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))] + β_n E_{z∼p_z}[ ||z_n − E_n(G(z))||_2^2 ] + β_c E_{z∼p_z}[ H(z_c, E_c(G(z))) ],   (4)

where z = (z_n, z_c) is the mixed latent variable, E is the inverse network which maps data from the raw space to the latent space, H(·, ·) denotes the cross-entropy, and β_n, β_c are the weight parameters. The first two terms are consistent with the standard GAN. The last two clustering-specific terms encourage a more distinct cluster distribution and map inputs to the latent space to achieve clustering.

Fig. 3: The framework of augmentation-invariance-based methods. Diverse transformations are first applied to augment the input data x, after which a shared deep neural network is utilized to extract features. The augmented samples of the same instance are encouraged to have similar features and cluster assignments.

Augmentation Invariance
In recent years, image augmentation methods (Shorten and Khoshgoftaar, 2019) have gained widespread attention, grounded in the prior that augmentations of the same instance preserve consistent semantic information. This augmentation-invariance character inspires exploration of how to leverage the positive pairs (i.e., different augmentations of the same image) with similar semantic information. Notably, mutual-information-based methods and contrastive-learning-based methods have emerged as pioneers in the realm of deep clustering. In this section, we delve into the fundamental concepts and related works of both mutual-information-based and contrastive-learning-based methods.
Firstly, mutual information is a measure of the dependence between two random variables X and Y. Formally,

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ],   (5)

where p(x, y) is the joint probability mass function of X and Y, and p(x) and p(y) are the marginal probability mass functions of X and Y, respectively. In the context of information theory, maximizing the mutual information between the variables of positive instances could enhance the optimization of clustering-related information.
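Eq. 5 can be computed directly from a joint probability table. The following sketch (an illustration, not code from any surveyed method) evaluates I(X; Y) in nats for a perfectly dependent pair and an independent pair of binary variables.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ], in nats."""
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = joint > 0                       # convention: 0 log 0 = 0
    return float((joint[mask] * np.log(joint[mask] / (px * py)[mask])).sum())

# Perfectly dependent binary variables: I(X;Y) = H(X) = log 2.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])
# Independent binary variables: I(X;Y) = 0.
independent = np.array([[0.25, 0.25], [0.25, 0.25]])

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)
```

As expected, the dependent table yields I(X; Y) = log 2 ≈ 0.693 nats, while the independent table yields zero.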
IMSAT (Hu et al, 2017) stands as a typical information-theoretic approach to deep clustering. Its fundamental concept includes enforcing invariance on pair-wise augmented instances and achieving unambiguous and uniform cluster assignments. Specifically, IMSAT encourages the predictions of augmented instances to closely match those of the original instances, i.e.,

L_SAT = − Σ_i Σ_k p_ik log p′_ik,   (6)

where p′ denotes the predictions of the augmented instances. This term can be viewed as maximizing the mutual information between data and its augmentations. Besides, IMSAT implements regularized information maximization for deep clustering, inspired by RIM (Krause et al, 2010), to keep the cluster assignments unambiguous and uniform. Specifically, IMSAT seeks to maximize the mutual information between instances and their cluster assignments, expressed as

I(X; Y) = H(Y) − H(Y | X),   (7)

where H(·) and H(·|·) denote the entropy and conditional entropy, respectively, and the cluster marginal is estimated as p_{·k} = (1/N) Σ_i p_ik. Increasing the first term (the marginal entropy H(Y)) encourages uniform cluster assignments, i.e., the number of instances in each cluster tends to be the same. Conversely, decreasing the second term (the conditional entropy H(Y | X)) encourages each instance to be unambiguously assigned to a certain cluster.
IIC (Ji et al, 2019) and Completer (Lin et al, 2021, 2022) take a further step in exploring the mutual information between instances and their augmentations. The fundamental concept is to maximize the mutual information between the cluster assignments of pair-wise augmented instances. Specifically, IIC achieves semantically meaningful clustering and avoids trivial solutions by maximizing the mutual information between the cluster assignments,

max I(z, z′) = Σ_{c=1}^{C} Σ_{c′=1}^{C} P_{cc′} log [ P_{cc′} / (P_c P_{c′}) ],   (8)

where z and z′ are the cluster assignment variables of the original instance x and its augmentation x′, respectively. The joint distribution of z and z′ is given by the matrix P ∈ R^{C×C}, constituted by

P = (1/N) Σ_{i=1}^{N} p(x_i) p(x′_i)^T,   (9)

where P_{cc′} = P(z = c, z′ = c′) denotes the element at the c-th row and c′-th column. Additionally, the marginals P_c = P(z = c) and P_{c′} = P(z′ = c′) can be obtained by summing over the rows and columns of this matrix. Notably, IIC stands out as one of the earliest deep clustering frameworks designed entirely under information theory, distinguishing itself from IMSAT.

Similar to mutual-information-based methods, contrastive-learning-based methods treat instances augmented from the same instance as positive samples and the rest as negative samples. Let z_{2i} and z_{2i−1} represent the two augmented representations of the i-th instance; the contrastive loss is formulated as

L = (1/2N) Σ_{i=1}^{N} [ ℓ(2i−1, 2i) + ℓ(2i, 2i−1) ],  ℓ(i, j) = −log [ exp(s(z_i, z_j)/τ) / Σ_{k≠i} exp(s(z_i, z_k)/τ) ],   (10)

where ℓ(i, j) represents the pairwise contrastive loss and τ controls the temperature of the softmax. The function s(z_i, z_j) denotes the similarity between representations z_i and z_j. This loss pulls the representations of positive pairs closer while separating them from negative instances, encouraging meaningful clustering patterns.
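The IIC objective can be sketched in a few lines: build the C × C joint matrix from the soft assignments of the two views, symmetrize it, and evaluate the mutual information of that joint. The assignments below are hypothetical; this is an illustration of the objective, not the IIC implementation.

```python
import numpy as np

def iic_objective(p, p_prime, eps=1e-12):
    """Mutual information of the C x C joint matrix P = (1/N) sum_i p_i p'_i^T,
    symmetrized, as in the IIC objective."""
    P = p.T @ p_prime / len(p)
    P = (P + P.T) / 2                    # enforce symmetry across the two views
    Pc = P.sum(axis=1, keepdims=True)    # row marginal P(z = c)
    Pc_ = P.sum(axis=0, keepdims=True)   # column marginal P(z' = c')
    return float((P * (np.log(P + eps)
                       - np.log(Pc + eps)
                       - np.log(Pc_ + eps))).sum())

# Hypothetical soft assignments for N = 4 instances over C = 2 clusters.
p = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
agree = iic_objective(p, p)                       # consistent views: high MI
low = iic_objective(p, np.full((4, 2), 0.5))      # uninformative view: MI ~ 0
```

Consistent assignments across views produce a diagonal-heavy joint and hence high mutual information, while an uninformative second view factorizes into its marginals and yields (near) zero.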
Notably, some theoretical works (Oord et al, 2018; Moskalev et al, 2022; Lin et al, 2023) have demonstrated that contrastive learning is equivalent to maximizing the mutual information at the instance level. Motivated by this observation, researchers have further explored the application of the contrastive loss at the cluster level, which proves beneficial for deep clustering. PICA (Huang et al, 2020) is one of the pioneering works in this domain. Its fundamental concept is to maximize the similarity between the cluster assignments of original and augmented data, which can be likened to conducting contrastive learning (Liu et al, 2022) at the cluster level. Motivated by PICA, CC (Li et al, 2021) and DRC (Zhong et al, 2020) conduct contrastive learning at both the instance and cluster levels. Specifically, the cluster-level contrastive loss helps learn discriminative cluster assignments, which is the key to the clustering task. Formally, the cluster-level contrastive loss is

L_clu = (1/2C) Σ_{i=1}^{C} [ ℓ(y_i, y′_i) + ℓ(y′_i, y_i) ] − H(Y),   (11)

where y_i ∈ R^{1×N} is the i-th cluster-level representation (i.e., the i-th column of the assignment matrix of an augmented view), τ is the cluster-level temperature parameter, and H(Y) is the entropy of the cluster assignment probabilities of the two augmentations. The inclusion of H(Y) helps avoid the trivial solution where most instances are assigned to the same cluster. Notably, the utilization of contrastive learning at the cluster level in CC and DRC has inspired subsequent works in the field.
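The cluster-level contrastive idea treats each column of the assignment matrix (one column per cluster) as a representation, and contrasts matching columns across the two augmented views. The sketch below is a simplified illustration of this mechanism with cosine similarity, not the exact CC/DRC implementation, and it omits the entropy term of Eq. 11.

```python
import numpy as np

def cluster_contrastive_loss(y, y_prime, tau=1.0):
    """Cluster-level contrastive loss sketch: the i-th columns of the two
    N x C assignment matrices form a positive pair; all other columns
    from both views act as negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    C = y.shape[1]
    cols = [y[:, i] for i in range(C)] + [y_prime[:, i] for i in range(C)]
    loss = 0.0
    for i in range(C):
        a, b = cols[i], cols[C + i]  # anchor and its positive
        denom = sum(np.exp(cos(a, c) / tau)
                    for j, c in enumerate(cols) if j != i)
        loss += -np.log(np.exp(cos(a, b) / tau) / denom)
    return loss / C

# Hypothetical hard-ish assignments for N = 6 instances, C = 2 clusters.
y = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], float)
loss_same = cluster_contrastive_loss(y, y)             # views agree
loss_shuf = cluster_contrastive_loss(y, y[:, ::-1].copy())  # columns swapped
```

When the two views produce consistent cluster assignments the loss is low; swapping the cluster columns of one view breaks the positive pairs and raises it.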
TCC (Shen et al, 2021) takes a step further in exploring the interaction between instance-level and cluster-level representations. The core idea is to leverage a unified representation that combines cluster semantics and instance features, enhancing the representation with cluster information to facilitate the clustering task. Formally, for an instance representation z_i, the enhanced representation ẑ_i is obtained by combining z_i with NN_θ(c_i), where c_i represents the cluster assignment of the i-th instance after the Gumbel-Softmax, and NN_θ(·) denotes a single fully connected network serving as the learnable cluster representation. Different from CC, which applies the contrastive loss to cluster assignments, TCC conducts contrastive learning on the unified representation to better capture cluster semantics. Inspired by TCC, some works (Xu et al, 2022; Li et al, 2023a) explore the fusion of instance-level and cluster-level representations in various domains.

Fig. 4: The framework of neighborhood-consistency-based methods. Such a paradigm encourages neighboring samples z_i and z_p in the latent space to have consistent features and cluster assignments, which improves the compactness of clusters.

Neighborhood Consistency
Thanks to the advancements in self-supervised representation learning, the features acquired through discriminative pretext tasks can unveil high-level semantics in the latent space. This provides a crucial prior for clustering, as instances and their neighbors in the latent space are likely to belong to the same semantic cluster. Leveraging such neighborhood-consistent semantics can further enhance clustering.
SCAN (Van Gansbeke et al, 2020) first observes that similar instances are mapped closely in the latent space by self-supervised pretext tasks. Motivated by this observation, SCAN trains a cluster head based on the consistency of cluster assignments within neighborhoods. Specifically, SCAN first obtains an encoder f through a pretext task (Gidaris et al, 2018; Zhang et al, 2019; Wu et al, 2018; He et al, 2020). It then optimizes a cluster head h by requiring it to make consistent predictions between instances and their nearest neighbors:

L = −(1/N) Σ_{i=1}^{N} Σ_{j∈N_i^k} log ⟨p_i, p_j⟩ + λ Σ_{c=1}^{C} p_{·c} log p_{·c},   (13)

where N_i^k denotes the k-nearest neighbors of the i-th instance and p_{·c} is the average assignment probability of cluster c. The second term in Eq. 13 prevents h from assigning all instances to a single cluster, which is also used in Eq. 11.
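The SCAN-style objective can be illustrated numerically: the consistency term rewards a high dot product between each instance's soft assignment and those of its neighbors, while the entropy term penalizes collapsed cluster marginals. The assignments and neighbor lists below are hypothetical, and the sketch follows the reconstructed form of Eq. 13 rather than the official SCAN code.

```python
import numpy as np

def scan_loss(p, neighbors, lam=1.0, eps=1e-12):
    """SCAN objective sketch: neighbor-consistency term plus a (negative)
    entropy term on the cluster marginal; lower is better."""
    consistency, count = 0.0, 0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            consistency -= np.log(p[i] @ p[j] + eps)  # reward agreeing neighbors
            count += 1
    consistency /= count
    mean_p = p.mean(axis=0)                           # cluster marginal
    entropy = -(mean_p * np.log(mean_p + eps)).sum()
    return consistency - lam * entropy

# Hypothetical soft assignments; in p_good each instance agrees with its
# nearest neighbor, in p_bad the neighbor pairs disagree.
p_good = np.array([[0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.05, 0.95]])
p_bad  = np.array([[0.95, 0.05], [0.1, 0.9], [0.9, 0.1], [0.05, 0.95]])
nbrs = [[1], [0], [3], [2]]

loss_good = scan_loss(p_good, nbrs)
loss_bad = scan_loss(p_bad, nbrs)
```

Neighbor-consistent assignments yield a clearly lower loss than inconsistent ones, which is exactly the signal the cluster head is trained on.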
NNM (Dang et al, 2021) and GCC (Zhong et al, 2021) incorporate neighbor information into the contrastive learning framework to group instances within neighborhoods. In particular, NNM aligns the cluster assignment of each instance with those of its neighbors through cluster-level contrastive learning:

L = −(1/C) Σ_{i=1}^{C} log [ exp(s(q_i, q_i^N)/τ) / Σ_{j=1}^{C} exp(s(q_i, q_j^N)/τ) ],   (14)

where q, q^N ∈ R^{C×B} represent the transposes of the assignment matrices p and p^N of instances and their neighbors, respectively. In contrast, GCC introduces the graph structure of the latent space to modify the vanilla instance-level contrastive loss. It constructs a normalized symmetric graph Laplacian based on the k-NN graph:

L = I − D^{−1/2} A D^{−1/2},   (15)

where A is the adjacency matrix of the k-NN graph and D is its degree matrix. Then, the loss function takes the following form:

ℓ_i = −log [ exp(s(z_i, z_i^N)/τ) / Σ_{j≠i} exp(s(z_i, z_j)/τ) ],   (16)

where τ is the temperature and z_i^N is drawn from the graph neighborhood of z_i. The graph Laplacian guides the model to attract instances within neighborhoods rather than merely augmentations of themselves, so the influence of potential false-negative samples (Yang et al, 2021, 2022b) can be mitigated. As a result, GCC better minimizes the intra-cluster variance and maximizes the inter-cluster variance. The success of this approach has inspired numerous contrastive learning methods (Huynh et al, 2022; Lu et al, 2024) in various domains to leverage neighbor relationships, which effectively addresses the false-negative challenge.

Pseudo-Labeling
DEC (Xie et al, 2016) is a pioneering work that utilizes labels generated by itself to simultaneously enhance feature representations and optimize cluster assignments. DEC initializes with a pre-trained auto-encoder and C learnable cluster centroids. The soft assignment is calculated using the Student's t-distribution, based on the distance between the representation z_i and the centroid c_j:

q_ij = (1 + ||z_i − c_j||^2 / α)^{−(α+1)/2} / Σ_{j′} (1 + ||z_i − c_{j′}||^2 / α)^{−(α+1)/2},   (17)

where α is a hyper-parameter and q_ij denotes the probability of assigning instance i to cluster j. DEC refines the clusters by emphasizing high-confidence assignments and making predictions more confident. Specifically, DEC uses the second power of q_i as a sharpened assignment to guide the training, i.e.,

p_ij = (q_ij^2 / f_j) / Σ_{j′} (q_{ij′}^2 / f_{j′}),   (18)

where f_j = Σ_i q_ij is the soft cluster frequency; the sharpened assignment is normalized by f_j to prevent feature collapse. Finally, a KL divergence loss between p and q minimizes the distance between the two distributions, i.e., L = KL(p ∥ q).
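DEC's soft assignment and sharpened target can be computed in a few lines of numpy. The embeddings and centroids below are synthetic placeholders; a real DEC run would obtain them from the pre-trained auto-encoder.

```python
import numpy as np

def dec_soft_assignment(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij between embeddings and centroids."""
    d2 = ((z[:, None, :] - centroids[None]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(axis=1, keepdims=True)

def dec_target(q):
    """Sharpened target p_ij = (q_ij^2 / f_j) / sum_j' (q_ij'^2 / f_j'),
    where f_j = sum_i q_ij is the soft cluster frequency."""
    f = q.sum(axis=0)
    p = q ** 2 / f
    return p / p.sum(axis=1, keepdims=True)

# Synthetic embeddings and centroids for illustration only.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])

q = dec_soft_assignment(z, centroids)
p = dec_target(q)
```

Training then minimizes KL(p ∥ q), pulling the soft assignments q toward the sharpened, frequency-normalized target p.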
Another notable pseudo-labeling method is DeepCluster (Caron et al, 2018). This approach employs K-means clustering on the learned representations to obtain cluster assignments as pseudo-labels. DeepCluster iteratively performs representation learning and clustering in a mutually beneficial manner to bootstrap each other. However, DeepCluster faces limitations in achieving outstanding performance, primarily due to the restricted semantics of the initial representation. Similar to DeepCluster, ProPos (Huang et al, 2022) proposes an EM framework for pseudo-labeling, iteratively performing K-means to obtain pseudo labels (E-step) and updating the representation (M-step). Notably, ProPos significantly outperforms DeepCluster and other methods because it performs K-means on features learned by the state-of-the-art self-supervised paradigm BYOL (Grill et al, 2020). This observation demonstrates that the semantics of the representation is vital to pseudo-label generation and clustering. Low-quality features would introduce noise into the pseudo-labels, impact subsequent pseudo-label generation, and mislead representation learning, accumulating error throughout the process.
In addition to the progression of self-supervised paradigms, researchers are actively investigating strategies to alleviate the issue of error accumulation in pseudo-labeling. To be specific, the challenges in pseudo-labeling deep clustering remain two-fold: enhancing the accuracy of the generated pseudo-labels and maximizing the utility of these pseudo-labels for effective clustering. On the one hand, inaccurate pseudo-labels pose a risk of degrading the clustering performance. On the other hand, determining how to effectively leverage these pseudo-labels for clustering is a critical consideration. These two challenges underscore the ongoing efforts in pseudo-labeling for deep clustering.
The first challenge has been addressed by many works through carefully designed selection methods. For instance, SCAN (Van Gansbeke et al, 2020) empirically observed that instances with highly confident predictions (i.e., max(p_i) ≈ 1) tend to be correctly clustered by the cluster head. Building on this insight, SCAN chooses the instances with the most confident predictions as labeled data to fine-tune the model using the cross-entropy loss,

L = − Σ_{i: max(p_i) > η} log p_{i, ỹ_i},   (19)

where η is the threshold hyper-parameter that filters out uncertain instances. TCL (Li et al, 2022) and SPICE (Niu et al, 2022) have devised more effective selection strategies to enhance the accuracy of pseudo-labeling. Specifically, TCL selects the most confident predictions as pseudo labels from each cluster c:

I_c = topK_i(p_ic),  I = ∪_{c=1}^{C} I_c,   (20)

where topK(·) returns the indices of the top-K most confident instances and I denotes the union of the pseudo labels from all clusters. Here K = γN/C and γ is the selection ratio. The cluster-wise selection leads to more class-balanced pseudo labels compared with threshold-based criteria, improving the clustering performance especially on challenging classes.
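The cluster-wise selection of Eq. 20 can be sketched as follows. The confidence matrix is hypothetical, and `round` is used here so that K = γN/C lands on the intended integer; this is an illustration of the strategy, not the TCL implementation.

```python
import numpy as np

def clusterwise_pseudo_labels(p, gamma=0.5):
    """Select the top-K most confident instances per cluster as pseudo
    labels, with K = gamma * N / C (cluster-wise, class-balanced)."""
    N, C = p.shape
    K = round(gamma * N / C)
    selected = {}
    for c in range(C):
        # Indices of the K instances with the highest confidence for cluster c.
        # Note: an instance could in principle be claimed by two clusters;
        # this sketch keeps the last claim for simplicity.
        top = np.argsort(-p[:, c])[:K]
        for i in top:
            selected[int(i)] = c
    return selected

# Hypothetical soft assignments for N = 6 instances, C = 2 clusters.
p = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4],
              [0.3, 0.7], [0.2, 0.8], [0.45, 0.55]])
labels = clusterwise_pseudo_labels(p, gamma=2/3)
```

With γ = 2/3 we get K = 2 per cluster, so each cluster contributes its two most confident instances regardless of how confident the other cluster's candidates are, which keeps the pseudo-label set class-balanced.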
SPICE introduces a prototype-based pseudo-labeling approach. Specifically, it first re-computes the centroid of each cluster using only the instances with confident predictions, and then re-assigns each instance a new pseudo label according to its similarity to the new centroids, formally,

μ_c = (1/|S_c|) Σ_{i∈S_c} z_i,  ỹ_i = arg max_c s(z_i, μ_c),   (21)

where S_c denotes the set of confidently predicted instances in cluster c. This operation helps mitigate the influence of potentially incorrect pseudo labels in the centroid calculation, which might otherwise accumulate errors in the iterative self-training process.
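The prototype-based re-labeling step can be illustrated as below. The features, predictions, and threshold are synthetic, and the sketch assumes every cluster retains at least one confident instance (otherwise its centroid would be undefined); it demonstrates the mechanism rather than the SPICE implementation.

```python
import numpy as np

def prototype_relabel(z, p, threshold=0.8):
    """Recompute centroids from confidently predicted instances only,
    then reassign every instance to its most cosine-similar centroid."""
    hard = p.argmax(axis=1)
    confident = p.max(axis=1) >= threshold
    C = p.shape[1]
    # Assumption: each cluster has at least one confident instance.
    centroids = np.stack([z[confident & (hard == c)].mean(axis=0)
                          for c in range(C)])
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (zn @ cn.T).argmax(axis=1)

# Synthetic features forming two directions; the last two instances have
# low-confidence (and for index 5, wrong-leaning) predictions.
z = np.array([[2.0, 0.0], [1.8, 0.2], [0.2, 1.9],
              [0.0, 2.0], [1.5, 0.1], [0.1, 1.6]])
p = np.array([[0.9, 0.1], [0.95, 0.05], [0.1, 0.9],
              [0.05, 0.95], [0.6, 0.4], [0.55, 0.45]])
labels = prototype_relabel(z, p)
```

Only the four confident instances shape the centroids, and the re-assignment then corrects the uncertain instance at index 5, whose initial prediction leaned toward the wrong cluster.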
To address the second challenge, i.e., better utilizing the confident labels, TCL removes negative pairs with the same label in the contrastive loss, preventing intra-class instances from being pushed apart, i.e., the false-negative issue. Meanwhile, SPICE and TCL adopt semi-supervised classification techniques like FixMatch (Sohn et al, 2020) that impose pseudo-label consistency across strong augmentations of the same instance. The marvelous results achieved by these works show the effectiveness of combining reliable pseudo-labeling methods and semi-supervised paradigms in clustering.

Fig. 6: The framework of external-knowledge-based methods. Instead of mining internal priors from the samples themselves, such a paradigm seeks external information like textual semantics to help distinguish the given samples.

External Knowledge
Most clustering approaches group data based on inherent characteristics such as structure priors, distribution priors, and augmentation invariance priors. Instead of pursuing internal priors from the data itself, some recent works (Cai et al, 2023; Li et al, 2023b) attempt to introduce abundant external knowledge such as textual descriptions to guide clustering. These methods prove effective because the semantic information from natural language offers valuable supervisory signals that enhance the quality of clustering.

SIC (Cai et al, 2023) is one of the first works to incorporate external knowledge into clustering. Its fundamental concept revolves around generating image pseudo-labels from a textual space pre-trained by CLIP (Radford et al, 2021). The process involves three main steps: i) Construction of the semantic space: SIC selects meaningful texts resembling category names to build a semantic space. ii) Pseudo-labeling: pseudo-labels are generated from the text semantic centers h and image representations z_i, formally,

y_i = one-hot( arg max_{l∈[1,c]} s(z_i, h_l) ),   (22)

where c is the number of semantic centers, h_l is the l-th semantic center, and the one-hot operator generates a c-bit one-hot vector. The pseudo-labels are then utilized to guide the clustering similar to SCAN (Van Gansbeke et al, 2020) through a cross-entropy objective, where CE(·) is the cross-entropy function. iii) Consistency learning: the clustering is further enhanced by enforcing consistency between images and their neighbors in the image space, where j is an instance index randomly selected from the k-nearest neighbors N_k(z_i) of the i-th instance. Note that SIC essentially pulls image embeddings closer to embeddings in the semantic space, while ignoring the improvement of the text semantic embeddings.

TAC (Li et al, 2023b) focuses on leveraging textual semantics to enhance the feature discriminability. Specifically, it retrieves a text counterpart among representative nouns for each image, which improves K-means performance without any additional training. Besides, TAC proposes a mutual distillation paradigm to incorporate the image and text modalities, which further improves the clustering performance. The cross-modal mutual distillation strategy is formulated as follows:

ℓ_i = −log [ exp(s(p_i, q_i^N)/τ) / Σ_{j=1}^{C} exp(s(p_i, q_j^N)/τ) ] − log [ exp(s(q_i, p_i^N)/τ) / Σ_{j=1}^{C} exp(s(q_i, p_j^N)/τ) ],   (23)

where τ is the softmax temperature parameter, p_i, q_i ∈ R^{1×N} are the i-th columns of the image and text assignment matrices, and p_i^N, q_i^N ∈ R^{1×N} are the i-th columns of the image and text random nearest neighbor matrices. The mutual distillation strategy has two advantages. On the one hand, it generates more discriminative clusters through the cluster-level contrastive loss. On the other hand, it encourages consistent cluster assignments between each sample and its cross-modal neighbors, which bootstraps the clustering performance in both modalities.

Experiment
In this section, we introduce the evaluation of deep clustering. Briefly, we first present the evaluation metrics and common benchmarks, and then analyze the results of existing deep clustering methods.

Evaluation Metrics
For clustering evaluation, three metrics are commonly used to measure how well the predicted cluster assignments ỹ match the ground-truth labels y: accuracy (ACC), normalized mutual information (NMI), and adjusted rand index (ARI). A higher value of each metric corresponds to better clustering performance. The definitions of the three metrics are as follows:

• ACC (Amigó et al, 2009) indicates the correct rate of clustering predictions:

ACC = (1/N) Σ_{i=1}^{N} 1{y_i = map(ỹ_i)},   (24)

where the Hungarian matching (Kuhn, 1955) is first applied to find the best mapping map(·) that aligns the predictions and labels.

• NMI (McDaid et al, 2011) quantifies the mutual information between the predicted labels Ỹ and the ground-truth labels Y:

NMI = 2 I(Ỹ; Y) / (H(Ỹ) + H(Y)),   (25)

where H(Y) denotes the entropy of Y and I(Ỹ; Y) denotes the mutual information between Ỹ and Y.

• ARI (Hubert and Arabie, 1985) is the normalization of the rand index (RI), which counts the pairs of instances consistently assigned to the same or different clusters:

RI = (TP + TN) / C_N^2,   (26)

where TP and TN refer to the numbers of true-positive pairs and true-negative pairs, and C_N^2 is the number of possible instance pairs. ARI is computed with the following normalization:

ARI = (RI − E(RI)) / (max(RI) − E(RI)),   (27)

where E(RI) denotes the expectation of RI.
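As a small illustration of the label-alignment step behind ACC, the sketch below brute-forces the best one-to-one mapping between predicted and true labels; this is fine for a handful of clusters, while the Hungarian matching (Kuhn, 1955) is the scalable alternative used in practice. The label arrays are synthetic examples.

```python
import numpy as np
from itertools import permutations

def clustering_acc(y_true, y_pred):
    """ACC under the best one-to-one label mapping, found by brute force
    over permutations (illustrative; use Hungarian matching at scale)."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    best = 0.0
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))
        acc = np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
        best = max(best, acc)
    return best

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])    # same partition, permuted labels
y_pred2 = np.array([1, 1, 0, 2, 2, 2])   # one instance mis-clustered

acc = clustering_acc(y_true, y_pred)
```

Because clustering is label-permutation invariant, the relabeled but identical partition scores a perfect ACC, while the partition with one mis-clustered instance scores 5/6.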

Datasets
In the early stage, deep clustering methods are evaluated on relatively small and low-dimensional datasets (e.g., COIL-20 (Nene et al, 1996)). Apart from them, some recent works employ two more challenging large-scale datasets, Tiny-ImageNet (Le and Yang, 2015) and ImageNet-1K (Deng et al, 2009), to evaluate effectiveness and efficiency. A brief description of these datasets is summarized in Table 3.

Performance Comparisons
The clustering performance on five widely used datasets is shown in Table 4. Thanks to the feature extraction ability of deep neural networks, early deep clustering methods based on structure and distribution priors achieve much better performance than the classic K-means. Then, a series of contrastive clustering methods significantly improve the performance by introducing additional priors through data augmentation. After that, more advanced methods boost the performance by further considering the neighborhood consistency (GCC compared with CC) and utilizing pseudo labels (SCAN compared with SCAN*). Notably, the performance gains of different priors are independent. For example, ProPos remarkably outperforms DEC and CC by additionally utilizing the augmentation invariance and pseudo-labeling priors, respectively. Very recently, external-knowledge-based methods achieved state-of-the-art performance, which proves the promising prospect of such a new deep clustering paradigm. In addition, clustering becomes more challenging when the category number grows (from CIFAR-10 to CIFAR-100) or the semantics become more complex (from CIFAR-10 to ImageNet-Dogs). Such results indicate that more challenging datasets such as the full ImageNet-1K are expected to serve as benchmarks in future works.

Application in Vicinagearth
In this section, we explore some typical applications of deep clustering within the domain of Vicinagearth, a term crafted from the fusion of "Vicinage" and "Earth". Vicinagearth represents the critical spatial expanse ranging from 1,000 meters below sea level (the depth at which sunlight ceases to penetrate) to 10,000 meters above sea level (the typical cruising altitude of commercial aircraft). This zone is of great importance as it encompasses the core regions of human activity, including areas of habitation and production. Recently, deep clustering has emerged as an indispensable analytical tool within Vicinagearth, instrumental in unveiling complex patterns and structures of data within the vicinal space. The diverse applications of deep clustering in this zone include anomaly detection, environmental monitoring, community detection, person re-identification, and more.
Anomaly Detection, also known as Outlier Detection (Comaniciu and Meer, 2002) or Novelty Detection (Ester et al, 1996b), attempts to identify abnormal instances or patterns. In the context of Vicinagearth, deep clustering proves valuable for analyzing sensor data obtained from diverse sources such as underwater monitoring systems, aerial sensors, or ground-based sensors (Chatterjee and Ahmed, 2022). Through the analysis of patterns and typical behaviors in the sensor data, the system becomes adept at detecting anomalies, which may signal security threats or irregular activities.
Environmental Monitoring involves the analysis of data collected from environmental sensors (Xia and Vlajic, 2007), such as monitoring air quality, water conditions, and geological factors. The primary goal is to ensure the health of ecosystems (Wu et al, 2016) and detect potential environmental threats, such as pollution events or natural disasters. Deep clustering techniques play a crucial role in grouping similar environmental patterns, facilitating the identification of abnormalities. This application contributes to real-time environmental monitoring (Kumar et al, 2012), enhancing the ability to respond promptly to environmental challenges.
Community Detection (Fortunato, 2010; Jin et al, 2021) involves evaluating how groups of nodes are clustered or partitioned and their tendency to strengthen or break apart within a network. In the context of Vicinagearth, this technique is applied to identify groups of species (Murdock and Yaeger, 2011) that interact closely or share similar ecological niches. Deep clustering plays a pivotal role in the analysis of complex ecological networks (Montoya et al, 2006), contributing to a deeper understanding of ecological communities and their dynamics.
Person Re-identification (Wu et al, 2019; Ye et al, 2021) is a crucial task that involves recognizing and matching individuals across different camera views (Yang et al, 2022a). This technology plays a significant role in public safety and law enforcement initiatives, as it helps to monitor densely populated areas for potential threats or subjects on a watchlist. The integration of deep clustering algorithms has remarkably improved the scalability and efficiency (Yan et al, 2023) of person re-identification systems. Deep clustering effectively enables the management of the complexities presented by large and dynamically changing crowds. Furthermore, the adaptability of deep clustering techniques broadens their use to include the monitoring of natural habitats and the tracking of wildlife in diverse and uncontrolled settings.

Future Challenges
Although existing works achieve remarkable performance, some practical challenges and emerging requirements have yet to be fully addressed. In this section, we delve into some future directions of modern deep clustering.

Fine-grained Clustering
The objective of fine-grained clustering is to discern subtle and intricate variations within data, which is particularly advantageous in research such as the identification of biological subspecies (Li et al, 2023c,d). The primary challenge is that fine-grained classes exhibit a high degree of similarity, where distinctions often lie in coloration, markings, shape, or other subtle characteristics. In such scenarios, traditional coarse-grained clustering priors frequently prove inadequate. For instance, color and shape augmentations in the augmentation invariance prior would become ineffective. Recently, C3-GAN (Kim and Ha, 2021) employs contrastive learning within adversarial training to generate lifelike images, enabling the nuanced capture of fine-grained details and ensuring the separability between clusters.

Non-parametric Clustering
Many clustering methods require a predefined and fixed number of clusters. However, real-world datasets often come with an unknown number of clusters. Only a few works (Chen, 2015; Shah and Koltun, 2018; Zhao et al, 2019; Wang et al, 2021) have been devoted to solving this problem. These methods often rely on calculating global similarity and introduce huge computational costs, especially on large-scale datasets. Therefore, efficiently determining the optimal cluster number C remains an open challenge, often involving the incorporation of human priors. Among existing works, DeepDPM introduces the Dirichlet Process Gaussian Mixture Model (DPGMM) (Antoniak, 1974), which uses the Dirichlet Process as the prior distribution over mixture components. DeepDPM dynamically adjusts the number of clusters C through split and merge operations guided by the Metropolis-Hastings framework (Hastings, 1970).
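For intuition on how a Dirichlet Process prior sidesteps a fixed C, scikit-learn's variational DPGMM can be run on toy data; this is a minimal sketch (the 0.05 weight threshold and the prior value are illustrative choices, not settings from DeepDPM):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Three well-separated Gaussian blobs; the model only receives an upper bound of 10.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=0)
dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # upper bound, not the answer
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,                   # small prior favors few clusters
    random_state=0,
).fit(X)

# The Dirichlet Process prior drives unused components toward zero weight,
# so counting the non-negligible weights yields an effective cluster number.
effective_k = int((dpgmm.weights_ > 0.05).sum())
print(effective_k)  # expected to recover roughly 3 on this toy data
```

The key design point is that the model's capacity (10 components) is decoupled from the inferred structure: the prior, not the user, decides how many components carry mass.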

Fair Clustering
Collecting real-world datasets from diverse sources with various acquisition methods can enhance the generalization of machine learning models. However, these datasets frequently manifest inherent biases, notably in sensitive attributes such as gender, race, and ethnicity. These biases would introduce disparities among individuals and minority groups, leading to cluster partitions that deviate from the underlying objective characteristics of the data. The pursuit of fairness is particularly pertinent in applications where unbiased and equitable analyses are crucial, such as employment, healthcare, and education. To tackle this challenge, fair clustering seeks to mitigate the influence of these biases given the biased attributes for each sample.
To address this daunting task, Chierichetti et al first introduced a data pre-processing method known as fairlet decomposition. Recent advancements address this issue on large-scale data through adversarial training (Li et al, 2020) and mutual information maximization (Zeng et al, 2023). Notably, Zeng et al design a novel metric that assesses both clustering quality and fairness from the perspective of information theory. Despite these developments, there is still room for improvement, and the establishment of better evaluation metrics is a continuing area of research.
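To ground the fairness notion, the balance measure popularized by the fairlet line of work can be sketched as follows; the helper name `cluster_balance` and the 0/1 group encoding are our own conventions for illustration:

```python
from collections import Counter

def cluster_balance(assignments, groups):
    """Balance of a clustering w.r.t. a binary sensitive attribute, in the
    spirit of Chierichetti et al's fairlet work: for each cluster, take the
    ratio between its two group counts and report the worst cluster's ratio.
    1.0 means every cluster mixes both groups evenly; 0.0 means some cluster
    contains only one group."""
    per_cluster = {}
    for c, g in zip(assignments, groups):
        per_cluster.setdefault(c, Counter())[g] += 1
    balances = []
    for counts in per_cluster.values():
        a, b = counts.get(0, 0), counts.get(1, 0)
        balances.append(0.0 if a == 0 or b == 0 else min(a / b, b / a))
    return min(balances)

# Two clusters, each mixing the two groups at a 2:1 ratio at worst
print(cluster_balance([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 0]))  # 0.5
```

A fair clustering method then trades off such a balance constraint against the usual clustering objective.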

Multi-view Clustering
Multi-view data (Xu et al, 2013; Liu et al, 2019b) is common in real-world situations where information is captured from a variety of sensors or observed from multiple angles. This data is inherently rich, offering diverse yet consistent information. For example, an RGB view provides color details while the depth view reveals spatial information, which represents the complementary aspects of the views. Simultaneously, there exists a level of view consistency, as the same object possesses common attributes across different views. To deal with multi-view data, multi-view clustering (Deng et al, 2015; Liu et al, 2019a) is proposed to exploit both the complementary and consistent characteristics. The goal is to integrate information from all views to produce a unified and insightful clustering result.
Over recent years, several deep learning approaches (Andrew et al, 2013; Wang et al, 2016; Zhao et al, 2016; Peng et al, 2019) have been developed to address this challenge. Binary multi-view clustering (Zhang et al, 2018) simultaneously refines binary cluster structures alongside discrete data representations, ensuring cohesive clustering. In pursuit of view consistency, Lin et al (2021, 2022) maximize mutual information across views, thus aligning common properties. SURE (Yang et al, 2022b) aims to strengthen the consistency of shared features between views by utilizing a robust contrastive loss. Recently, Li et al (2023a) apply a bounded contrastive loss to preserve view complementarity at the cluster level. These innovative methodologies demonstrate the significant strides made in the field of multi-view analysis, where clustering continues to play a pivotal role in enhancing the synergistic exploitation of multi-view data.
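The view-consistency idea shared by these methods can be sketched as a symmetric cross-view contrastive alignment; this is a generic NumPy illustration of the principle, not code from any cited method, and the function names are our own:

```python
import numpy as np

def l2_normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def cross_view_infonce(Z1, Z2, tau=0.1):
    """Symmetric InfoNCE-style alignment between two views' embeddings.
    Row i of Z1 and row i of Z2 describe the same object and form a
    positive pair; all other rows in the batch serve as negatives."""
    Z1, Z2 = l2_normalize(Z1), l2_normalize(Z2)
    logits = Z1 @ Z2.T / tau
    # Log-softmax over each row; the positive pair sits on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_12 = -np.mean(np.diag(log_prob))
    log_prob_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_21 = -np.mean(np.diag(log_prob_t))
    return (loss_12 + loss_21) / 2
```

Minimizing such a loss pulls the two views of the same object together (consistency) while each view's encoder remains free to retain view-specific, complementary detail.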

Conclusion
The key to deep clustering, and to unsupervised learning in general, is to seek effective supervision to guide representation learning. Different from traditional taxonomies based on network structure or data type, this survey offers a comprehensive review from the perspective of prior knowledge. With the evolution of clustering technologies, there is a discernible trend shifting from exploring priors within the data itself to external knowledge such as natural language guidance. The exploration of external pretrained models like ChatGPT or GPT-4V(ision) might emerge as a promising avenue. This survey potentially provides some valuable insights and inspires further exploration and advancements in deep clustering.

Fig. 1
Fig. 1 Six categories of prior knowledge for deep clustering. (a) Structure Prior: data structure could reflect the semantic relation between instances. (b) Distribution Prior: instances from different clusters follow distinct data distributions. (c) Augmentation Invariance: samples augmented from the same instance have similar features. (d) Neighborhood Consistency: neighboring samples have consistent cluster assignments. (e) Pseudo Label: cluster assignments with high confidence are likely to be correct. (f) External Knowledge: abundant knowledge favorable to clustering exists in open-world data and models.
and splitting. Motivated by the success of classic clustering methods, the early exploration of deep clustering mainly focuses on adapting mature structure priors as objective functions to optimize neural networks.

Fig. 5
Fig. 5 The framework of pseudo-labeling based methods. Given features in the latent space, clustering algorithms such as K-means are performed to get pseudo labels. The pseudo labels, usually filtered by confidence, are then used as supervision signals to guide clustering.

Table 1
Commonly used mathematical notations.

Table 2
The summary of deep clustering methods from the perspective of prior knowledge.

Table 3
A summary of datasets commonly used for deep clustering.

Table 4
Clustering performance on five widely-used image clustering datasets. SCAN* denotes the clustering results using only the neighborhood consistency loss without the self-labeling step. † denotes using the train and test splits for training and testing respectively, instead of using both splits for training and testing. Horizontal lines separate methods with different priors. From top to bottom are structure prior, distribution prior, augmentation invariance, neighborhood consistency, pseudo-labeling, and external knowledge.
Lin Y, Jiang L, et al (2022) Improve interpretability of neural networks via sparse contrastive coding. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp 460-470
Liu X, Zhu X, Li M, et al (2019a) Multiple kernel k-means with incomplete kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(5):1191-1204
Liu X, Zhu X, Li M, et al (2019b) Multiple kernel k-means with incomplete kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(5):1191-1204
Lu Y, Lin Y, Yang M, et al (2024) Decoupled contrastive multi-view clustering with high-order random walks. Proceedings of the AAAI Conference on Artificial Intelligence 38(13):14193-14201
MacQueen J, et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, pp 281-297
face and visual landmark recognition. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10842-10851, https://doi.org/10.1109/CVPR46437.2021.01070
Nie F, Li J, Li X, et al (2016) Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In: IJCAI
Nie F, Li J, Li X, et al (2017) Self-weighted multiview clustering with multiple graphs.