Neural clustering based on implicit maximum likelihood

Clustering is one of the most fundamental unsupervised learning tasks, with numerous applications in various fields. Clustering methods based on neural networks, called deep clustering methods, leverage the representational power of neural networks to enhance clustering performance. ClusterGAN is a generative deep clustering method that exploits generative adversarial networks (GANs) to perform clustering. However, it inherits some deficiencies of GANs, such as mode collapse, vanishing gradients and training instability. The generative approach of implicit maximum likelihood estimation (IMLE) has recently been proposed to tackle those deficiencies. In this paper, we present a clustering method based on generative neural networks, called neural implicit maximum likelihood clustering (NIMLC), which adopts ideas from both ClusterGAN and IMLE. The proposed method has been compared with ClusterGAN and other neural clustering methods on both synthetic and real datasets, demonstrating promising results.


Introduction
Unsupervised learning has gained wide interest due to the emergence of big data collections and the cost of label acquisition. Clustering is one of the most important and fundamental unsupervised learning tasks, with numerous applications in computer science and many other scientific fields [1,2]. It is defined as the process of partitioning a set of objects into groups (called clusters) so that the data in the same group share common characteristics and differ from data in other groups. While the definition of clustering is simple, it is a hard machine learning problem. Its difficulty arises from several factors, e.g., data preprocessing and representation, the clustering criterion, the optimization algorithm, and parameter initialization. In addition, clustering evaluation is also challenging due to the unsupervised nature of the problem [3,4].
Due to its particular importance, clustering is a well-studied problem with numerous proposed approaches. Generally, they can be classified as hierarchical (divisive or agglomerative), model-based (e.g., k-means [5], mixture models [3]) and density-based (e.g., DBSCAN [6], DensityPeaks [7]). Most methods are effective when the data space is low dimensional and not complex. To address those limitations, various feature extraction and feature transformation methods have been proposed that map the original complex data to a more ''cluster-friendly'' feature space as a preprocessing step. Such methods include Principal Component Analysis [8], Non-negative Matrix Factorization [9], spectral methods [10], and Minimum Density Hyperplanes [11].
Neural networks have been employed for clustering in the context of deep learning. Deep neural networks have been used to learn rich and useful data representations from data collections without heavily relying on human-engineered features. They can improve the performance of supervised and unsupervised learning tasks because of their excellent nonlinear mapping capability and flexibility [12-16]. Although clustering was not initially the primary goal of deep learning, several clustering methods have been proposed that exploit the representational power of neural networks; thus, the deep clustering category of methods has emerged [17-21]. Such methods aim to improve the quality of clustering results by appropriately training neural networks to transform the input data and generate cluster-friendly representations [22-26].
In this work, we propose a neural clustering method called Neural Implicit Maximum Likelihood Clustering (NIMLC). It is a generative clustering method that relies on the recently proposed method of Implicit Maximum Likelihood Estimation (IMLE) [27]. This is an alternative approach to GANs [28]: given a set of data objects, the IMLE method uses a generator network that takes random input vectors and learns to produce synthetic samples. By minimizing an appropriate objective, the network is trained so that the distribution of samples resembles the data distribution. It has been shown that this training procedure maximizes the likelihood of the dataset without explicitly computing the likelihood.
In analogy with the ClusterGAN [29] method, which exploits the GAN methodology to perform clustering, we have developed the NIMLC method, which relies on the IMLE methodology to perform clustering. NIMLC utilizes two neural networks, the generator and the encoder. In contrast to ClusterGAN, a discriminator network is not needed. The generator network is fed with appropriately selected random samples (latent vectors) z belonging to K clusters and is trained to produce synthetic samples that resemble the objects of the dataset X. The encoder network provides the partition of the dataset X into K clusters by learning the inverse map from the data space X to the latent space Z. Training of both networks is achieved by minimizing an appropriately defined objective function that involves the IMLE loss (data generation) and the reconstruction loss for the latent vectors z.
Note that the IMLE method does not suffer from mode collapse, vanishing gradients or training instability, which are frequently encountered in GAN training. Moreover, it does not require large datasets for training. Our aim is to exploit these attractive IMLE properties for solving clustering problems through the development of the proposed NIMLC method.
The organization of the paper is the following. In Sect. 2, related work is presented. In Sect. 3, the IMLE method is first described and then the proposed NIMLC clustering method is presented and explained. Section 4 presents comparative experimental results on various datasets, while Sect. 5 provides conclusions and directions for future research.

Autoencoder based clustering
Inspired by the t-SNE [52] algorithm, the Deep Embedded Clustering (DEC) [53] method has been proposed, which optimizes both a reconstruction objective and a clustering objective. DEC transforms the data into an embedded space using an autoencoder and then optimizes a clustering loss defined by the KL divergence between two distributions P and Q: Q is the soft clustering assignment of the data based on the distances in the embedded space between data points and cluster centers, and P is an adjusted target distribution aiming to enhance the clustering quality by leveraging the soft cluster assignments. The initial cluster centers are computed by the k-means algorithm. The Improved Deep Embedded Clustering with local structure preservation (IDEC) [54] method has also been proposed to improve the data representation by maintaining its local structure. The optimized objective function is

L = Σ_{i=1}^{n} ||x_i − g(f(x_i))||² + λ KL(P || Q),

where f and g are the encoder and decoder, respectively (with learnable parameters), n is the number of data points and K the number of clusters, while λ ≥ 0 is the regularization parameter that balances the reconstruction loss and the clustering error. Similar to DEC, the Deep Clustering Network (DCN) [55] jointly learns the embeddings and the cluster assignments by optimizing the k-means clustering loss in the embedded space (Eq. 2). The optimized objective function is

L = Σ_{i=1}^{n} ||x_i − g(f(x_i))||² + λ Σ_{i=1}^{n} ||f(x_i) − M s_i||²,

where M is a matrix that contains the K cluster centers in the embedded space, and s_i is the cluster assignment vector of data point x_i, which has only one nonzero element.
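DEC's target distribution P can be sketched as follows; this is the standard DEC formula (p_ij = q_ij² / f_j with per-point renormalization, where f_j = Σ_i q_ij), shown here as a minimal NumPy illustration rather than the authors' code:

```python
import numpy as np

def dec_target_distribution(Q):
    """DEC's auxiliary target P from the soft assignment matrix Q (n x K).

    p_ij = (q_ij^2 / f_j) normalized per data point, where f_j = sum_i q_ij
    is the soft cluster frequency; this sharpens confident assignments.
    """
    weight = Q ** 2 / Q.sum(axis=0)                 # q_ij^2 / f_j
    return weight / weight.sum(axis=1, keepdims=True)
```

During DEC training, P is periodically recomputed from the current Q, and the network is updated to minimize KL(P || Q).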

Generative neural clustering
The second category of neural clustering methods includes techniques that rely on models for synthetic data generation and are typically based on the GAN [28] methodology. ClusterGAN [29] is the most well-known method of this category, utilizing the GAN methodology to achieve both data clustering and synthetic data generation. ClusterGAN takes as input a set of random input vectors z that belong to K clusters. An input vector of cluster k is defined as z = (z_n, z_c), where z_n ~ N(0, σ²I_{d_n}) and z_c = e_k, with e_k the kth standard unit vector of length K.
Besides the generator G and the discriminator network D, ClusterGAN includes an additional network, the encoder E, which provides the cluster assignments of its input x. ClusterGAN trains the typical generator-discriminator architecture jointly with the encoder to achieve clustering and synthetic data generation by optimizing the objective function (Eq. 3), where H(·,·) is the cross-entropy loss, β_n and β_c are regularization coefficients, and q(·) is the quality function, given as q(x) = log(x) for the vanilla GAN [28] and q(x) = x for the Wasserstein GAN (WGAN) [56].
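The ClusterGAN objective (Eq. 3) described above can be written out as follows; this is a reconstruction from the description, with E_n and E_c denoting the encoder outputs for the z_n and z_c parts of the latent vector:

```latex
\min_{\Theta_G,\Theta_E}\ \max_{\Theta_D}\
\mathbb{E}_{x \sim P_x}\, q(D(x))
+ \mathbb{E}_{z \sim P_z}\, q\!\left(1 - D(G(z))\right)
+ \beta_n\, \mathbb{E}_{z \sim P_z} \left\lVert z_n - E_n(G(z)) \right\rVert_2^2
+ \beta_c\, \mathbb{E}_{z \sim P_z}\, H\!\left(z_c, E_c(G(z))\right)
```

The first two terms are the usual adversarial game, while the last two force the encoder to invert the generator on both parts of z.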

Neural implicit maximum likelihood clustering
The proposed NIMLC method relies on the data generation capabilities of the IMLE algorithm, which is summarized next.

Implicit maximum likelihood estimation
Given a dataset X = {x_1, …, x_n} of d-dimensional vectors, the IMLE algorithm [27] trains a generative neural network G_θ with m inputs, d outputs and parameter vector (weights) θ. This generator takes as input a random vector z ∈ R^m, usually sampled from an m-dimensional Normal distribution, and produces a sample s_θ ∈ R^d, i.e., s_θ = G_θ(z) (see Fig. 2a). IMLE trains the generator to generate synthetic samples s_θ that resemble the real data x_i. It is a simple generative method that, under certain conditions, implicitly maximizes the likelihood of the dataset, although the IMLE objective does not explicitly contain any log-likelihood term, and training neural networks using maximum likelihood is considered a difficult task [57].
In each IMLE iteration, a sampling procedure takes place where a set of L random input vectors z_i (called latent vectors) is drawn from the Normal distribution z_i ~ N(0, σ²I_m) and used to compute the corresponding synthetic samples s_i^θ = G_θ(z_i) (i = 1, …, L). Then, for each real data example x_i (i = 1, …, N), its representative sample r_i^θ ∈ S^θ is determined through nearest neighbor search (NNS) in S^θ based on Euclidean distance, i.e., r_i^θ = NNS(x_i, S^θ). The generator parameters θ are updated in order to minimize the IMLE objective function

L_IMLE(θ) = Σ_{i=1}^{N} ||r_i^θ − x_i||².

Figure 1 provides an illustration of the IMLE behavior.
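The iteration above can be sketched with a toy linear generator G(z) = zW + b trained by explicit gradient steps; this is a minimal NumPy illustration under that simplifying assumption, not the paper's implementation (a neural generator would be trained with an autodiff framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def imle_step(X, W, b, n_samples=64, lr=0.1, sigma=1.0):
    """One IMLE iteration for the toy linear generator G(z) = z @ W + b.

    Draw latent vectors, generate samples, find each data point's nearest
    sample, and take a gradient step on sum_i ||r_i - x_i||^2.
    """
    m = W.shape[0]
    Z = rng.normal(0.0, sigma, size=(n_samples, m))       # latent draws
    S = Z @ W + b                                         # generated samples
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # (N, L) sq. dists
    nn = d2.argmin(axis=1)                                # nearest sample id
    Zr, R = Z[nn], S[nn]                                  # representatives
    grad = 2.0 * (R - X)                                  # d loss / d R
    W -= lr * (Zr.T @ grad) / len(X)
    b -= lr * grad.mean(axis=0)
    return W, b, float(d2.min(axis=1).mean())

# Toy run: data concentrated around (5, 5); the generator learns to cover it.
X = rng.normal(5.0, 0.3, size=(200, 2))
W = 0.1 * rng.normal(size=(2, 2))
b = np.zeros(2)
history = []
for _ in range(150):
    W, b, loss = imle_step(X, W, b)
    history.append(loss)
```

The average nearest-sample distance in `history` decreases as training proceeds, since the loss pulls at least one sample toward every data point.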
The IMLE method exhibits several attractive properties: it does not suffer from mode collapse, vanishing gradients, or training instability, unlike popular deep generative methods such as GANs [28]. Mode collapse does not occur since the loss ensures that each data example is represented by at least one sample. Gradients do not vanish because the gradient of the distance between a data example and its representative sample does not become zero unless they coincide. Training is stable because the IMLE estimator is the solution of a simple minimization problem. Finally, IMLE can be used with both small and large datasets.

Cluster friendly input distribution
In the original IMLE method, the random input (latent) vectors z belong to a single cluster, since they are drawn from a single multivariate m-dimensional Normal distribution. This is not convenient for clustering. If we instead assume that the input vectors z are drawn from a mixture model, i.e., from K distinct distributions, then a clustering of the original dataset X can be obtained: each data point x_i is assigned to the cluster to which its corresponding input vector z_i belongs. Therefore, in the proposed method, the single Normal distribution is replaced by K non-overlapping distributions, with the kth distribution responsible for the generation of the subset Z_k of input vectors assigned to cluster k. The most obvious first choice is a mixture of K m-dimensional Gaussian distributions. However, this choice requires specifying the means and covariances of the K Gaussians so that they are well separated.
A more sophisticated mechanism for generating m-dimensional random vectors that belong to K disjoint clusters has been proposed in ClusterGAN [29], where the input vector z consists of two parts, i.e., z = (z_n, z_c). The first part, z_n, is a random vector (of dimension d_n) drawn from the Gaussian distribution z_n ~ N(0, σ²I_{d_n}). The second part, z_c, is deterministic and specifies the cluster k to which z is assigned. Specifically, z_c is the one-hot encoding of the corresponding cluster k. Thus, for K clusters, the dimension of z_c is equal to K and, if z belongs to the kth cluster, then z_c = e_k, where e_k is the kth standard unit vector. Note that σ should be set to a small value so that clusters do not overlap.
In summary, in order to generate an input vector z = (z_n, z_c) belonging to cluster k, we set the z_c part equal to the one-hot encoding of k and draw the z_n part from N(0, σ²I_{d_n}). By sampling an equal number of vectors for each cluster k, the set of random input vectors Z is created at each iteration and partitioned into disjoint subsets Z_k, each containing the random input vectors of cluster k (k = 1, …, K).
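This sampling scheme can be sketched in a few lines of NumPy (an illustrative sketch; function and argument names are ours):

```python
import numpy as np

def sample_latents(n_per_cluster, K, d_n, sigma, rng=None):
    """Draw cluster-friendly latent vectors z = (z_n, z_c):
    z_n ~ N(0, sigma^2 I) and z_c = one-hot encoding of the cluster.

    Returns the latent matrix of shape (n_per_cluster * K, d_n + K)
    and the cluster label of each latent vector.
    """
    rng = rng if rng is not None else np.random.default_rng()
    z_n = rng.normal(0.0, sigma, size=(n_per_cluster * K, d_n))
    labels = np.repeat(np.arange(K), n_per_cluster)
    z_c = np.eye(K)[labels]                     # one-hot cluster codes
    return np.concatenate([z_n, z_c], axis=1), labels
```

With a small sigma, the K groups of latent vectors stay well separated because the one-hot parts of different clusters are at fixed distance sqrt(2) from each other.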
Additionally, since s_θ = G_θ(z), the set S^θ of computed samples is partitioned into K disjoint clusters S_k^θ. Consequently, the original dataset X can be partitioned into K clusters by assigning each x_i to the cluster of its representative r_i^θ, i.e., if r_i^θ ∈ S_k^θ, then x_i is assigned to cluster k.

The IMLE loss from a clustering perspective
If we examine the IMLE objective function, we can observe its similarity to the k-means clustering loss. Specifically, if we generate exactly K samples S_K^θ = {s_1^θ, …, s_K^θ} in each training epoch, where K is the number of clusters, we can treat those synthetic samples as cluster representatives (centroids). In this case, the IMLE objective coincides with the k-means objective (1_{C_k} is the indicator function):

L(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} 1_{C_k}(x_i) ||s_k^θ − x_i||²,

and IMLE can be considered a clustering procedure that trains the generator to produce the cluster centers. The major difference between k-means and IMLE is that k-means updates the centroids directly in order to minimize the clustering loss; in contrast, IMLE updates the parameters θ of the generator. An issue to be considered is how to specify the K input vectors z_k used to generate the K samples so that each sample represents a different cluster. Since z_k = (z_{n,k}, z_{c,k}), a straightforward solution is to set σ = 0, thus z_{n,k} = 0 and z_{c,k} = e_k for all k = 1, …, K. Then, by feeding those z_k vectors as inputs to the generator, the synthetic samples s_k provided as outputs can be treated as cluster representatives. Training the generator this way using IMLE, we observed clustering behavior similar to k-means, with the K generated samples resembling the average data point of each cluster.
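The σ = 0 correspondence with k-means can be illustrated by treating the K generated samples as free parameters; this is a hypothetical simplification for intuition only (in NIMLC the gradient step updates the generator weights, not the samples themselves):

```python
import numpy as np

def imle_kmeans_step(X, C, lr=0.5):
    """One gradient step on the IMLE/k-means loss with exactly K samples.

    C holds the K samples (centroids). Each data point is assigned to its
    nearest sample, and the gradient of sum ||c_k - x_i||^2 moves every
    sample toward the mean of its assigned points, mirroring k-means.
    """
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    a = d2.argmin(axis=1)                      # nearest sample = cluster
    for k in range(len(C)):
        pts = X[a == k]
        if len(pts):
            C[k] -= lr * 2 * (C[k] - pts.mean(axis=0)) * len(pts) / len(X)
    return C, a
```

With lr chosen large enough, repeated steps converge to the cluster means, exactly as Lloyd's k-means updates would.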

The NIMLC architecture
The proposed NIMLC approach is a modification of the IMLE method that achieves not only synthetic data generation but also clustering of the original dataset X. NIMLC combines ideas from IMLE and ClusterGAN. More specifically, it exploits the IMLE generator network, which is fed with clustered input vectors z that follow the (z_n, z_c) representation proposed in ClusterGAN. Additionally, it employs a second network called the encoder (originally proposed in ClusterGAN) that is trained to provide the cluster assignment for a data point x. It should be noted that, unlike ClusterGAN, NIMLC does not make use of a discriminator network, since it relies on IMLE for synthetic data generation. The NIMLC architecture is presented in Fig. 2b.
The generator G is trained to produce synthetic samples that resemble the real data x_i by minimizing the IMLE objective (Eq. 4). It provides a mapping from the latent space to the data space. The encoder E is trained jointly with the generator to implement the inverse mapping from the data space to the latent space. Thus, for an input x, it provides estimates ẑ_n and ẑ_c. The latter (ẑ_c) is computed using the softmax activation function (with K outputs) and provides a soft clustering assignment of the input x into K clusters.

Fig. 1 The data points are represented by squares and the samples by circles. a For each data point, the nearest sample is found. b The generator is updated at each iteration so that the generated samples minimize the IMLE objective
In summary, the NIMLC architecture feeds an input vector z = (z_n, z_c) to the generator, which produces a synthetic sample s = G(z). This sample is subsequently fed to the encoder, which provides the output ẑ = E(s). Note that the NIMLC network is actually an autoencoder, since it takes an input z and provides as output an estimate ẑ of z.

After training, the encoder implements a clustering model providing soft clustering assignments ẑ_c for any data point x.

The NIMLC objective function
The objective function used to train the NIMLC architecture consists of two parts. The first part concerns the generative process and is the IMLE error Σ_{i=1}^{n} ||r_i^{θ_G} − x_i||² (Eq. 4). Since NIMLC is an autoencoder, the second part of the objective function is the reconstruction loss of the autoencoder. This loss can be split into two terms. The first term is the reconstruction loss for the z_n part, ||z_n − ẑ_n||². The second term is the reconstruction loss for the z_c part. Since z_c is a one-hot vector and ẑ_c is a probability vector provided by the softmax function, the cross-entropy H(z_c, ẑ_c) between z_c and ẑ_c is used as the loss.
The complete objective function is

L(θ_G, θ_E) = Σ_{i=1}^{n} ||r_i^{θ_G} − x_i||² + β_n Σ_{i=1}^{n} ||z_{n,i} − ẑ_{n,i}||² + β_c Σ_{i=1}^{n} H(z_{c,i}, ẑ_{c,i}),

where β_n and β_c are hyperparameters adjusting the importance of each term.
It should be noted that the first term depends only on the parameters θ_G of the generator, while the remaining two terms depend on the parameters of both the generator θ_G and the encoder θ_E.

Slow paced learning
A critical hyperparameter of the NIMLC method is the standard deviation σ of the noise distribution used to generate the random part z_n of the input vectors. As mentioned earlier, when training the model with σ = 0, we have a very strict case with one generated sample per cluster. This sample can be considered the representative of the corresponding cluster, and the obtained clustering results are on par with those of k-means. On the other hand, if σ is relatively large (e.g., σ = 0.15), the random input vectors of each cluster are not very close to one another. Therefore, it is possible for the generator to map inputs of the same cluster to different regions of the data space, which negatively affects clustering performance. Moreover, we have observed that it is difficult to specify an appropriate value for σ.
In order to tackle this problem, we propose the following procedure:
• Start training with a small value of σ, preferably σ = 0.
• In each training epoch, increase σ by a small amount Δσ.
• Stop increasing when a maximum value σ_max is reached.
The intuition is that this slow-paced training procedure [58,59] strives to learn and cluster the ''easier'' data points first, such as those close to the cluster centers, and then tries to learn and cluster ''more difficult'' data points farther from the cluster centers. Thus, we initially explore the clustering solution space with no variability in the input space (σ = 0). This way, we enforce only K samples to be generated and used to train the model. Then, at each training epoch, we add variability to the inputs by slowly increasing σ in order to incrementally capture complicated structures in the dataset.
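The schedule above amounts to a simple linear ramp. A minimal sketch, using Δσ = 5 × 10⁻⁵ from our experimental setup and an assumed cap σ_max = 0.15:

```python
def sigma_schedule(epoch, delta_sigma=5e-5, sigma_max=0.15):
    """Slow-paced noise schedule: sigma starts at 0 and grows linearly by
    delta_sigma per epoch until the cap sigma_max is reached."""
    return min(epoch * delta_sigma, sigma_max)
```

Any monotone ramp would serve; the linear one has a single, easy-to-tune slope.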

The NIMLC algorithm
The NIMLC method is summarized in Algorithm 1. At each epoch, a set of input vectors Z = {z_1, …, z_L} is generated, belonging to K clusters Z_k (k = 1, …, K) of equal size. Each input vector z_i = (z_{n,i}, z_{c,i}) of Z_k is computed by sampling z_{n,i} from N(0, σ²I_{d_n}) and setting z_{c,i} = e_k. We then feed the generator with the set of input vectors Z, and the set of synthetic samples S^{θ_G} = {s_1^{θ_G}, …, s_L^{θ_G}} is generated at its output, i.e., s_i^{θ_G} = G(z_i). Then, for each data batch X_b, we compute the nearest synthetic sample r_i for each x_i ∈ X_b, i.e., r_i = NNS(x_i, S^{θ_G}). Next, each r_i is fed as input to the encoder, which produces the reconstruction ẑ_i = E(r_i), where ẑ_i = (ẑ_{n,i}, ẑ_{c,i}). Then we update the parameters of the generator and the encoder using the gradients of the objective function (Algorithm 1, steps 11 and 12). Finally, before proceeding to the next epoch, the standard deviation σ is updated.
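The data flow of one epoch can be sketched as follows, with G and E as placeholder callables standing in for the trained networks and the gradient updates omitted (an illustrative NumPy sketch, not the paper's implementation):

```python
import numpy as np

def nimlc_epoch_losses(X, G, E, K, d_n, sigma, L=100,
                       beta_n=1.0, beta_c=1.0, rng=None):
    """One NIMLC epoch, losses only (Algorithm 1 without steps 11-12).

    G: callable mapping latents (L, d_n + K) -> samples (L, d)
    E: callable mapping samples  (N, d) -> (z_n_hat, z_c_hat), with
       z_c_hat rows being softmax probabilities over K clusters.
    Returns the total objective value and each x_i's cluster assignment
    (the cluster of its representative sample).
    """
    rng = rng if rng is not None else np.random.default_rng()
    per = L // K
    labels = np.repeat(np.arange(K), per)            # cluster of each latent
    z_n = rng.normal(0.0, sigma, size=(per * K, d_n))
    z_c = np.eye(K)[labels]                          # one-hot codes
    Z = np.concatenate([z_n, z_c], axis=1)
    S = G(Z)                                         # synthetic samples
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)                           # nearest-sample index
    R = S[nn]                                        # representatives r_i
    zn_hat, zc_hat = E(R)                            # encoder reconstruction
    imle_loss = ((R - X) ** 2).sum(-1).sum()
    rec_n = ((z_n[nn] - zn_hat) ** 2).sum(-1).sum()
    rec_c = -(z_c[nn] * np.log(zc_hat + 1e-12)).sum()
    total = imle_loss + beta_n * rec_n + beta_c * rec_c
    return total, labels[nn]
```

In the actual method, `total` is differentiated with respect to the weights of G and E; here, plugging in toy callables already shows how each x_i inherits the cluster of its representative.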
Fig. 3 The evolution of generated samples for the Moons synthetic dataset as σ progressively increases

It should be noted that using IMLE for clustering was introduced in our previous work [60]. However, that method did not make use of the encoder network. Instead, a two-stage nearest neighbor search was used to perform cluster assignments. Specifically, in the first stage, the centroid c_k of each subset S_k^θ was computed, and x_i was assigned to the cluster l whose centroid c_l is nearest to x_i based on Euclidean distance. In the second stage, instead of determining the representative sample for x_i through nearest neighbor search over the entire set of samples S^θ, the search was executed only over the subset S_l^θ that contains the samples of cluster l. The NIMLC method proposed herein includes two substantial improvements that lead to considerable performance enhancement. The first is the use of the encoder network, which directly provides the cluster assignment for a given input x, while the second is the gradual increase of the noise standard deviation σ that allows for slow-paced learning. Additionally, we exploit the generalization capability of the encoder network to cluster those data points that the generator could not learn sufficiently.
A computational overhead of our approach compared to other deep clustering methods is related to nearest neighbor search. The training process involves several epochs in which the algorithm must find the closest synthetic sample to each original data point, resulting in an O(NL) overhead in distance calculations. However, we observed that recalculating the nearest neighbors in every training epoch is unnecessary; reusing them for 5 to 10 epochs can significantly reduce training time without compromising clustering performance.
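The caching idea can be sketched as follows (an illustrative sketch; the function name and cache layout are ours):

```python
import numpy as np

def representatives(epoch, X, S, cache, recompute_every=5):
    """Nearest-neighbor indices recomputed only every few epochs, reusing
    the cached assignment in between to cut the O(NL) search cost
    (5 to 10 epochs of reuse worked well in our experiments)."""
    if cache is None or epoch % recompute_every == 0:
        d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
        cache = d2.argmin(axis=1)
    return cache
```

Between recomputations, the representative of x_i is simply S[cache[i]] for the current samples S, so gradient updates still flow through fresh generator outputs.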
Experimental evaluation

In order to evaluate the proposed clustering method (NIMLC), we conducted an experimental study using several synthetic and real datasets. We compared NIMLC against ClusterGAN [29] and the two most popular deep clustering methods, namely DCN [55] and DEC [53]. We also provide results using k-means [5,61] and the density-based method i-DivClu-D [62].

Synthetic datasets
We used three synthetic two-dimensional datasets (Table 1) with known ground truth and different structures in order to assess the clustering capability of our method. The Gaussians dataset consists of four clusters (Fig. 4a), while the Moons (Fig. 4b) and Rings (Fig. 4c) datasets consist of two clusters each. The Gaussians dataset is easier to cluster than the other two, whose structure is more complex. It should be emphasized that it is difficult for a parametric method to cluster both cloud-shaped (Gaussians) and ring-shaped (Rings) datasets.

Real datasets
We further evaluated the method by including real datasets in our experimental study. For all datasets, the number of clusters was set equal to the number of classes. As a preprocessing step, we used min-max normalization to map the attributes of each dataset to the [0, 1] interval, in order to prevent attributes with large ranges from dominating the distance calculations and to avoid numerical instabilities [63]. The descriptions of the datasets included in our study are given below. For a summary, refer to Table 2.
• 10x_73k [64] consists of 73,233 RNA transcripts belonging to 8 different cell types. The dataset is sparse, since the data matrix has about 40% zero values. Hence, similarly to [29], we selected the 720 genes with the highest variance across the cells to reduce the data dimensionality.
• Australian [65] is a two-class dataset composed of 690 credit card applications. A 14-dimensional feature vector describes each sample.
• CMU [65] contains grayscale facial images of twenty individuals captured with varying poses, expressions, and the presence or absence of glasses. The images are available in several resolutions, but for the purpose of our study, we utilized the 128 × 120 resolution images.
• Dermatology [65] is a six-class dataset containing 366 records of patients suffering from six different types of erythemato-squamous disease. Each patient is described by a 34-dimensional vector containing clinical and histopathological features.
• E. coli [65] includes 336 proteins from the E. coli bacterium, with seven attributes calculated from the amino acid sequences. Proteins belong to eight classes according to their cellular localization sites.
• Iris [65] dataset contains three classes of 50 instances each, where each class refers to a type of iris plant.Each sample is described by a 4-dimensional vector, corresponding to the length and width of the sepals and petals in centimeters.
• Olivetti [66] is a face database of 40 individuals with ten 64 × 64 grayscale images per individual. For some individuals, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
• Optical Recognition of Handwritten Digits (ORHD) [65] comprises a set of handwritten digits, with ten classes corresponding to the digits 0 to 9. The resolution of each image is 8 × 8. For our experiments, we utilized the test set of this dataset, which consists of 1797 images.
• Pendigits [65] consists of 250 writing samples from each of 44 different writers, a total of 10,992 samples. Each sample is a 16-dimensional vector containing pixel coordinates, associated with a label from ten classes.
• United States Postal Service (USPS) [67] is a collection of handwritten digits consisting of 7291 grayscale images. The dataset is organized into ten classes, each representing a digit from 0 to 9, with images of size 16 × 16 pixels.
• Wine [65] is a three-class dataset consisting of 178 samples of chemical analyses of wines. A 13-dimensional feature vector describes each sample.
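The min-max preprocessing step described earlier can be sketched as a per-column rescaling (an illustrative sketch; the epsilon guard for constant columns is our addition):

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Map each attribute (column) of X to [0, 1] via min-max scaling,
    preventing large-range attributes from dominating distance
    calculations; eps guards against division by zero on constant columns."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn + eps)
```

The same (mn, mx) computed on the data should be reused for any held-out points, so all inputs share one scale.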

Evaluation measures
It is important to mention that, since clustering is an unsupervised problem, we ensured that all algorithms were unaware of the true clustering of the data. In order to evaluate the results of the clustering methods, we use standard external evaluation measures [68], which assume that a ground-truth clustering is available. For all algorithms, the number of clusters is set to the number of ground-truth categories [25], and it is assumed that cluster labels coincide with class labels. The first evaluation measure is clustering accuracy (ACC):

ACC = max_m (1/n) Σ_{i=1}^{n} 1(y_i = m(c_i)),

where 1(·) is the indicator function, y_i is the ground-truth label, c_i is the cluster assignment generated by the clustering algorithm, and m is a mapping function that ranges over all possible one-to-one mappings between assignments and labels. This measure finds the best matching between the cluster assignments of a clustering method and the ground truth. It is worth noting that the optimal mapping function can be efficiently computed by the Hungarian algorithm [69]. The second evaluation measure is purity (PUR). Purity is formulated by the same equation as clustering accuracy (Eq. 7); the key difference lies in the mapping function m, which in this case greedily assigns each cluster to the ground-truth category that maximizes purity. The third evaluation measure is the normalized mutual information (NMI) [70], defined as the mutual information I(Y; C) between the ground-truth labels Y and the cluster labels C, normalized by their entropies H(Y) and H(C). The final evaluation measure is the adjusted Rand Index (ARI) [71,72], which computes a similarity measure between two clustering solutions, defined as the chance-adjusted proportion of object pairs that are either assigned to the same cluster in both clusterings or to different clusters in both clusterings.

Both the generator and the encoder were trained using the Adam optimizer [73] with learning rate η = 3 × 10⁻⁴ and coefficients β₁ = 0.5 and β₂ = 0.9. We set
β_n = β_c = 1 and Δσ = 5 × 10⁻⁵ in all experiments. Additionally, the number of samples was set to 100 and 200 for small and large datasets, respectively. We used the same architectures for the two networks as ClusterGAN [29]. Specifically, the dimension of z_c is set equal to the number of clusters. We used Leaky ReLU activations (LReLU) with leak = 0.2 and Batch Normalization (BN). We used the same number of hidden layers and hidden neurons for all datasets. We present the detailed generator and encoder architectures in Tables 3 and 4, respectively.
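As a concrete illustration of the ACC measure defined above, the following sketch finds the optimal one-to-one mapping by exhaustive search (fine for small K; the Hungarian algorithm yields the same optimum in polynomial time):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """ACC: the best one-to-one matching between cluster labels and
    ground-truth classes, searched exhaustively over permutations."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0
    for perm in permutations(labels):
        m = dict(zip(labels, perm))          # cluster label -> class label
        best = max(best, sum(m[c] == y for y, c in zip(y_true, y_pred)))
    return best / len(y_true)
```

For real use, replacing the permutation loop with `scipy.optimize.linear_sum_assignment` on the contingency matrix avoids the factorial cost.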
For ClusterGAN, we used the proposed architecture and hyperparameters. In the case of DCN and DEC, an extensive search for an autoencoder model was required in order to obtain good results. We chose symmetrical encoder and decoder networks to simplify the architecture search problem. We resorted to an encoder architecture with three layers, d − [2d, 3d] − d_z, where d is the data-space dimension and d_z is the latent-space dimension. All layers are fully connected.
The NIMLC and ClusterGAN methods were executed for 5000 epochs, while DCN and DEC required 300 to 500 epochs of pretraining and 100 epochs of training with the clustering objective. Furthermore, for the methods that depend on initialization, we executed the neural approaches NIMLC, ClusterGAN, DCN, and DEC three times, and the k-means algorithm ten times with k-means++ [74] initialization. Average performance results are provided. The i-DivClu-D method is deterministic and requires the number of nearest neighbors as a hyperparameter. In our experiments, we set its value to the minimum number that resulted in a connected graph.

Results on synthetic datasets
In Table 5, we provide the average clustering performance of the compared methods on the synthetic datasets. All methods performed well when the dataset consisted of spherical, well-separated clusters, as in the Gaussians dataset. On the Moons and Rings datasets, ClusterGAN and DEC had clustering performance similar to the k-means algorithm, while the DCN method performed better. On the other hand, the NIMLC method could perfectly solve the Moons dataset and had, by a significant margin, the best clustering performance on Rings, the most difficult of the three synthetic datasets. It should be stressed that NIMLC presents the unique capability of solving both the Gaussians and Rings datasets by training the same neural architecture. It should also be noted that the density-based i-DivClu-D method demonstrated perfect clustering performance on all three synthetic datasets.

Results on real datasets
Table 6 shows the clustering performance of the compared methods on tabular datasets, while Table 7 displays performance results on image datasets.
The NIMLC method achieved excellent clustering performance on 10x_73k, Australian, CMU, Olivetti-Faces, and Pendigits, outperforming all other methods.Moreover, on Dermatology, E. coli, Iris, ORHD, USPS, and Wine datasets, the NIMLC method demonstrated comparable results with the best-performing method.It should be stressed that the high dimensionality of data and the limited

Conclusions
We have proposed the NIMLC clustering method, which is based on neural network training. NIMLC is a generative clustering approach that relies on the IMLE generative methodology to perform clustering. NIMLC brings ideas from the ClusterGAN algorithm into the IMLE framework to overcome some of the GAN deficiencies. The method is based on a simple training objective, does not suffer from training instabilities, and performs well on small datasets. Experimental comparison against several deep clustering methods and conventional clustering methods illustrates the potential of the approach. A notable characteristic of the method is that, as shown in the experiments with synthetic data, it is able to cluster both cloud-shaped and ring-shaped data using the same hyperparameter setting. Future research could focus on a more detailed experimental investigation of the performance of NIMLC and its use in real-world applications. Additionally, we aim to consider a modification of the method where the number of samples could vary at each epoch. In the same spirit, we could explore the possibility of adopting a self-paced approach similar to the one proposed in [59] for the online tuning of the parameter Δσ. Finally, it is interesting to study how the NIMLC approach could be integrated into a general methodology for estimating the true number of clusters in a dataset.

Fig. 4 The synthetic datasets used in our experiments

Table 1
Description of synthetic datasets

Table 2
Descriptions of the real datasets

Table 5
Experimental results on synthetic datasets

Table 6
Experimental results on real datasets

Table 7
Experimental results on real datasets. Bold numbers indicate the best average performance for each dataset. Results marked by ''-'' denote that the method was not able to learn the dataset