Joint optimization of an autoencoder for clustering and embedding

Deep embedded clustering has become a dominant approach to the unsupervised categorization of objects with deep neural networks. The optimization of the most popular methods alternates between training a deep autoencoder and a k-means clustering of the autoencoder's embedding. This diachronic setting, however, prevents the former from benefiting from valuable information acquired by the latter. In this paper, we present an alternative where the autoencoder and the clustering are learned simultaneously. This is achieved by providing novel theoretical insight, where we show that the objective function of a certain class of Gaussian mixture models (GMMs) can naturally be rephrased as the loss function of a one-hidden-layer autoencoder, thus inheriting the built-in clustering capabilities of the GMM. That simple neural network, referred to as the clustering module, can be integrated into a deep autoencoder, resulting in a deep clustering model able to jointly learn a clustering and an embedding. Experiments confirm the equivalence between the clustering module and Gaussian mixture models. Further evaluations affirm the empirical relevance of our deep architecture, as it outperforms related baselines on several data sets.


Introduction
Clustering is one of the oldest and most difficult problems in machine learning, and a great variety of clustering techniques have been proposed in the literature.
Perhaps the most prominent clustering approach is the k-means algorithm [39]. Its simplicity renders the algorithm particularly attractive to practitioners beyond the fields of data mining and machine learning [47,17,25,16]. A variety of related algorithms and extensions have been studied [13,34,11]. The model behind the standard k-means algorithm is an isotropic Gaussian mixture model (GMM) [37,40]. As such, it can only produce linear partitions of the input space.
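The correspondence between k-means and an isotropic mixture can be illustrated with a small sketch (not from the paper; blob centers and parameters below are arbitrary demo choices): on well-separated blobs, k-means and a spherical-covariance GMM recover essentially the same partition.

```python
# Demo: k-means behaves like an isotropic ("spherical") GMM on separated blobs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

centers = [[0, 0], [5, 5], [0, 5], [5, 0]]          # well-separated centers
X, y = make_blobs(n_samples=400, centers=centers, cluster_std=0.5,
                  random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=4, covariance_type="spherical",
                      random_state=0).fit(X)

# Both recover the ground-truth partition (up to label permutation).
ari_km = adjusted_rand_score(y, km.labels_)
ari_gmm = adjusted_rand_score(y, gmm.predict(X))
assert ari_km > 0.9 and ari_gmm > 0.9
```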
A dominant recent line of research aims to alleviate this issue by learning an embedding of the data with a deep autoencoder (DAE) and having variants of Gaussian mixture models operate on the embedded instances [56,57,20,15,45]. This can be referred to as deep embedded clustering. An important reason for the emergence of this modern line of research in clustering is that it combines elements (the deep autoencoder and the GMM clustering) that are theoretically well understood when considered separately. In practice, since the embedding map is nonlinear, the otherwise linear clustering translates to a nonlinear solution in input space. This approach resembles the rationale behind kernel methods [22]. The difference lies in computing a kernel matrix before learning the clustering versus learning the feature map as the function approximated by the encoder. Another advantage of an autoencoder (AE) based approach is provided by the decoder of the network, which makes it possible to map the centroids back into (an approximated version of) the input space. Depending on the kernel function/matrix, this is not always possible with traditional approaches.
However, even with the advantages of deep embedded clustering described above, the fact that the embedding procedure and the clustering procedure are decoupled means that the former cannot benefit from the latter, and vice versa. In the most naive approach, the deep autoencoder is first trained, and then the features output by the autoencoder are clustered. In [56], an improvement is made by alternately shrinking the clusters in the embedded space around centroids and then updating the latter using k-means. The shrinking may, however, lead to an adaptation of the clustering to the embedding instead of the desired clustering-induced embedding. In general, these approaches cannot handle a random initialization, which usually leads to noisy embeddings and, in turn, meaningless clusterings.
Despite the obvious need, few deep embedded clustering procedures have been proposed featuring an inherent capability of bringing out clustering structure in a joint and simultaneous procedure. This is clearly due to the lack of a theoretical foundation that could support such an objective. In this paper, we provide novel theoretical insight that enables us to simultaneously optimize both the embedding and the clustering by relying on a neural network interpretation of Gaussian mixture models. Our solution derives from the fact that, during the EM algorithm, the objective function of a certain class of GMMs can be rephrased as the loss function of an autoencoder, as we show as a key result. A network trained with that loss is thus guaranteed to learn a linear clustering. Our contributions are threefold. (1) We state and prove a theorem connecting Gaussian mixture models and autoencoders. (2) We define the clustering module (CM) as a one-hidden-layer AE with softmax activation trained with the loss resulting from the previous point. (3) We integrate the CM into a deep autoencoder to form a novel class of deep clustering architectures, denoted AE-CM. Building on the theoretical insight, the experiments confirm the empirical equivalence of the clustering module with GMMs. As for the AE-CM, the model outperforms, on several datasets, baselines extending Gaussian mixture models to nonlinear clustering with deep autoencoders.
The remainder of the paper is organized as follows. Section 2 reviews related works and baselines, and Section 3 provides the theory underpinning our contributions. The clustering module and the AE-CM are introduced in Section 4. We report empirical evaluations in Section 5. Section 6 concludes the paper.

Related Work
Recently, several deep learning models have been proposed to group data in an unsupervised manner [51,50,58,31,18,23,43,46,53,54]. Since our main contribution unravels the connection between Gaussian mixture models and autoencoders, models based on associative [23,59], spectral [51,59,6] or subspace [61,45] clustering are outside the scope of this paper. Furthermore, unlike [58,10,31,18,21,9,29,54], our model falls within the category of general-purpose deep embedded clustering models that build upon GMMs [56,20,57,15]. We mention that recent works related to specific topics such as perturbation robustness and non-redundant subspace clustering also utilize ideas that are to some degree related to the general concept of deep embedded clustering [60,45]. Given the high generality of our method, we detail models that are also built without advanced mechanisms, but can still support them.
One of the first approaches in this latter category of dominating modern deep clustering algorithms is DEC [56], which consists of an embedding network coupled with an ad-hoc matrix representing the centroids. During training, the network is trained to shrink the data around their closest centroids in a way similar to t-SNE [41]. However, instead of matching the distributions in the input and feature spaces, the target distribution is a mixture of Student-t models inferred from a k-means clustering of the embedding itself. The latter is updated every few iterations to keep track of the evolution of the embedding, and the centroids are stored in the ad-hoc matrix. In order to avoid sub-optimal solutions when starting from random initialization, the embedding network is pre-trained. Nevertheless, the difference between the two training phases can cause stability issues, such as collapsing centroids. IDEC [20] extends DEC and alleviates this issue to some degree by also using the reconstruction loss of the DAE during the training phase. Although IDEC is more robust than DEC, both are highly dependent on the initial embedding, as the clustering phase quickly shrinks the data around the centroids. The Deep Clustering Network (DCN) [57] further strengthens the deep embedded clustering framework and is based on an architecture comparable to DEC and IDEC, but includes hard clustering. The optimization scheme consists of three steps: 1) optimize the DAE to reconstruct and bring the data points closer to their closest centroids (similar to k-means); 2) update assignments to the closest clusters; 3) update the centroids iteratively using the assignments. Although alternating optimization helps prevent instabilities during the optimization, DCN still requires initialization schemes in order to reach good solutions. Our proposed model instead, while being theoretically driven, removes the need for the alternating optimization schemes the previous methods rely on. Since we learn a variant of GMM using a neural network, the optimizations of both the embedding and the clustering profit from each other. In practice, our model is thus less reliant on pre-training schemes.
To some degree, our approach is similar to the very recent method known as DKM [15]. The model shares the same architecture as DEC, namely, a deep autoencoder and an ad-hoc matrix representing the centroids. However, here, the model optimizes the network and the centroids simultaneously. To achieve that, the loss function includes a term similar to the loss of fuzzy c-means [5], i.e., k-means with soft assignments. The cluster coefficients are computed from the distances to the centroids and normalized using a parameterized softmax. In order to convert soft assignments into hard ones, the parameter follows an annealing strategy. In our model, the assignment probability function is indirectly related to the distance to the closest cluster representative, as used in DKM. Our loss function aims to minimize the distance between the data points and their reconstructions as convex combinations of the centroids. This means that the learning of the centroids and of the assignment probabilities regularize each other, mitigating the need for an annealing strategy.
There are also other recent and promising approaches to clustering that leverage deep neural networks. However, these operate in a very different manner compared to deep embedded clustering, on which our novel theoretical insight sheds new light and which enables our newly proposed autoencoder optimization for joint clustering and embedding. We nevertheless briefly review some of these approaches and also compare to representatives of such methods in our experiments. In particular, a great deal of work has focused on the generative aspect of Gaussian mixture models and studied variations based on deep generative models such as variational autoencoders (VAE) [33] and generative adversarial networks (GAN) [19].

Adversarial approaches for clustering have been proposed, with examples being CatGAN [50] and adversarial autoencoders (AAE) [42]. The former optimizes the mutual information and predicts a categorical distribution of classes in the data while simultaneously maximizing the robustness against an adversarial generative model. Instead, AAE uses two adversarial networks to impose a Gaussian and a categorical distribution on the latent space. The recent ClusterGAN [46] makes an original use of an autoencoder, as the input is not a data point but a sample drawn from the product of a multinomial and a Gaussian distribution. The decoded sample is then fed to the discriminator and the encoder. This method is also robust to random network initialization.
Variational approaches that enable clustering include methods like the Gaussian Mixture VAE (GMVAE) [12] and Variational Deep Embedding (VaDE) [30]. Although the former explicitly refers to GMMs, both methods are similar. The difference is that the parameters of the Gaussian distributions in the embedded space depend only on the cluster index in VaDE, whereas GMVAE involves a second random variable. Both models require a pre-trained autoencoder. The variational information bottleneck with Gaussian mixture model (VIB-GMM) [53] also assumes a mixture of Gaussian distributions on the embedding; however, the model is trained according to the deep variational information bottleneck (VIB) framework. The latter can be seen as an information-theoretic generalization of GANs and VAEs [52,2]. VIB-GMM proves to be a robust alternative to previous approaches, as it is able to produce meaningful clusterings from a randomly initialized autoencoder.

Theoretical Groundwork
Throughout the paper, the set of positive integers is denoted by N*. The set of d-dimensional stochastic vectors is written as S_d := {s ∈ R^d : s_j ≥ 0, Σ_{j=1}^d s_j = 1}. When the context allows, the ranges of indices are abbreviated using the upper bound, e.g., a vector x ∈ R^d decomposes as x = (x_j)_{1≤j≤d} = (x_j)_d. The zero vector of R^d is written as 0_d; the notation extends to any number. The identity matrix of R^{d×d} is denoted by I_d.

From GMMs to Autoencoders
We aim to fit an isotropic Gaussian mixture model with K ∈ N* components and a Dirichlet prior on the average cluster responsibilities on a dataset X = (x_i)_N ∈ R^{N×d}. The expected value of the complete-data log-likelihood function of the model, also called the Q-function [7], is:

    Q(Γ, Φ, µ) = Σ_{i=1}^N Σ_{k=1}^K γ_ik ( log φ_k − ||x_i − µ_k||² ) + Σ_{k=1}^K (α_k − 1) log φ_k .    (2)

In this expression, z_i ∈ N is the cluster assignment of the data point x_i, γ_ik := p(z_i = k | x_i) is the responsibility of component k for x_i, φ_k is the average responsibility of component k, and α ∈ R^K is the concentration parameter of the Dirichlet prior. The covariance matrices do not appear in this expression as they are all constant and equal to (1/2) I_d due to the isotropic assumption. We summarize the parameters into the matrices Γ ∈ R^{N×K}, Φ ∈ S_K and µ ∈ R^{K×d}. Note that the cluster responsibility vector of x_i is γ_i = (γ_ik)_K ∈ S_K. Although the isotropic covariances allow for great simplifications, they restrict the model to spherical Gaussian distributions. We later alleviate this constraint by introducing a deep autoencoder. The Dirichlet prior involves an extra parameter but also smoothens the optimization. Note that these two assumptions make the model a relaxation of the one underlying k-means [37,40].
Theorem 1 The maximization of the Q-function in Eq. (2) with respect to Φ yields a term that can be interpreted as the reconstruction loss of an autoencoder.
Proof. During the EM algorithm, the maximization with respect to Φ updates the latter as the average cluster responsibility vector:

    φ_k = (1/N) Σ_{i=1}^N γ_ik =: γ̄_k .

Using this result, the Q-function can be rephrased as

    Q = N Σ_{k=1}^K γ̄_k log γ̄_k − Σ_{i=1}^N Σ_{k=1}^K γ_ik ||x_i − µ_k||² + Σ_{k=1}^K (α_k − 1) log γ̄_k .

The first term corresponds to the (negative) entropy of (γ̄_k)_K which, given the Dirichlet prior, is constant and can thus be omitted. We now expand Σ_k γ_ik ||x_i − µ_k||² by adding and subtracting the norm of x̃_i := Σ_{k=1}^K γ_ik µ_k. For any given i,

    Σ_{k=1}^K γ_ik ||x_i − µ_k||² = ||x_i − x̃_i||² + Σ_{k=1}^K γ_ik (1 − γ_ik) ||µ_k||² − Σ_{k≠l} γ_ik γ_il ⟨µ_k, µ_l⟩ .

Note that this simplification is grounded in the isotropic assumption; it is not obvious how to reach a similar result without it. The Q-function becomes:

    Q = − Σ_{i=1}^N ||x_i − x̃_i||² − Σ_{i,k} γ_ik (1 − γ_ik) ||µ_k||² + Σ_i Σ_{k≠l} γ_ik γ_il ⟨µ_k, µ_l⟩ + Σ_k (α_k − 1) log γ̄_k =: −E1 − E2 + E3 + E4 .    (3)

The variable x̃_i is actually a function of x_i that factorizes into two functions F and G. The former computes γ_i which, as a posterior probability, is a function of x_i. The latter is the dot product with µ.
The first term of Eq. (3) can thus be interpreted as the reconstruction term characteristic of the loss of an autoencoder consisting of the encoder and decoder functions F and G, respectively. □
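The expansion used in the proof can be checked numerically. The sketch below (NumPy, arbitrary random values) verifies that Σ_k γ_ik ||x_i − µ_k||² equals ||x_i − x̃_i||² + Σ_k γ_ik(1 − γ_ik)||µ_k||² − Σ_{k≠l} γ_ik γ_il ⟨µ_k, µ_l⟩ for one data point and a stochastic responsibility vector:

```python
# Numerical check of the expansion used in the proof (arbitrary random values).
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 3
x = rng.normal(size=d)                       # one data point
mu = rng.normal(size=(K, d))                 # centroids
gamma = rng.random(K)
gamma /= gamma.sum()                         # stochastic responsibility vector

x_tilde = gamma @ mu                         # convex combination of centroids

# Left-hand side: sum_k gamma_k ||x - mu_k||^2
lhs = sum(gamma[k] * np.sum((x - mu[k]) ** 2) for k in range(K))

# Right-hand side: E1 + E2 - E3 contributions of this single point
gram = mu @ mu.T                             # <mu_k, mu_l>
e1 = np.sum((x - x_tilde) ** 2)
e2 = sum(gamma[k] * (1 - gamma[k]) * gram[k, k] for k in range(K))
e3 = sum(gamma[k] * gamma[l] * gram[k, l]
         for k in range(K) for l in range(K) if k != l)

assert np.isclose(lhs, e1 + e2 - e3)
```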

Analysis of the terms
The four terms of the expansion of the Q-function in Eq. (3) give new insights into the training of GMMs.
E1: Reconstruction. The term E1 suggests an autoencoder structure to optimize Q. The decoder, G, is a linear function. However, without loss of generality, it can be treated in practice as an affine function, i.e., a single-layer network with bias.
Regarding the encoder F, the architecture can be anything as long as it has a stochastic output.
E2: Sparsity and regularization. The second term E2 weights by ||µ_k||² the quantities γ_ik(1 − γ_ik), which sum to the Gini impurity index [8] of γ_i:

    Gini(γ_i) = Σ_{k=1}^K γ_ik (1 − γ_ik) = 1 − ||γ_i||² ,  so that  Σ_i Gini(γ_i) = N − ||Γ||_F² ,

where ||·||_F is the Frobenius norm. The Gini index is standard in decision tree theory to select branching features and is an equivalent of the entropy. It is nonnegative and null if, and only if, γ_i is a one-hot vector. Hence, minimizing this term favors sparse γ_i, resulting in clearer assignments. The terms ||µ_k||² play a role similar to an ℓ2-regularization: they prevent the centroids from diverging away from the data points. However, they may also favor the trivial solution where all the centroids are merged into zero.

E3: Sparsity and cluster merging. To study the behavior of E3 during the optimization, let us consider a simple example with one observation and two clusters, i.e., Γ ≡ γ_1 = (γ, 1 − γ). If the observation is unambiguously assigned to one cluster, γ_1 is a one-hot vector and E3 is null. If it is not the case, the difference between E2 and E3 factorizes as follows:

    E2 − E3 = γ(1 − γ) ||µ_1 − µ_2||² .

The optimization will thus either push γ_1 toward a sparser vector, or merge the two centroids. Appendix B.3 presents an analysis of the role of this term.
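The two-cluster factorization can also be verified numerically (NumPy sketch with arbitrary values):

```python
# Check of the two-cluster identity E2 - E3 = g(1-g) * ||mu1 - mu2||^2.
import numpy as np

rng = np.random.default_rng(1)
g = 0.3                                   # gamma_1 = (g, 1 - g), ambiguous case
mu1, mu2 = rng.normal(size=(2, 5))        # two arbitrary centroids

e2 = g * (1 - g) * (mu1 @ mu1) + (1 - g) * g * (mu2 @ mu2)
e3 = 2 * g * (1 - g) * (mu1 @ mu2)

assert np.isclose(e2 - e3, g * (1 - g) * np.sum((mu1 - mu2) ** 2))
```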
E4: Balancing. The Dirichlet prior steers the distribution of the cluster assignments.
If none of the α_k is null, the prior will push the optimization to use all the clusters, thus moderating the penchant of E2 for the trivial clustering [57,20]. Note that if α = (1 + 1/K) 1_K, then E4 = (1/K) Σ_k log γ̄_k, which is, up to an additive constant, the negative Kullback-Leibler (KL) divergence between the uniform multinomial distribution and a multinomial distribution with parameter γ̄:

    E4 = −KL( U_K ‖ γ̄ ) − log K .

Clustering and Embedding with Autoencoders
Theorem 1 says that an autoencoder could be involved in the EM optimization of an isotropic GMM. We go one step further and bypass the EM algorithm to directly optimize the Q-function in Eq. (3) using an autoencoder.

The Clustering Module
We define the Clustering Module (CM) as the one-hidden-layer autoencoder with encoding and decoding functions F and G such that:

    Γ = F(X) = softmax( X W_enc + 1_N B_enc^T ) ,    X̂ = G(Γ) = Γ W_dec + 1_N B_dec^T ,

where X ∈ R^{N×d} ∼ X, the code representation/cluster responsibilities Γ = (γ_ik)_{N×K} ∈ R^{N×K} are such that γ_i ∈ S_K, and the reconstruction is X̂ ∈ R^{N×d}. The weight and bias parameters of the encoder are W_enc ∈ R^{d×K} and B_enc ∈ R^K, respectively, and analogously for the decoder, W_dec ∈ R^{K×d} and B_dec ∈ R^d. The softmax enforces the row-stochasticity of the code Γ. The associated loss function is the negative of Eq. (3):

    L_CM(Θ) = E1 + E2 − E3 − E4 ,    (7)

with Θ = {W_enc, B_enc, W_dec, B_dec}. The centroids of the underlying GMM correspond to the images under G of the canonical basis vectors of R^K.
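A minimal NumPy sketch of the CM's forward pass under the notation above (the loss and its gradient-based optimization are omitted; all weight values are random placeholders):

```python
# Forward pass of the clustering module: softmax encoder, affine decoder.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d, K = 10, 4, 3
W_enc = rng.normal(scale=0.1, size=(d, K)); b_enc = np.zeros(K)
W_dec = rng.normal(scale=0.1, size=(K, d)); b_dec = np.zeros(d)  # rows ~ centroids

X = rng.normal(size=(N, d))
Gamma = softmax(X @ W_enc + b_enc)   # encoder F: row-stochastic responsibilities
X_hat = Gamma @ W_dec + b_dec        # decoder G: convex mix of centroids + bias

assert np.allclose(Gamma.sum(axis=1), 1.0)   # codes live on the simplex S_K
assert X_hat.shape == X.shape
```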
Initialization. The CM can be initialized using k-means or any initialization scheme thereof, such as k-means++ [4]. In such a case, the rows of W_dec are set equal to the desired centroids. The pseudo-inverse of this matrix becomes the encoder's weights, W_enc, and both bias vectors are set to zero.

Averaging epoch. In practice, the CM is optimized by mini-batch learning with stochastic gradient descent. In such a procedure, the optimizer updates the positions of the centroids given the current batch. A small batch size relative to the size of X may cause dispersion of the intermediate centroids. Hence, choosing the final centroids based on the last iteration may be sub-optimal.
We illustrate this phenomenon in Figure 1. The data consist of N = 2,000 points in R² drawn from a mixture of five bi-variate Gaussians (K = 5) (gray dots). The data is standardized before processing. A CM is trained in mini-batches of size 20 over 50 epochs using stochastic gradient descent. The concentration is set to α_k = 5/K. The dispersion of the centroids after each iteration of the last epoch (crosses) is significant. On the other hand, their average positions (squares) provide a good approximation of the true centers (circles). Therefore, we include one extra epoch in any implementation of the CM to compute the average position of the individual centroids over the last iterations.
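The effect of the averaging epoch can be illustrated with a toy simulation. The snapshots below are hypothetical per-iteration centroid positions (in a real run they would be read off W_dec after each mini-batch update); averaging them is closer to the true centers than taking the last iteration alone.

```python
# Toy illustration of the averaging epoch with synthetic centroid snapshots.
import numpy as np

rng = np.random.default_rng(0)
true_centers = rng.normal(size=(5, 2))          # K = 5 centroids in R^2
noise = 0.2 * rng.normal(size=(100, 5, 2))      # per-iteration mini-batch jitter
snapshots = true_centers + noise                # shape (iterations, K, d)

last = snapshots[-1]                            # last-iteration centroids (noisy)
avg = snapshots.mean(axis=0)                    # averaged centroids (stable)

err_last = np.linalg.norm(last - true_centers)
err_avg = np.linalg.norm(avg - true_centers)
assert err_avg < err_last    # averaging damps the dispersion of the centroids
```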

Embedding with feature maps
The theory behind GMMs limits the CM to a linear decoder, thus enabling merely a linear partition of the input space. In addition, the isotropy assumption, specific to the CM, bars clusters from spreading differently. We alleviate both limitations using an approach similar to that of kernel methods [22]: we non-linearly project the input into a feature space where it will be clustered. However, we do not learn kernel matrices, which provide an implicit feature map. Instead, we learn the feature map explicitly using a deep autoencoder (DAE). The idea is to optimize the CM and the DAE simultaneously, in order to let the latter find distortions of the input space along the way that guide the CM toward a better optimum. Using a deep autoencoder architecture prevents the optimization from producing degenerate feature maps [20]. It also preserves the generative nature of the model: points in the input space can be generated from a combination of centroids in the feature space. We refer to this model as the AE-CM.
The model consists of a clustering module nested into a deep autoencoder. The architecture is illustrated in Figure 2. The first part of the DAE encodes an input x ∈ R^d into a vector z ∈ R^p. Note that the CM now works on the code representation z and not directly on the input x. The code z is fed to the CM and to the decoder of the DAE, yielding two outputs: ẑ, the CM's reconstruction of z, and x̂, the DAE's reconstruction of x.
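The nested forward pass can be sketched in NumPy as follows (the paper's implementation uses TensorFlow; the layer widths here are arbitrary toy values, and the DAE stacks are collapsed to single layers for brevity):

```python
# Sketch of the AE-CM forward pass: DAE encoder -> z -> (CM, DAE decoder).
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d, p, K = 8, 20, 5, 3
W = {name: rng.normal(scale=0.1, size=shape) for name, shape in
     [("enc", (d, p)), ("cm_enc", (p, K)), ("cm_dec", (K, p)), ("dec", (p, d))]}

X = rng.normal(size=(N, d))
Z = leaky_relu(X @ W["enc"])          # DAE encoder: x -> z
Gamma = softmax(Z @ W["cm_enc"])      # CM encoder: z -> responsibilities
Z_hat = Gamma @ W["cm_dec"]           # CM decoder: reconstruction of z
X_hat = Z @ W["dec"]                  # DAE decoder: reconstruction of x

assert Gamma.shape == (N, K) and np.allclose(Gamma.sum(axis=1), 1.0)
assert Z_hat.shape == Z.shape and X_hat.shape == X.shape
```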

Adapting the loss function to a deep architecture
Empirical evaluation showed that current gradient-descent optimizers (e.g., Adam [32]) often return sub-optimal solutions when the reconstruction of the deep autoencoder is simply added to the loss of the CM. To help the optimization find better optima, we add the assumption that the centroids are orthonormal:

    µ µ^T = I_K .    (8)

Although the previous formula involves only µ, which is learned by the nested clustering module, it affects the surrounding DAE. Indeed, the constraint encourages it to produce an embedding where the centroids can simultaneously be orthonormal and minimize the CM's loss. As a consequence, E2 simplifies and E3 becomes null:

    E2 = Σ_{i=1}^N Σ_{k=1}^K γ_ik (1 − γ_ik) ,    E3 = 0 .

Note that the constraint of Eq. (8) is satisfied for the "ideal" clustering, in which case clusters will be mapped to the corners of a simplex in the embedding space, as discussed recently in [31]. In this perspective, our inclusion of this constraint helps guide the clustering towards the ideal clustering, and at the same time simplifies the loss function.
We employ Lagrange multipliers to integrate the orthonormality constraint, so that the final loss can be stated as follows:

    L_AE-CM(Θ) = E1 + E2 − E4 + β Σ_{i=1}^N ||x_i − x̂_i||² + λ ||µ µ^T − I_K||_1 ,

where λ > 0 is the Lagrange multiplier and β > 0 weights the DAE's reconstruction loss. We choose the ℓ1-norm to enforce orthonormality; however, other norms can be used. Note that if the dimension of the embedding p is larger than the number of centroids K, the embedding can always be transformed to satisfy the orthonormality of the centroids. On the other hand, if K > p, the assumption also becomes restrictive in terms of possible clusterings. Nevertheless, its importance can be reduced with a small λ. This assumption also helps prevent the centroids from collapsing, as their norms are required to be 1. An analysis of the Lagrange multiplier is provided in Appendix C.1.
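The orthonormality penalty is straightforward to compute; the sketch below (arbitrary values) shows that it is positive for generic centroids and vanishes for exactly orthonormal ones:

```python
# Sketch of the l1 orthonormality penalty ||mu mu^T - I_K||_1.
import numpy as np

rng = np.random.default_rng(0)
K, p = 3, 5
mu = rng.normal(size=(K, p))                     # centroids in the embedding

penalty = np.abs(mu @ mu.T - np.eye(K)).sum()    # > 0 for generic centroids

# For exactly orthonormal centroids the penalty vanishes:
q, _ = np.linalg.qr(rng.normal(size=(p, K)))     # p x K, orthonormal columns
mu_ortho = q.T                                   # K x p, orthonormal rows
penalty_ortho = np.abs(mu_ortho @ mu_ortho.T - np.eye(K)).sum()

assert penalty > 0
assert penalty_ortho < 1e-9
```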
The loss function of the AE-CM thus depends on four hyper-parameters: the weight β > 0, the concentration α ∈ S K , the Lagrange multiplier λ > 0, and the size of the batches B ∈ N * .

Implementation details
Since the AE-CM builds upon the CM, any implementation also contains an averaging epoch.In case of pre-training, both sub-networks need to be initialized.
We favor a straightforward end-to-end training of the DAE (without drop-out or noise) over a few epochs.The clustering module is then initialized using k-means++ on the embedded dataset.Finally, the CM is optimized alone using L CM for a few epochs.

Experiments
In this section, we evaluate the clustering module and the AE-CM on several data sets covering different types of data. To highlight the generality of our method, we rely only on fully connected architectures, i.e., we do not use convolutional layers even for image data sets. Accordingly, we focus on general-purpose baselines. The experiments were conducted on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 32 GB of RAM, supported by an NVIDIA(R) Tesla V100 SXM2 32GB GPU.

Datasets
We leverage eight common benchmark data sets from the deep clustering literature plus one synthetic data set: MNIST [38] contains 70,000 handwritten images of the digits 0 to 9. The images are grayscale with the digits centered in the 28 × 28 images. The pixel values are normalized before processing.
fMNIST [55] contains 70,000 images of fashion products organized in 10 classes. The images are grayscale with the product centered in the 28 × 28 images. The pixel values are normalized before processing. USPS contains 9,298 images of digits 0 to 9. The images are grayscale with size 28 × 28 pixels. The pixel values are normalized before processing.
Reuters10k, here abbreviated R10K, consists of 800,000 news articles. The dataset is pre-processed as in [20] to return a subset of 10,000 random samples embedded into a 2,000-dimensional space (tf-idf transformation) and distributed over 4 (highly) imbalanced categories. 20News contains 18,846 messages from newsgroups on 20 topics. The features consist of the tf-idf transformation of the 2,000 most frequent words.
10x73k [62] consists of 73,233 RNA transcripts belonging to 8 different cell types [28]. The features consist of the log of the gene expression of the 720 genes with the largest variance. The dataset is relatively sparse, with 40% of the entries being null.
Pendigit [3] consists of 10,992 sequences of coordinates on a tablet, captured as writers write digits; hence 10 classes. The dataset is normalized before processing.
5 Gaussians consists of N = 2,000 points in R² drawn from a mixture of five bi-variate Gaussians (K = 5). The dataset is depicted in Figure 1.

Evaluation metrics
The clustering performance of each model is evaluated using three frequently used metrics: the Adjusted Rand Index (ARI) [27], the Normalized Mutual Information (NMI) [14], and the clustering accuracy (ACC) [36]. These metrics range between 0 and 1, where the latter indicates a perfect clustering. For legibility, values are always multiplied by 100. In each table, scores not statistically different (t-test, p < 0.05) from the best score of the column are marked in boldface. The * indicates the model with the best run. A failure (−) corresponds to an ARI close to 0.
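The three metrics are readily available or easy to construct; the sketch below computes ARI and NMI with scikit-learn, and ACC via the standard Hungarian matching of cluster labels to classes (a common construction, not the paper's code):

```python
# ARI / NMI from scikit-learn; ACC via Hungarian label matching.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one matching of cluster labels to classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((D, D), dtype=np.int64)
    for t, q in zip(y_true, y_pred):
        count[t, q] += 1
    rows, cols = linear_sum_assignment(-count)   # maximize matched pairs
    return count[rows, cols].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]   # perfect clustering up to a label permutation

assert clustering_accuracy(y_true, y_pred) == 1.0
assert np.isclose(adjusted_rand_score(y_true, y_pred), 1.0)
assert np.isclose(normalized_mutual_info_score(y_true, y_pred), 1.0)
```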

Baselines
We include as baselines for the CM: k-means (KM), a GMM with full covariance matrices, and an isotropic GMM (iGMM) with uniform mixture weights. The latter differs from the model the clustering module derives from, but the Dirichlet prior on the responsibilities yields intractable updates, and a Dirichlet prior on the mixture weights harms the performance.
Random and pre-trained initialization are indicated with r and p , respectively.If omitted, the initialization is random.Every experiment is repeated 20 times.

Implementation
Both CM and AE-CM are implemented using TensorFlow 2.1 [1]. We also reimplemented DEC, IDEC and DCN. All deep models but ClusterGAN and VIB-GMM use the same fully connected autoencoder d-500-500-2000-p-2000-500-500-d with leaky ReLU activations, where d and p are the input and feature space dimensions, respectively. For ClusterGAN and VIB-GMM, we used the architecture provided with the original code. The DAE reconstruction loss is the mean square error, regardless of the dataset and of the model, except for VIB-GMM on images, which requires a cross-entropy loss (it underperforms otherwise). CM and its baselines are trained for up to 150 epochs, deep models for 1000 epochs. The hyper-parameters (batch size, p, concentration, etc.) are listed in Tables 13 and 14.
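The shared autoencoder topology can be sketched as follows (NumPy stand-in with random placeholder weights; the actual implementation is in TensorFlow):

```python
# Sketch of the fully connected d-500-500-2000-p-2000-500-500-d autoencoder
# with leaky ReLU activations and linear code/output layers.
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def mlp(sizes, rng):
    return [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(stack, x):
    for i, (W, b) in enumerate(stack):
        x = x @ W + b
        if i < len(stack) - 1:         # last layer of each stack stays linear
            x = leaky_relu(x)
    return x

d, p = 784, 10                          # e.g., MNIST input and feature sizes
rng = np.random.default_rng(0)
encoder = mlp([d, 500, 500, 2000, p], rng)
decoder = mlp([p, 2000, 500, 500, d], rng)

x = rng.normal(size=(2, d))
z = forward(encoder, x)
x_hat = forward(decoder, z)
assert z.shape == (2, p) and x_hat.shape == (2, d)
```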

CM: Evaluation
Recall that the loss of the clustering module is a lower bound of the objective function of its underlying isotropic GMM, which approximates k-means. Moreover, the optimization of the CM is based on gradient descent instead of EM. We compare these three models as a sanity check and show that, despite the differences, they report similar clustering performance. We also present an ablation study of the loss and a model selection scheme for the CM. An analysis of the hyper-parameters and of the Dirichlet prior are reported in Appendices B.1 and B.2, respectively.

CM: Clustering performance
In this experiment, we compare clustering performances and initialization schemes. Random and k-means++ initializations are indicated with the superscripts r and p, respectively. Each experiment is repeated 20 times and we report the average scores. As expected, the clustering module performs similarly to iGMM and k-means on every dataset, with respect to almost every metric, and for any initialization scheme. Often, the k-means++ initialization does not improve the results. In the case of the clustering module, the difference is never significant, except on the 20News dataset.

CM: Runtimes
We compare here the runtimes of the different methods on MNIST. For a fair comparison, we do not use any early-stopping criterion and all the methods are run for exactly 150 epochs. We report in Table 3 averages over 10 runs.
It appears clearly that the EM-based models are much faster. Interestingly, GMM is slower than iGMM, although they share the same implementation. This difference is most likely due to the extra computations needed to update the covariance matrices.

CM: Ablation study of the loss
The loss function of the CM arises as a whole from the Q-function of the underlying GMM. Nevertheless, for additional insight, we perform here an ablation study of its terms. We train a CM with different combinations of the terms of its original loss (Eq. (7)). To highlight the influence of each term, we focus on the 5 Gaussians dataset (depicted in Figure 1). Table 4 reports the clustering performance in terms of ARI. The gap in terms of ARI between the CM trained with the complete loss (first line) and any other combination of its terms confirms the coherence of the loss. The model trained without the reconstruction term E1 reports the worst score. The ablation of single terms indicates that the reconstruction term E1 is the most important, followed by the sparsity term E2. Although the removal of only E3 has the least impact, it is the only term that, combined with E1, reports better performance than a loss function made solely of the reconstruction term E1. Given the variations of each curve in Figure 3, it seems that during the first 25 epochs the optimization focuses on minimizing the reconstruction term, even if this implies an increase of the other terms. However, the complete loss curve does not flatten out until E3 reaches its minimum. From there, a second phase begins where the total loss and the clustering metrics slowly grow in opposite directions. Interestingly, as the metrics increase and the clustering improves, E3 also increases, which is contrary to the expected behavior. On the other hand, E2, which is of the same magnitude as E3 and is also influenced by the sparsity of the assignments, continuously decreases without reaching a minimum. Overall, these curves suggest that both E2 and E3 could be used either to stop the training early or to select the best run.

CM: Model selection
In an unsupervised scenario, true labels are not available.It is thus necessary to have an internal measure to avoid selecting a sub-optimal solution.
There are two natural choices: select either the clustering associated with the lowest loss or the least ambiguous clustering. In the first case, the sparsity of the cluster responsibilities γ_i might be eclipsed by other aspects optimized by L_CM, such as the reconstruction term. On the other hand, by selecting based only on sparsity, we may end up always choosing the most degenerate clustering. Leaning on the analysis of Figure 3, we propose to use L_sp = E2 + E3, which sums the two terms of the loss governing the sparsity of the γ_i's but also involves the norms of the µ_k's.
In Table 5, we report the ARI of the runs with the lowest L_sp for each dataset. For comparison, we also report the average and the largest ARI. Scores selected according to L_sp that are higher than the average are marked in boldface. A model selection based on L_sp finds the best runs only for R10K. Nevertheless, it selects runs with an ARI greater than the average in more than half of the cases. In the other cases, the difference to the average score remains below 1 point of ARI, except for CM^p on 20News. The average absolute difference with the best score is 1.60 for CM^r and 3.10 for CM^p. Without 20News, on which CM^p performs the worst, that average difference drops to 2.25 for CM^p. These are satisfying results that substantiate our heuristic that L_sp = E2 + E3 can be used as an internal metric for the CM.

AE-CM: Evaluation
In this section, we compare the clustering performance of our novel deep clustering model AE-CM against a set of baselines. We study the robustness of the model with respect to the number of clusters and propose a model selection scheme. We also evaluate the quality of the embeddings through k-means clusterings thereof. Finally, we review the generative capabilities of our model.

AE-CM: Clustering performance
We now compare clustering performances and initialization schemes for representative deep clustering models. We report average ARI, NMI and ACC over 20 runs in Table 6. An extended table including standard deviations and best runs can be found in Table 16 of Appendix C.3.
In their original papers, DEC, IDEC and DCN are pre-trained (p). We report here slightly lower scores, which we ascribe to our implementation and slightly different architectures. Nonetheless, the take-home message here is the consistency of their poor results when randomly initialized. This reflects an inability to produce clusters from scratch, regardless of the implementation. Note that, in that case, even k-means outperforms all of them on all the datasets (see Table 10). On the other hand, DKM does return competitive scores for both initialization schemes. Such results raise the question of how much of the clustering is actually performed by DEC, IDEC and DCN, and how much is due to the pre-training phase.
In practice, DKM has proven sensitive to the choice of its λ hyper-parameter and to the duration of the optimization. For example, we could not find a value able to cluster Pendigit. We conjecture that expanding the clustering term of DKM's loss, as we did between Eq. (2) and (3), would improve the robustness of the model.
On six of the datasets, at least one of the variants of AE-CM reports the highest average or the highest best run. In particular, AE-CM r produces competitive clusterings on all datasets except CIFAR10, despite its random initialization. This setback is expected, as clustering models are known to fail to cluster color images from raw pixels [30,26].
Also, our AE-CM with random initialization always surpasses AE+KM, except on R10K and CIFAR10, and even outperforms all the competitors by at least 20 ARI points on 20News. On the downside, AE-CM r is associated with large standard deviations, which implies a less predictable optimization (see Table 16 of Appendix C.3). Therefore, we investigate an internal measure to select the best run.

AE-CM: Runtimes
We compare here the runtimes of the different methods on MNIST. For a fair comparison, all the methods use the same batch size of 256 instances. Table 7 reports averages over 10 runs. We do not use any early-stopping criterion. Note that our implementations of DEC, IDEC and DCN are based on that of AE-CM; hence, these are the most comparable. The advantage goes to the jointly optimized model, AE-CM, which is more than 5 minutes faster. ClusterGAN is the slowest method; it also has the most complex architecture.

AE-CM: Robustness to the number of clusters
In the previous experiments, we provided the true number of clusters to all algorithms for all datasets. In this experiment, we investigate the behavior of the AE-CM when it is set with a different number of clusters on four datasets: MNIST, USPS, R10K and Pendigit. Figure 4 shows the evolution of the ARI (left) and of the homogeneity [48] score (right). The latter measures the purity of each cluster in terms of true labels. The number of clusters varies from 5 to 20; the correct value for all datasets is 10. The ARI curves (left plot) reach their maximum at 10 and then decrease. This behavior is expected, since this metric (as well as NMI and ACC) penalizes the number of clusters. On the other hand, the homogeneity curves (right plot) increase with K and stabilize for K larger than 10. The convergence of these curves indicates that the clustering performance of the AE-CM does not degrade if K is set larger than the ground truth. Such a result suggests that, when K is larger than the ground truth, the AE-CM finds solutions that are partitions of those found with smaller K. This phenomenon is illustrated in Appendix B.3.
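The sweep can be reproduced in spirit with a stand-in clusterer. The sketch below uses k-means on synthetic blobs instead of the AE-CM on the actual datasets, so only the qualitative trend (ARI peaking at the true K, homogeneity saturating above it) is meant to carry over:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Stand-in experiment: sweep the number of clusters K around the
# ground-truth value (10) and track both metrics.
X, y = make_blobs(n_samples=500, centers=10, random_state=0)
scores = {}
for k in range(5, 21):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (adjusted_rand_score(y, pred), homogeneity_score(y, pred))
# ARI penalizes over-segmentation, while homogeneity does not: splitting a
# true class into several clusters still yields pure (homogeneous) clusters.
```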

AE-CM: Model selection
Similarly to the CM, we discuss here a model selection heuristic for the AE-CM. The rationale behind the use of a DAE is to have an encoding facilitating the objective of the clustering module. Hence, we propose to reuse the heuristic of the CM (Section 5.2.4). In Table 8, we report the ARI of the runs with the lowest L sp for each dataset. For comparison, we also report the average and best ARI. Selected scores greater than the average are marked in boldface. Again, scores associated with the lowest L sp are better than the average more than half of the time. The criterion detects the best runs of AE-CM r on MNIST and CIFAR10, and of AE-CM p on MNIST. The average absolute difference to the highest score is 6.5 and 6.6 ARI points for AE-CM r and AE-CM p, respectively. In summary, although L sp as a criterion does not necessarily select the best run, it filters out the worst runs.

AE-CM: Embeddings for k-means
All baselines, including our model, are non-linear extensions of k-means and aim to improve on AE+KM. We audit the methods by running k-means on the embeddings produced by the 20 runs with random initialization on MNIST computed for Table 6, and report the average ARI, NMI and ACC in Table 9.
First, KM reports the worst results. This means that applying k-means on a feature space learned by an autoencoder does improve the quality of the clustering. Next, the results clearly show the superiority of the methods utilizing a joint optimization, i.e., DKM r +KM and our AE-CM r +KM. Interestingly, the scores of DCN r +KM are here better than those of DCN r. This discrepancy is certainly due to the moving average used to update the centroids.
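A minimal sketch of this audit, assuming a stand-in random linear map in place of a trained DAE encoder (all names and data here are hypothetical placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Audit sketch: embed the data with a given encoder, cluster the embedding
# with k-means, then score the partition against the true labels.
def audit_embedding(X, y, encode, n_clusters=10, seed=0):
    Z = encode(X)  # embedded instances
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(Z)
    return {"ARI": adjusted_rand_score(y, pred),
            "NMI": normalized_mutual_info_score(y, pred)}

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 10, size=300)   # placeholder labels
W = rng.normal(size=(50, 10))       # placeholder for a trained encoder
scores = audit_embedding(X, y, lambda X: X @ W)
```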
We continue the analysis of the embeddings with UMAP [44] projections. Figure 5 depicts the projections of different embeddings of the same 2,000 data points. Figure 5a represents the UMAP projection from the input space. For (5b), we used the best run of AE+KM. For consistency, we used that embedding for the AE-CM p; hence, Figure 5c does not show the best run of the AE-CM p. Finally, (5d) is based on the best run of the AE-CM r.
The projection of MNIST from the input space (5a) has two pairs of classes entangled: (3, 5) and (4, 9). The end-to-end training of the DAE (5b) successfully isolates each class except for cluster 4 (dark pink) and cluster 9 (light blue), which stay grouped together, although separable. The AE-CM p (5c) further contracts the clusters around the centroids found by AE+KM, but fails to separate 4 and 9. Remark that even the best run of the AE-CM p does not correctly split these data points. The centroids for 4 and 9 in Figures 5b and c are in comparable positions: they align along the gap separating the true clusters. This suggests that the optimization of the AE-CM p did not move them much. This remark applies to the pre-trained baselines as well. Lastly, the AE-CM r successfully produces homogeneous groups (5d). Remark that the original entanglements of the pairs (3, 5) and (4, 9) are suggested by the trails between the corresponding clusters.
The previous observations yield two insights on the behavior of the AE-CM. If the AE-CM starts from an embedding associated with a low reconstruction loss for the DAE, the optimization contracts the clusters, which yields higher ARI scores. However, it is then unable to move the centroids to reach another local optimum. Although the AE-CM r separates (4, 9), it also produces more spread-out clusters than those of the AE-CM p. The improved performance of the latter over AE+KM indicates that the AE-CM r would benefit from tighter groups.

AE-CM: Sample generation and interpolation
Thanks to the reversible feature maps obtained by the DAE, both the AE-CM and its baselines (except DEC) are generative models. Figure 6 shows the decoding of the centroids of the best runs of AE+KM (ARI=67.7), IDEC p (ARI=77.2), DKM r (ARI=83.6) and AE-CM r (ARI=88.6). AE+KM's and IDEC p's centroids for the 4 and 9 both look like 9's. With an ARI and an ACC larger than 80 and 90, respectively, DKM r and AE-CM r both clustered the data correctly and found correct centroids for each class. Both models produce clear images for each class, which align reasonably well with the washed-out average image of the respective classes (first row). Being a generative model, the AE-CM can also be used to interpolate between classes. Figure 7 shows a path made of nine interpolations between the ten centroids of the AE-CM r. We observe smooth transitions between all the pairs, which indicates that the model learned a smooth manifold from noise (random initialization).
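The interpolation of Figure 7 can be sketched as follows, with a placeholder identity decoder standing in for the decoder of the trained DAE:

```python
import numpy as np

# Sketch of centroid interpolation: decode evenly spaced convex combinations
# of two centroids in the embedding space. In the model, `decoder` is the
# decoder of the trained DAE; here it is a placeholder identity map.
def interpolate(decoder, mu_a, mu_b, steps=9):
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([decoder((1 - t) * mu_a + t * mu_b) for t in ts])

decoder = lambda z: z                     # placeholder decoder
path = interpolate(decoder, np.zeros(10), np.ones(10))
# path[0] decodes the first centroid, path[-1] the second.
```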

Conclusion
We presented a novel clustering algorithm that is jointly optimized with the embedding of an autoencoder to allow for nonlinear and interpretable clusterings. As a key result, we first showed that the objective function of an isotropic GMM can be turned into a loss function for autoencoders. The clustering module (CM), defined as the smallest resulting network, was shown to perform similarly to its underlying GMM in extensive empirical evaluations.
Importantly, we showed how the clustering module can be straightforwardly incorporated into deep autoencoders to allow for nonlinear clusterings. The resulting clustering network, the AE-CM, empirically outperformed existing centroid-based deep clustering architectures and performed on par with representative contemporary state-of-the-art deep clustering strategies. Nevertheless, the AE-CM, and to a lesser extent the clustering module itself, presented a greater volatility when trained from a randomly initialized network. We expect that this point could be improved by an annealing strategy on the parameter, similar to what is done in DKM and VIB-GMM.
A future line of work consists of extending the panel of deep architectures into which the clustering module can be nested. Especially in order to improve performance on image datasets, it is necessary to involve convolutions. However, standard image-specific architectures are not structured as autoencoders. This raises the question of the robustness of our model with respect to the symmetry of the DAE, especially for applications where the computation of class representatives is not a must.
From a theoretical point of view, we believe that the derivations that led to the neural interpretation of Gaussian mixture models could benefit other mixture models, such as the von Mises-Fisher mixture models [24] or hidden Markov models (HMM). The case of Gaussian-HMM seems especially promising as it allows to bridge with recurrent networks [49].

For the two moons dataset, we want the AE-CM to learn an embedding function approximating a polynomial of degree at least 3. Therefore, we use 3 layers with 20 units followed by a layer with a single unit. The decoder is the mirror of the encoder. For the two circles dataset, the encoder consists of a quadratic layer, i.e., x = (x 1, x 2) is mapped to (x 1 ^2, x 2 ^2, x 2 x 1), followed by a dense layer with a 3-dimensional output. That way, the function associated with the encoder is a feature map for a polynomial kernel of degree 2. The training just has to find the proper weights of the quadratic function. As for the decoder, it consists of a single layer. The AE-CM successfully clustered the two circles and two moons datasets, suggesting that it indeed learned embedding functions associated with polynomial kernels of degree 2 and at least 3, respectively.
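A minimal numpy sketch of this two-circles encoder (the dense weights are random here; in the model they are learned):

```python
import numpy as np

# Sketch of the two-circles encoder described above: a fixed quadratic
# feature map (x1^2, x2^2, x1*x2) followed by a dense layer with a
# 3-dimensional output. Training only has to find the dense weights.
def quadratic_features(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 ** 2, x2 ** 2, x1 * x2], axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))               # the learnable dense layer

def encode(X):
    return quadratic_features(X) @ W

Z = encode(rng.normal(size=(8, 2)))       # embedding of 8 points, shape (8, 3)
```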
Finally, since the last dataset consists of a square filled with a single class, we chose an α lower than 1 for both the CM and the AE-CM. Such a setting informs the model that the three classes will be very imbalanced. The two models reacted differently: only the AE-CM was able to perfectly assign all the points to a single cluster.

B.1 CM: Hyper-parameters
The clustering module depends on two hyper-parameters: the size of the mini-batches and the concentration of the Dirichlet prior. To visualize their influence on the clustering performance, we trained a CM on Pendigit with various batch sizes and concentrations. Figure 9 shows the variation of the final ARI score for both a random (left) and a k-means++ initialization (right). Both axes' scales are logarithmic: exponential base for the x-axis and base 10 for the y-axis. Each combination is run once. The ARI of each dot is the average of the nine neighboring combinations. Black dots indicate an average ARI greater than the ones reported in Table 10.
There is a lower bound on α below which the optimization of a randomly initialized model underperforms or fails. The k-means++ initialization removes this boundary and spreads out the well-performing area. The shape of the well-performing region in both settings suggests that the hyper-parameters can be tuned by fitting a bi-variate Gaussian distribution.
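The neighborhood averaging used for Figure 9 can be sketched as follows (the grid here is a random placeholder for the actual batch-size × concentration results):

```python
import numpy as np

# Sketch of the smoothing applied to the hyper-parameter grid: each cell is
# replaced by the mean of its 3x3 neighborhood, i.e. "the average of the
# nine neighboring combinations". Borders are replicated before averaging.
def smooth3x3(grid):
    padded = np.pad(grid, 1, mode="edge")
    out = np.empty_like(grid, dtype=float)
    h, w = grid.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 3, j:j + 3].mean()
    return out

rng = np.random.default_rng(0)
ari_grid = rng.uniform(size=(6, 8))   # placeholder ARI per combination
smoothed = smooth3x3(ari_grid)
```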

B.2 CM: asymmetric prior
So far we considered only symmetric Dirichlet priors (α = α 1 K, α ∈ R +), regardless of the imbalance between the labels. Here, we repeat the previous experiment using the true label distribution as the prior, i.e., α = α f, where f = (f 1, ..., f K) ∈ S K contains the frequency of each label. In terms of implementation, E 4 is computed by sorting both α and γ i. We evaluate results on the 10x73k and Pendigit datasets, which have unbalanced and balanced classes, respectively. We consider here only random initializations. Again, black dots indicate an average ARI greater than the ones reported in Table 10.
Figure 10b contains more black dots and a larger red area compared to 10a. The changes are greater than between Figures 10c and 9b. This discrepancy between the datasets illustrates that unbalanced datasets benefit more from a custom prior. However, a higher concentration is needed to enforce the distribution: the lower bound on α is higher in Figures 10b and c than in 10a and 9a, respectively. Using the true class distribution, especially if the data is unbalanced, does ease the hyper-parameter selection. Nevertheless, such information is not always known a priori.
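A minimal sketch of the asymmetric prior, assuming access to the true labels (toy data; the term E 4 itself is not reproduced here, only the construction of the prior and the sorting step mentioned above):

```python
import numpy as np

# Sketch of the asymmetric prior: alpha = concentration * f, where f holds
# the empirical label frequencies (a point on the simplex). Following the
# implementation note, both the prior and the (average) responsibilities are
# sorted before being compared, making the comparison invariant to the
# arbitrary ordering of the clusters.
def asymmetric_prior(labels, concentration):
    counts = np.bincount(labels)
    f = counts / counts.sum()
    return concentration * f

labels = np.array([0, 0, 0, 1, 1, 2])               # toy imbalanced labels
alpha = asymmetric_prior(labels, concentration=6.0)  # -> [3., 2., 1.]
gamma_bar = np.array([0.2, 0.5, 0.3])                # placeholder responsibilities
alpha_sorted, gamma_sorted = np.sort(alpha), np.sort(gamma_bar)
```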

B.3 CM: Merging clusters with E 3
We claimed in Section 3.2 that E 3 favors the merging of clusters. To illustrate this phenomenon, we train CM r on the 5 Gaussians dataset with twice the number of true clusters (i.e., K = 10). We compare three variants of the CM's loss function: without E 3, with E 3, and with E 3 multiplied by 1.5. The final centroids and clusterings are depicted in Figure 11. For legibility, overlapping centroids are slightly shifted using Gaussian noise.
A vanilla CM (Figure 11b) correctly positions five pairs of centroids on top of the true cluster centroids. Without E 3 (Fig. 11a), the model fails to merge the clusters properly: while five centroids are close to each of the true clusters, the five remaining ones gather around the origin. Conversely, if E 3 is weighted more strongly (Fig. 11c), the model becomes so prone to merging clusters that it partitions the left cloud using only two groups of centroids.

B.5 CM: Empirical setting
For the experiments reported in Section 5.2, the clustering module is trained over 150 epochs using the Adam optimizer (learning rate = 0.001). The concentration α and batch size B used for each dataset are reported in Table 13. The hyper-parameters were optimized using Bayesian optimization over 2,000 iterations.

C.1 AE-CM: Hyper-parameters
Besides the architecture of the DAE, the AE-CM has two more hyper-parameters than the CM: the weight β on the reconstruction of the DAE and the Lagrange coefficient λ enforcing the orthonormality of the centroids. To visualize their influence on the clustering performance, an AE-CM is trained on MNIST with various values of β and λ. Each combination is repeated five times. Note that when β = 0 the setting is equivalent to training only the encoder of the DAE (akin to DEC). Also, if λ = 0 the orthogonality constraint is omitted. Figure 12 represents the average ARI scores for each combination and both initialization schemes as a heat-map.
With a random initialization (Figure 12a), if both β and λ are not large enough, the clustering fails, except when β = 1. In that case, the model performs well for every value of λ, even for λ = 0, i.e., without the orthonormality constraint. Conversely, the AE-CM r always fails if trained without the reconstruction of the DAE (β = 0). As the order of magnitude of both parameters increases, the performance worsens.
The distribution for AE-CM p (Figure 12b) presents similarities with the previous one. Overall, the average performances are better for each combination. When β is very small, the ARI exceeds 0.5. The performance also decreases as β and λ become larger. Most noticeably, the band around β = 1 is still there, but it is thicker. This is in line with the similar analysis for the CM (Section B.1): pre-trained models are less sensitive to hyper-parameters.

C.2 AE-CM: Empirical setting
For the experiments reported in Section 5.3, AE-CM is trained over 150 epochs using the Adam optimizer (learning rate = 0.001). Each layer of the DAE is activated with a leaky ReLU with a slope of 0.2, except for the last layers of the encoder and of the decoder. AE-CM depends on four hyper-parameters: the weight β > 0, the concentration α ∈ S K, the Lagrange multiplier λ > 0 and the size of the batches B ∈ N*. These four hyper-parameters, plus the dimension of the feature space p, were optimized using Bayesian optimization over 2,000 iterations. The selected values are reported in Table 14. The same architecture is used for the baselines, except for DKM where the activations are all ReLU. DEC, IDEC and DCN update their clustering every u iterations; IDEC and DCN rely on a hyper-parameter γ and DKM on a λ. Regarding DKM, the annealing process of the softmax parameter is updated every 5 epochs. We report in Table 15 the values used for the baselines on each dataset.
γ ik denotes the posterior probability of z i = k, also called the responsibility of cluster k for a data point x i; φ k := p(z i = k) is the prior probability, or mixture weight, of cluster k; and µ k ∈ R d is the centroid of cluster k. The average responsibility of cluster k is γ̄ k = (1/N) Σ i=1..N γ ik, with γ̄ ∈ S K. The concentration hyper-parameter of the Dirichlet prior is α.

Fig. 1
Fig. 1 The intermediate centroids of the last epoch are spread, whereas their averages almost match the true centroids.

Fig. 2
Fig. 2 Schematic representation of the AE-CM. Combining a clustering module and a deep autoencoder allows to jointly learn a clustering and an embedding.

Fig. 3
Fig. 3 Evolution of the total loss, of each of its terms and of the clustering metrics during the training of the CM on the 5 Gaussians dataset.

Figure 3
Figure 3 illustrates the behavior of each term of the loss of the CM (L CM) during training.

Fig. 4
Fig. 4 Adjusted Rand index and homogeneity score vs. the number of clusters. The correct number of clusters is K = 10.

Fig. 5
Fig. 5 UMAP representation of a subset of MNIST and embeddings thereof learned by AE, IDEC and AE-CM.The squares indicate the centroids.

Fig. 6
Fig. 6 Centroids mapped back to image space for AE+KM, IDEC p , DKM r , and AE-CM r .The first row displays the average image of each class.

Fig. 7
Fig. 7 Linear interpolations between different centroids (plots with border) produced with the AE-CM.

Fig. 9
Fig. 9 Clustering performance (ARI) for different combinations of batch size, concentration and initialization scheme. Black dots indicate an average ARI greater than the ones reported in Table 10.

Fig. 10
Fig. 10 Clustering performance (ARI) for different combinations of batch size, concentration and prior distribution. Black dots indicate an average ARI greater than the ones reported in Table 10.

Fig. 11
Fig. 11 Final positions of the centroids depending on the importance of E 3 in the loss of the clustering module.

Fig. 12
Fig. 12 Clustering performance (ARI) of AE-CM on MNIST for different values of β and λ and initialization scheme.

Table 1
The clustering performance (×100) of different models on the selected datasets.

Table 10
For each dataset, scores not statistically significantly different from the highest score (p < .05) are marked in boldface. The * indicates the model with the highest best score among its 20 runs. An extended table including standard deviations and best runs can be found in Table 12 of Appendix B.4. Average runtimes for MNIST are reported in Table 3.

Table 2
Average runtime of each model to cluster MNIST in 150 epochs.

Table 3
Average runtime of each model to cluster MNIST in 150 epochs.

Table 4
Clustering performance (ARI) of the CM on the 5 Gaussians dataset trained with various combinations of the terms of its original loss (first line).

Table 5
Adjusted Rand index of the run with the lowest Lsp, of the average run and of the best run. Scores larger than the average are marked in boldface.

Table 6
The clustering scores (×100) of representative deep clustering models on the selected datasets.

Table 7
Average runtime of each model to cluster MNIST in 150 epochs.

Table 8
Adjusted Rand index of the run with lowest Lsp, the average run and the best run.Scores larger than the average are marked in boldface.

Table 9
Average clustering performance of k-means on different embeddings.

Table 11
Hyper-parameters used for clustering the toy datasets.

Table 12
contains the full clustering results for CM, including standard deviation and the best run.

Table 12
The clustering results of the methods on the experimental datasets.

Table 13
Hyper-parameters used for the experiments in Section 5.2.

Table 14
Hyper-parameters used to train AE-CM for the experiments in Section 5.3.

Table 15
Hyper-parameters used to train the baselines for the experiments in Section 5.3.

Table 16
contains the full clustering results for AE-CM, including standard deviation and the best run.

Table 16
Clustering performance of the methods on the experimental datasets including average, standard deviation and best run.