Abstract
Deep embedded clustering has become a dominant approach to unsupervised categorization of objects with deep neural networks. The optimization of the most popular methods alternates between the training of a deep autoencoder and a k-means clustering of the autoencoder’s embedding. The diachronic setting, however, prevents the former from benefiting from valuable information acquired by the latter. In this paper, we present an alternative where the autoencoder and the clustering are learned simultaneously. This is achieved by providing novel theoretical insight, where we show that the objective function of a certain class of Gaussian mixture models (GMMs) can naturally be rephrased as the loss function of a one-hidden-layer autoencoder, thus inheriting the built-in clustering capabilities of the GMM. That simple neural network, referred to as the clustering module, can be integrated into a deep autoencoder, resulting in a deep clustering model able to jointly learn a clustering and an embedding. Experiments confirm the equivalence between the clustering module and Gaussian mixture models. Further evaluations affirm the empirical relevance of our deep architecture as it outperforms related baselines on several data sets.
1 Introduction
Clustering is one of the oldest and most difficult problems in machine learning, and a great deal of different clustering techniques have been proposed in the literature. Perhaps the most prominent clustering approach is the k-means algorithm (Lloyd, 1982). Its simplicity renders the algorithm particularly attractive to practitioners beyond the field of data mining and machine learning (Punj & Stewart, 1983; Gasch & Eisen, 2002; Hennig & Liao, 2013; Frandsen et al., 2015). A variety of related algorithms and extensions have been studied (Dunn, 1973; Krishna & Murty, 1999; Dhillon et al., 2004). The model behind the standard k-means algorithm is an isotropic Gaussian mixture model (GMM) (Kulis & Jordan, 2012; Lücke & Forster, 2019). As such, it can only produce linear partitions of the input space.
A dominating recent line of research aims to alleviate this issue by learning an embedding of the data with a deep autoencoder (DAE) and having variants of Gaussian mixture models operate on the embedded instances (Xie et al., 2016; Yang et al., 2017; Guo et al., 2017; Fard et al., 2020; Miklautz et al., 2020). This can be referred to as deep embedded clustering. An important reason for the emergence of this modern line of research in clustering is that it combines elements (the deep autoencoder and the GMM clustering) that are theoretically well understood when considered separately. In practice, since the embedding map is nonlinear, the otherwise linear clustering translates to a nonlinear solution in the input space. This approach resembles the rationale behind kernel methods (Guyon et al., 1993). The difference lies in computing a kernel matrix before learning the clustering versus learning the feature map as the function approximated by the encoder. Another advantage of an autoencoder (AE) based approach is provided by the decoder of the network, which makes it possible to map the centroids back into (an approximated version of) the input space. Depending on the kernel function/matrix, this is not always possible with traditional approaches.
However, even with the advantages of deep embedded clustering described above, the fact that the embedding procedure and the clustering procedure are decoupled means that the former cannot benefit from the latter, and vice versa. In the most naive approach, the deep autoencoder is first trained, and then the features output by the autoencoder are clustered. In Xie et al. (2016), an improvement is made by alternately shrinking the clusters in the embedded space around centroids and then updating the latter using k-means. The shrinking may, however, lead to an adaptation of the clustering to the embedding instead of the desired clustering-induced embedding. In general, these approaches cannot handle a random initialization, which usually leads to noisy embeddings and, in turn, meaningless clusterings.
Despite the obvious need, few deep embedded clustering procedures have been proposed featuring an inherent capability of bringing out clustering structure in a joint and simultaneous procedure. This is clearly due to the lack of a theoretical foundation that could support such an objective. In this paper, we provide novel theoretical insight that enables us to simultaneously optimize both the embedding and the clustering by relying on a neural network interpretation of Gaussian mixture models. Our solution derives from the fact that, during the EM, the objective function of a certain class of GMMs can be rephrased as the loss function of an autoencoder, as we show as a key result. A network trained with that loss is thus guaranteed to learn a linear clustering. Our contributions are threefold. (1) We state and prove a theorem connecting Gaussian mixture models and autoencoders. (2) We define the clustering module (CM) as a one-hidden-layer AE with \({\text {softmax}}\) activation trained with the loss resulting from the previous point. (3) We integrate the CM into a deep autoencoder to form a novel class of deep clustering architectures, denoted AE-CM. Building on the theoretical insight, the experiments confirm the empirical equivalence of the clustering module with GMMs. As for the AE-CM, the model outperforms, on several datasets, baselines that extend Gaussian mixture models to nonlinear clustering with deep autoencoders.
The remainder of the paper is organized as follows. Section 2 reviews related works and baselines and Sect. 3 provides the theory underpinning our contributions. The clustering module and the AECM are introduced in Sect. 4. We report empirical evaluations in Sect. 5. Section 6 concludes the paper.
2 Related work
Recently, several deep learning models have been proposed to group data in an unsupervised manner (Tian et al., 2014; Springenberg, 2016; Yang et al., 2016; Kampffmeyer et al., 2019; Ghasedi Dizaji et al., 2017; Haeusser et al., 2018; McConville et al., 2019; Mukherjee et al., 2019; Uğur et al., 2020; Van Gansbeke et al., 2020). Since our main contribution unravels the connection between Gaussian mixture models and autoencoders, models based on associative (Haeusser et al., 2018; Yang et al., 2019), spectral (Tian et al., 2014; Yang et al., 2019; Bianchi et al., 2020) or subspace (Zhang et al., 2019; Miklautz et al., 2020) clustering are outside the scope of this paper. Furthermore, unlike (Yang et al., 2016; Chang et al., 2017; Kampffmeyer et al., 2019; Ghasedi Dizaji et al., 2017; Guo et al., 2017; Caron et al., 2018; Ji et al., 2019; Van Gansbeke et al., 2020), our model falls within the category of general-purpose deep embedded clustering models which build upon GMMs (Xie et al., 2016; Guo et al., 2017; Yang et al., 2017; Fard et al., 2020). We mention that recent works related to specific topics such as perturbation robustness and non-redundant subspace clustering also utilize ideas that are to some degree related to the general concept of deep embedded clustering (Yang et al., 2020; Miklautz et al., 2020). Given the high generality of our method, we detail models that are likewise built without advanced mechanisms, but can still support them.
One of the first approaches in this latter category of dominating modern deep clustering algorithms is DEC (Xie et al., 2016), which consists of an embedding network coupled with an ad hoc matrix representing the centroids. During training, the network is trained to shrink the data around their closest centroids in a way similar to t-SNE (Maaten & Hinton, 2008). However, instead of matching the distributions in the input and feature spaces, the target distribution is a mixture of Student-t models inferred from a k-means clustering of the embedding itself. The latter is updated every few iterations to keep track of the evolution of the embedding, and the centroids are stored in the ad hoc matrix. In order to avoid suboptimal solutions when starting from random initialization, the embedding network is pretrained. Nevertheless, the difference between the two training phases can cause stability issues, such as collapsing centroids. IDEC (Guo et al., 2017) extends DEC and alleviates this issue to some degree by also using the reconstruction loss of the DAE during the clustering phase. Although IDEC is more robust than DEC, both are highly dependent on the initial embedding, as the clustering phase quickly shrinks the data around the centroids. The Deep Clustering Network (DCN) (Yang et al., 2017) further strengthens the deep embedded clustering framework; it is based on an architecture comparable to DEC and IDEC but includes hard clustering. The optimization scheme consists of three steps: (1) optimize the DAE to reconstruct and bring the data points closer to their closest centroids (similar to k-means); (2) update the assignments to the closest clusters; (3) update the centroids iteratively using the assignments. Although the alternating optimization helps prevent instabilities during the optimization, DCN still requires initialization schemes in order to reach good solutions.
In contrast, our proposed model, while being theoretically driven, removes the need for the alternating optimization schemes the previous methods rely on. Since we learn a variant of a GMM using a neural network, the optimizations of the embedding and of the clustering profit from each other. In practice, our model is thus less reliant on pretraining schemes.
To some degree, our approach is similar to the recent method known as DKM (Fard et al., 2020). That model shares the same architecture as DEC, namely, a deep autoencoder and an ad hoc matrix representing the centroids. However, here, the model optimizes the network and the centroids simultaneously. To achieve that, the loss function includes a term similar to the loss of fuzzy c-means (Bezdek, 2013), i.e., k-means with soft assignments. The cluster coefficients are computed from the distances to the centroids and normalized using a parameterized \({\text {softmax}}\). In order to convert soft assignments into hard ones, the parameter follows an annealing strategy. In our model, the assignment probability function is indirectly related to the distance to the closest cluster representative, as used in DKM. Our loss function aims to minimize the distance between the data points and their reconstructions as convex combinations of the centroids. This means that the learning of the centroids and of the assignment probabilities regularize each other, mitigating the need for an annealing strategy.
There are also other recent and promising approaches to clustering that leverage deep neural networks. However, these operate in a very different manner compared to deep embedded clustering, which our novel theoretical insight sheds new light on and which enables our proposed autoencoder optimization for joint clustering and embedding. We nevertheless briefly review some of these approaches and also compare to representatives of such methods in our experiments. In particular, a great deal of work has focused on the generative aspect of Gaussian mixture models and studied variations based on deep generative models such as variational autoencoders (VAEs) (Kingma & Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014).
Adversarial approaches for clustering have been proposed, examples being CatGAN (Springenberg, 2016) and adversarial autoencoders (AAE) (Makhzani et al., 2015). The former optimizes the mutual information and predicts a categorical distribution of classes in the data while at the same time maximizing the robustness against an adversarial generative model. Instead, AAE uses two adversarial networks to impose a Gaussian and a categorical distribution on the latent space. The recent ClusterGAN (Mukherjee et al., 2019) makes an original use of an autoencoder, as the input is not a data point but a sample drawn from the product of a multinomial and a Gaussian distribution. The decoded sample is then fed to the discriminator and the encoder. This method is also robust to random network initialization.
Variational approaches that enable clustering include methods like the Gaussian Mixture VAE (GMVAE) (Dilokthanakul et al., 2016) and Variational Deep Embedding (VaDE) (Jiang et al., 2016). Although only the former explicitly refers to GMMs, both methods are similar. The difference is that the parameters of the Gaussian distributions in the embedded space depend only on the cluster index in VaDE, whereas GMVAE involves a second random variable. Both models require a pretrained autoencoder. The variational information bottleneck with Gaussian mixture model (VIB-GMM) (Uğur et al., 2020) also assumes a mixture of Gaussian distributions on the embedding; however, the model is trained according to the deep variational information bottleneck (VIB) framework. The latter can be seen as an information-theory-driven generalization of GANs and VAEs (Tishby et al., 2000; Alemi et al., 2016). VIB-GMM proves to be a robust alternative to previous approaches, as it is able to produce meaningful clusterings from a randomly initialized autoencoder.
3 Theoretical groundwork
Throughout the paper, the set of positive integers is denoted by \({\mathbb {N}}^*\). The set of \(d\)-dimensional stochastic vectors is written as \({\mathbb {S}}^d = \{ {\varvec{x}}\in {{\mathbb {R}}}_{\ge 0}^d \, : \, \sum _{i=1}^d x_i = 1\}.\) When the context allows, the ranges of the indices are abbreviated using the upper bound, e.g., a vector \({\varvec{x}}\in {\mathbb {R}}^d\) decomposes as \({\varvec{x}}=\langle x_j \rangle _{1\le j \le d}= \langle x_j \rangle _d\). The zero vector of \({\mathbb {R}}^d\) is written as \({\varvec{0}}_d\). The notation extends to any number. The identity matrix of \({\mathbb {R}}^{d \times d}\) is denoted by \({\mathbf {I}}_d\).
3.1 From GMMs to Autoencoders
We aim to fit an isotropic Gaussian mixture model with \(K \in {\mathbb {N}}^*\) components and a Dirichlet prior on the average cluster responsibilities to a dataset \({\mathcal {X}}=\{{\varvec{x}}_i\}_N \subset {\mathbb {R}}^d\). The expected value of the complete-data log-likelihood function of the model, also called the \({\mathcal {Q}}\)-function (Bishop, 2006), is:
In this expression, \(z_i \in {\mathbb {N}}\) is the cluster assignment of the data point \({\varvec{x}}_i\), \(\gamma _{ik}:= p(z_i=k \mid {\varvec{x}}_i)\) is the posterior probability of \(z_i=k\), also called the responsibility of cluster k for the data point \({\varvec{x}}_i\), \(\phi _{k}:= p(z_i=k)\) is the prior probability or mixture weight of cluster k, and \({\varvec{\mu }}_k \in {\mathbb {R}}^{d}\) is the centroid of cluster k. The average responsibility of cluster k is \(\tilde{ \gamma }_k = \frac{1}{N} \sum _{i=1}^N \gamma _{ik}\), with \(\tilde{\varvec{\gamma }} = \langle \tilde{\gamma }_k \rangle _K \in {\mathbb {S}}^K\). The concentration hyperparameter of the Dirichlet prior is \(\langle \alpha _k \rangle _K \in {\mathbb {R}}^K {\setminus } \{ {\varvec{0}_K} \}\). The covariance matrices do not appear in this expression as they are all constant and equal to \(\frac{1}{2} {\mathbf {I}}_d\) due to the isotropic assumption. We summarize the parameters into the matrices \(\varvec{\Gamma }\in {\mathbb {R}}^{N \times K}\), \(\varvec{\Phi }\in {\mathbb {S}}^{K}\) and \({\varvec{\mu }}\in {\mathbb {R}}^{K \times d}\). Note that the cluster responsibility vector of \({\varvec{x}}_i\) is stochastic, i.e., \({\varvec{\gamma }}_i = \langle \gamma _{ik} \rangle _K \in {\mathbb {S}}^K\).
Although the isotropic covariances allow for great simplifications, they restrict the model to spherical Gaussian distributions. We later alleviate this constraint by introducing a deep autoencoder. The Dirichlet prior involves an extra parameter but also smoothens the optimization. Note that these two assumptions make the model a relaxation of the one underlying k-means (Kulis & Jordan, 2012; Lücke & Forster, 2019).
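To make the relaxation concrete, the E-step of such an isotropic model with covariance \(\frac{1}{2} {\mathbf {I}}_d\) computes the responsibilities as a softmax over negative squared distances to the centroids. The following NumPy sketch (function name ours, not from the paper) illustrates the computation:

```python
import numpy as np

def e_step(X, mu, phi):
    """E-step of an isotropic GMM with fixed covariance (1/2)I:
    gamma_ik is proportional to phi_k * exp(-||x_i - mu_k||^2)."""
    # pairwise squared distances, shape (N, K)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    log_g = np.log(phi)[None, :] - d2          # unnormalized log-responsibilities
    log_g -= log_g.max(axis=1, keepdims=True)  # numerical stability
    g = np.exp(log_g)
    return g / g.sum(axis=1, keepdims=True)    # each row is a stochastic vector
```

Since the variance is fixed at \(\frac{1}{2}\), the exponent carries no scale factor; hard-assigning each point to the arg-max of its row recovers the k-means limit.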
Theorem 1
The maximization of the \({\mathcal {Q}}\)-function in Eq. (1) with respect to \(\varvec{\Phi }\) yields a term that can be interpreted as the reconstruction loss of an autoencoder.
Proof
During the EM algorithm, the maximization with respect to \(\varvec{\Phi }\) updates the latter as the average cluster responsibility vector: \(\phi _k = \tilde{\gamma }_k = \frac{1}{N} \sum _{i=1}^N \gamma _{ik}.\)
Using this result, the \({\mathcal {Q}}\)-function can be rephrased as
The first term corresponds to the entropy of \(\langle \tilde{ \gamma }_k\rangle _K\) which, given the Dirichlet prior, is constant and can thus be omitted. We now expand \(\sum _{k=1}^K \gamma _{ik}\Vert \varvec{x}_i - {\varvec{\mu }}_k \Vert ^2\) by adding and subtracting the squared norm of \(\bar{{\varvec{x}}}_i = \sum _{k=1}^K \gamma _{ik} {\varvec{\mu }}_k\): for any given \(i \in [1 \mathrel {{.}\,{.}}N]\), \(\sum _{k=1}^K \gamma _{ik}\Vert \varvec{x}_i - {\varvec{\mu }}_k \Vert ^2 = \Vert \varvec{x}_i - \bar{{\varvec{x}}}_i \Vert ^2 + \sum _{k=1}^K \gamma _{ik}\Vert {\varvec{\mu }}_k \Vert ^2 - \Vert \bar{{\varvec{x}}}_i \Vert ^2.\)
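The expansion can be checked numerically: for a stochastic responsibility vector, adding and subtracting \(\Vert \bar{{\varvec{x}}}_i\Vert ^2\) leaves the weighted sum of squared distances unchanged. A small NumPy sanity check (random values and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 3
x = rng.normal(size=d)
mu = rng.normal(size=(K, d))
g = rng.dirichlet(np.ones(K))   # stochastic responsibility vector
x_bar = g @ mu                  # reconstruction as a convex combination of centroids

# sum_k g_k ||x - mu_k||^2  ==  ||x - x_bar||^2 + sum_k g_k ||mu_k||^2 - ||x_bar||^2
lhs = sum(g[k] * np.linalg.norm(x - mu[k]) ** 2 for k in range(K))
rhs = (np.linalg.norm(x - x_bar) ** 2
       + sum(g[k] * np.linalg.norm(mu[k]) ** 2 for k in range(K))
       - np.linalg.norm(x_bar) ** 2)
assert np.isclose(lhs, rhs)
```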
Note that this simplification rests on the isotropic assumption; it is not obvious how to reach a similar result without it. The \({\mathcal {Q}}\)-function becomes:
The variable \(\bar{{\varvec{x}}}_i\) is actually a function of \({\varvec{x}}\) that factorizes into two functions \({\mathcal {F}}\) and \({\mathcal {G}}\). The former computes \(\varvec{\gamma }_{i}\) which, as a posterior probability, is a function of \({\varvec{x}}_i\). The latter is the dotproduct with \({\varvec{\mu }}\).
The first term of Eq. (3) can thus be interpreted as the reconstruction term characteristic of the loss of an autoencoder, with encoder and decoder functions \({\mathcal {F}}\) and \({\mathcal {G}}\), respectively. \(\square\)
3.2 Analysis of the terms
The four terms of the expansion of the \({\mathcal {Q}}\)-function in Eq. (3) give new insight into the training of GMMs.
\(E_1\): Reconstruction The term \(E_1\) suggests an autoencoder structure to optimize \({\mathcal {Q}}\). The decoder, \({\mathcal {G}}\), is a linear function. However, without loss of generality, it can be treated in practice as an affine function, i.e., a single-layer network with bias. Regarding the encoder \({\mathcal {F}}\), the architecture can be anything as long as it has a stochastic output.
\(E_2\): Sparsity and regularization The second term \(E_2\) is related to the Gini impurity index (Breiman et al., 1984) applied to \(\varvec{\gamma }_{i}\):
where \(\Vert .\Vert _F\) is the Frobenius norm. The Gini index is standard in decision tree theory to select branching features and is an alternative to the entropy. It is nonnegative and null if, and only if, \(\varvec{\gamma }_i\) is a one-hot vector. Hence, minimizing this term favors sparse \(\varvec{\gamma }_i\), resulting in clearer assignments.
The terms \(\Vert {\varvec{\mu }}_k\Vert ^2\) play a role similar to an \(\ell _2\)-regularization: they prevent the centroids from diverging away from the data points. However, they may also favor the trivial solution where all the centroids merge at zero.
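The sparsity effect can be illustrated with the Gini impurity of a stochastic vector, \(1 - \sum _k \gamma _{ik}^2\) (a short helper of ours, assuming this standard form of the index):

```python
import numpy as np

def gini(g):
    """Gini impurity of a stochastic vector: 1 - sum_k g_k^2.
    Null iff g is one-hot, maximal for the uniform vector."""
    g = np.asarray(g, dtype=float)
    return float(1.0 - np.sum(g ** 2))
```

A one-hot vector scores 0 while the uniform vector over K entries scores \(1-1/K\), so gradient steps that lower this term sharpen the assignments.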
\(E_3\): Sparsity and cluster merging To study the behavior of \(E_3\) during the optimization, let us consider a simple example with one observation and two clusters, i.e., \(\varvec{\Gamma }\equiv \varvec{\gamma }_{1}=(\gamma , 1-\gamma )\). If the observation is unambiguously assigned to one cluster, \(\varvec{\gamma }_{1}\) is a one-hot vector and \(E_3\) is null. If that is not the case, the difference between \(E_2\) and \(E_3\) factorizes as follows:
The optimization will thus either push \(\varvec{\gamma }_{1}\) toward a more sparse vector, or merge the two centroids. Appendix 3 presents an analysis of the role of this term.
\(E_4\): Balancing The Dirichlet prior steers the distribution of the cluster assignments. If none of the \(\alpha _k\) is null, the prior will push the optimization to use all the clusters, thus moderating the penchant of \(E_2\) for the trivial clustering (Yang et al., 2017; Guo et al., 2017).
Note that if \({\varvec{\alpha }} = \left( 1 + \frac{1}{K}\right) {\varvec{1}}_K\), \(E_4\) is, up to a constant, equal to the Kullback–Leibler (KL) divergence between a multinomial distribution with parameter \(\tilde{\varvec{\gamma }}\) and the uniform multinomial distribution:
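The balancing effect can be quantified with a standard KL helper; the function below (ours) is generic in its arguments, since which order matches the paper's convention depends on the exact form of the divergence above:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two multinomial parameter vectors p and q.
    Null iff p == q; here typically evaluated against the uniform vector."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))
```

A perfectly balanced average responsibility vector yields a null divergence from the uniform distribution; any imbalance makes the divergence, and hence the corresponding penalty, positive.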
4 Clustering and embedding with Autoencoders
Theorem 1 states that an autoencoder could be involved during the EM optimization of an isotropic GMM. We go one step further and bypass the EM to directly optimize the \({\mathcal {Q}}\)-function in Eq. (3) using an autoencoder.
4.1 The clustering module
We define the Clustering Module (CM) as the one-hidden-layer autoencoder with encoding and decoding functions \({\mathcal {F}}\) and \({\mathcal {G}}\) such that:
where \({\varvec{X}}\in {\mathbb {R}}^{N \times d} \sim {\mathcal {X}}\), the code representation/cluster responsibilities are \(\varvec{\Gamma }= \langle \gamma _{ik} \rangle _{N \times K} \in {\mathbb {R}}^{N \times K}\) s.t. \(\varvec{\gamma }_{i} \in {\mathbb {S}}^K\), and the reconstruction is \(\bar{{\varvec{X}}}\in {\mathbb {R}}^{N \times d}\). The weight and bias parameters of the encoder are \({\varvec{W}}_{\text {enc}} \in {\mathbb {R}}^{d \times K}\) and \({\varvec{B}}_{\text {enc}} \in {\mathbb {R}}^{K}\), respectively, and analogously for the decoder, \({\varvec{W}}_{\text {dec}} \in {\mathbb {R}}^{K \times d}\) and \({\varvec{B}}_{\text {dec}} \in {\mathbb {R}}^{d}\). The \({\text {softmax}}\) enforces the row-stochasticity of the code, i.e., of \(\varvec{\Gamma }\). The associated loss function is the negative of Eq. (3):
with \(\Theta =\Big ({\varvec{W}}_{\text {enc}},{\varvec{B}}_{\text {enc}},{\varvec{W}}_{\text {dec}},{\varvec{B}}_{\text {dec}}\Big )\). The centroids of the underlying GMM correspond to the images under \({\mathcal {G}}\) of the canonical basis of \({\mathbb {R}}^K\).
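A minimal NumPy sketch of the CM forward pass (names are ours; the paper's implementation uses TensorFlow):

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cm_forward(X, W_enc, b_enc, W_dec, b_dec):
    """Clustering module: the code Gamma is row-stochastic (responsibilities)
    and the reconstruction is a convex combination of the decoder rows."""
    Gamma = softmax(X @ W_enc + b_enc)  # (N, K) cluster responsibilities
    X_bar = Gamma @ W_dec + b_dec       # (N, d) reconstruction
    return Gamma, X_bar
```

With these shapes, the k-th centroid is the image of the k-th canonical basis vector under the decoder, i.e., the k-th row of `W_dec` shifted by `b_dec`.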
Initialization
The CM can be initialized using k-means or any initialization scheme thereof, such as k-means++ (Arthur & Vassilvitskii, 2007). In such a case, the rows of \({\varvec{W}}_{\text {dec}}\) are set equal to the desired centroids. The pseudo-inverse of this matrix becomes the encoder’s weights, \({\varvec{W}}_{\text {enc}}\), and both bias vectors are set to zero.
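Under the shapes defined above, the initialization can be sketched as follows (a NumPy sketch assuming the centroids form the rows of \({\varvec{W}}_{\text {dec}}\); function name ours):

```python
import numpy as np

def init_cm(centroids):
    """Initialize the CM from given centroids (e.g., from k-means++):
    decoder weights hold the centroids, encoder weights are their
    pseudo-inverse, and both biases are zero."""
    W_dec = np.asarray(centroids, dtype=float)  # (K, d), one centroid per row
    W_enc = np.linalg.pinv(W_dec)               # (d, K)
    K, d = W_dec.shape
    return W_enc, np.zeros(K), W_dec, np.zeros(d)
```

With linearly independent centroids, `W_dec @ W_enc` is the identity, so each centroid encodes to the corresponding canonical basis vector before the softmax.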
Averaging epoch
In practice, the CM is optimized by mini-batch learning with stochastic gradient descent. In such a procedure, the optimizer updates the positions of the centroids given the current batch. A small batch size relative to the size of \({\mathcal {X}}\) may cause dispersion of the intermediate centroids. Hence, choosing the final centroids based on the last iteration alone may be suboptimal.
We illustrate this phenomenon in Fig. 1. The data consists of \(N = 2000\) points in \({\mathbb {R}}^2\) drawn from a mixture of five bivariate Gaussians (\(K = 5\)) (gray dots). The data is standardized before processing. A CM is trained with mini-batches of size 20 over 50 epochs using stochastic gradient descent. The concentration is set to \(\varvec{\alpha }= {\varvec{5}}_K\). The dispersion of the centroids after each iteration of the last epoch (crosses) is significant. On the other hand, their average positions (squares) provide a good approximation of the true centers (circles). Therefore, we include one extra epoch in any implementation of the CM to compute the average positions of the individual centroids over the last iterations.
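The averaging epoch itself is a one-liner; a sketch (assuming the centroid matrix is snapshotted after each iteration of the extra epoch):

```python
import numpy as np

def averaging_epoch(snapshots):
    """Final centroids as the mean of the centroid matrices recorded over
    the iterations of the extra epoch, smoothing out mini-batch dispersion."""
    return np.mean(np.stack(snapshots), axis=0)
```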
4.2 Embedding with feature maps
The theory behind GMMs limits the CM to a linear decoder, thus enabling merely a linear partition of the input space. In addition, the isotropy assumption, specific to the CM, prevents clusters from spreading differently. We alleviate both limitations using an approach similar to that of kernel methods (Guyon et al., 1993): we nonlinearly project the input into a feature space where it will be clustered. However, we do not learn a kernel matrix, which only provides an implicit feature map. Instead, we learn the feature map explicitly using a deep autoencoder (DAE).
The idea is to optimize the CM and the DAE simultaneously, in order to let the latter find distortions of the input space along the way that guide the CM toward a better optimum. Using a deep autoencoder architecture prevents the optimization from producing degenerate feature maps (Guo et al., 2017). It also preserves the generative nature of the model: points in the input space can be generated from a combination of centroids in the feature space. We refer to this model as the AE-CM.
The model consists of a clustering module nested into a deep autoencoder. The architecture is illustrated in Fig. 2. The first part of the DAE encodes an input \({\varvec{x}}\in {\mathbb {R}}^d\) into a vector \({\varvec{z}}\in {\mathbb {R}}^p\). Note that the CM now operates on the code representation \({\varvec{z}}\) and not directly on the input \({\varvec{x}}\). The code \({\varvec{z}}\) is fed to the CM and to the decoder of the DAE, yielding two outputs: \(\tilde{{\varvec{z}}}\), the CM’s reconstruction of \({\varvec{z}}\), and \(\bar{{\varvec{x}}}\), the DAE’s reconstruction of \({\varvec{x}}\).
4.2.1 Adapting the loss function to a deep architecture
Empirical evaluation showed that current gradient descent optimizers (e.g., Adam, Kingma & Ba, 2015) often return suboptimal solutions when the reconstruction loss of the deep autoencoder is simply added to the loss of the CM. To help the optimization find better optima, we add the assumption that the centroids are orthonormal: \({\varvec{\mu }}\, {\varvec{\mu }}^\top = {\mathbf {I}}_K.\)
Although the previous formula involves only \({\varvec{\mu }}\), which is learned by the nested clustering module, it affects the surrounding DAE. Indeed, the constraint encourages the DAE to produce an embedding where the centroids can simultaneously be orthonormal and minimize the CM’s loss. As a consequence, \(E_2\) simplifies and \(E_3\) becomes null:
Note that the constraint Eq. (8) is satisfied for the “ideal” clustering, in which case clusters will be mapped to the corners of a simplex in the embedding space, as discussed recently in Kampffmeyer et al. (2019). In this perspective, our inclusion of this constraint helps guide the clustering towards the ideal clustering, and at the same time simplifies the loss function.
We employ Lagrange multipliers to integrate the orthonormality constraint. That way the final loss can be stated as follows:
where \(\lambda > 0\) is the Lagrange multiplier and \(\beta > 0\) weights the DAE’s reconstruction loss. We choose the \(\ell _1\)-norm to enforce orthonormality; however, other norms can be used.
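Assuming the constraint reads \({\varvec{\mu }}{\varvec{\mu }}^\top = {\mathbf {I}}_K\), the penalty term can be sketched as follows (names ours):

```python
import numpy as np

def ortho_penalty(mu, lam=1.0):
    """Element-wise l1 penalty pushing the centroid matrix mu of shape (K, p)
    toward orthonormal rows: lam * || mu mu^T - I_K ||_1."""
    K = mu.shape[0]
    return lam * float(np.abs(mu @ mu.T - np.eye(K)).sum())
```

The penalty is null exactly when the centroids are orthonormal, which also pins their norms to 1 and thus rules out collapsed centroids.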
Note that if the dimension of the embedding p is larger than the number of centroids K, the embedding can always be transformed to satisfy the orthonormality of the centroids. On the other hand, if \(K > p\), the assumption becomes restrictive also in terms of possible clusterings. Nevertheless, its importance can be reduced with a small \(\lambda\). This assumption also helps prevent the centroids from collapsing, as their norms are required to be 1. An analysis of the Lagrange multiplier is provided in Appendix 1.
The loss function of the AE-CM thus depends on four hyperparameters: the weight \(\beta >0\), the concentration \({\varvec{\alpha }}\in {\mathbb {S}}^K\), the Lagrange multiplier \(\lambda >0\), and the batch size \(B \in {\mathbb {N}}^*\).
4.2.2 Implementation details
Since the AE-CM builds upon the CM, any implementation also contains an averaging epoch. In case of pretraining, both subnetworks need to be initialized. We favor a straightforward end-to-end training of the DAE (without dropout or noise) over a few epochs. The clustering module is then initialized using k-means++ on the embedded dataset. Finally, the CM is optimized alone using \({\mathcal {L}}_{\text {CM}}\) for a few epochs.
5 Experiments
In this section, we evaluate the clustering module and the AE-CM on several data sets covering different types of data. To highlight the generality of our method, we rely only on fully connected architectures, i.e., we do not use convolutional layers even for image data sets. Accordingly, we focus on general-purpose baselines. The experiments were conducted on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 32 GB of RAM, supported by an NVIDIA(R) Tesla V100 SXM2 32GB GPU.
5.1 Experimental Setting
5.1.1 Datasets
We leverage eight common benchmark data sets in the deep clustering literature plus one synthetic data set:

MNIST LeCun et al. (1989) contains 70,000 handwritten images of the digits 0 to 9. The images are grayscale with the digits centered in the \(28\times 28\) images. The pixel values are normalized before processing.

fMNIST Xiao et al. (2017) contains 70,000 images of fashion products organized in 10 classes. The images are grayscale with the product centered in the \(28\times 28\) images. The pixel values are normalized before processing.

USPS^{Footnote 1} contains 9298 images of digits 0 to 9. The images are grayscale with size \(28\times 28\) pixels. The pixel values are normalized before processing.

CIFAR10 Krizhevsky et al. (2009) contains 60,000 color images of 10 classes of subjects (dogs, cats, airplanes...). Images are of size \(32 \times 32\). The pixel values are normalized before processing.

Reuters10k^{Footnote 2}, here abbreviated R10K, consists of 800,000 news articles. The dataset is preprocessed as in Guo et al. (2017) to return a subset of 10,000 random samples embedded into a 2,000-dimensional space (tf-idf transformation) and distributed over 4 (highly) imbalanced categories.

20News^{Footnote 3} contains 18,846 messages from newsgroups on 20 topics. Features consist of the tf-idf transformation of the 2000 most frequent words.

10 \(\times\) 73k Zheng et al. (2017) consists of 73,233 RNA transcripts belonging to 8 different cell types (Jang et al., 2017). The features consist of the \(\log\) of the gene expression variance of the 720 genes with the largest variance. The dataset is relatively sparse, with \(40\%\) of the entries null.

Pendigit Alimoglu and Alpaydin (1996) consists of 10,992 sequences of coordinates on a tablet captured as writers write digits, thus 10 classes. The dataset is normalized before processing.

5 Gaussians consists of \(N = 2000\) points in \({\mathbb {R}}^2\) drawn from a mixture of five bivariate Gaussians (\(K = 5\)). The dataset is depicted in Fig. 1.
5.1.2 Evaluation metrics
The clustering performance of each model is evaluated using three frequently used metrics: the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985), the Normalized Mutual Information (NMI) (Estévez et al., 2009), and the clustering accuracy (ACC) (Kuhn, 1955). These metrics range between 0 and 1, where 1 indicates a perfect clustering. For legibility, values are always multiplied by 100. For each table, scores not statistically different (t-test \(p<0.05\)) from the best score of the column are marked in boldface. The \(*\) indicates the model with the best run. A failure (−) corresponds to an ARI close to 0.
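ARI and NMI are available in standard libraries; ACC additionally requires matching cluster indices to class labels. A brute-force sketch of ours for small K (implementations typically use the Hungarian algorithm instead):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping between cluster
    indices and class labels (brute force over permutations, so only
    practical for small K)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # relabel predicted clusters
        acc = np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
        best = max(best, float(acc))
    return best
```

A clustering that is a pure permutation of the labels thus scores 1, as the metric is invariant to how cluster indices are numbered.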
5.1.3 Baselines
We include three baselines for the CM: k-means (KM), a GMM with full covariance, and an isotropic GMM (iGMM) with uniform mixture weights. The latter differs from the model the clustering module derives from, but a Dirichlet prior on the responsibilities yields intractable updates, and a Dirichlet prior on the mixture weights harms the performance.
We compare the AE-CM to four baselines reviewed in Sect. 2: DEC (Xie et al., 2016), its extension IDEC (Guo et al., 2017), DCN (Yang et al., 2017) and DKM (Fard et al., 2020). We also include the naive approach (AE+KM), consisting of a trained DAE followed by k-means on the embedding. We further add ClusterGAN (Mukherjee et al., 2019) and VIB-GMM (Uğur et al., 2020) as alternatives based on generative adversarial networks (Goodfellow et al., 2014) and variational autoencoders (Kingma & Welling, 2013), respectively.
Random and pretrained initializations are indicated with \({}^r\) and \({}^p\), respectively. If omitted, the initialization is random. Every experiment is repeated 20 times.
5.1.4 Implementation
Both the CM and the AE-CM are implemented using TensorFlow 2.1 (Abadi et al., 2016)^{Footnote 4}. We also reimplemented DEC, IDEC and DCN. All deep models but ClusterGAN and VIB-GMM use the same fully connected autoencoder d–500–500–2000–p–2000–500–500–d with leaky ReLU activations, where d and p are the input and feature space dimensions, respectively. For ClusterGAN and VIB-GMM, we used the architecture provided in the original code. Likewise, the DAE reconstruction loss is the mean square error, regardless of the dataset and of the model, except for VIB-GMM on images, which requires a cross-entropy loss (it underperforms otherwise). The CM and its baselines are trained for up to 150 epochs, deep models for 1000 epochs. The hyperparameters (batch size, p, concentration, etc.) are listed in Tables 12 and 13.
5.2 CM: evaluation
Recall that the loss of the clustering module is a lower bound of the objective function of its underlying isotropic GMM, which itself approximates k-means. Moreover, the optimization of the CM is based on gradient descent instead of EM. As a sanity check, we compare these three models and show that, despite the differences, they report similar clustering performance. We also present an ablation study of the loss and a model selection scheme for the CM. Analyses of the hyperparameters and of the Dirichlet prior are reported in Appendices 1 and 2, respectively.
5.2.1 CM: clustering performance
In this experiment, we compare clustering performances and initialization schemes. Random and k-means++ initializations are indicated with the superscripts \({}^r\) and \({}^p\), respectively. Each experiment is repeated 20 times and we report averages in Table 1. For each dataset, scores not statistically different (\(p<.05\)) from the highest score are marked in boldface. The \(*\) indicates the model with the best single run among its 20 runs. An extended table including standard deviations and best runs can be found in Table 11 of Appendix 4. Average runtimes for MNIST are reported in Table 2.
As expected, the clustering module performs similarly to the iGMM and k-means on every dataset, with respect to almost every metric and for any initialization scheme. The k-means++ initialization often does not improve the results; in the case of the clustering module, the difference is never significant except on the 20News dataset.
5.2.2 CM: runtimes
We compare here the runtimes of the different methods on MNIST. For a fair comparison, we do not use any early-stopping criterion and all the methods are run for exactly 150 epochs. Table 2 reports averages over 10 runs.
The EM-based models are clearly much faster. Interestingly, the GMM is slower than the iGMM, although they share the same implementation. This difference is certainly due to the extra computations needed to update the full covariance matrices.
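The cost of the covariance update can be reproduced with a small timing experiment. This is an illustrative stand-in (scikit-learn's EM on synthetic blobs, not the paper's implementation or datasets); absolute times depend on hardware, but fitting full covariance matrices involves a \(d \times d\) update per component versus a single variance for the spherical case.

```python
import time

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=2000, centers=10, n_features=50, random_state=0)

def fit_time(covariance_type):
    """Wall-clock time of one EM fit with the given covariance structure."""
    gmm = GaussianMixture(n_components=10, covariance_type=covariance_type,
                          max_iter=50, random_state=0)
    t0 = time.perf_counter()
    gmm.fit(X)
    return time.perf_counter() - t0

t_full = fit_time("full")            # ten 50x50 covariance matrices per M-step
t_spherical = fit_time("spherical")  # one variance per component
print(f"full: {t_full:.3f}s, spherical: {t_spherical:.3f}s")
```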
5.2.3 CM: ablation study of the loss
The loss function of the CM arises as a whole from the \({\mathcal {Q}}\) function of the underlying GMM. Nevertheless, for additional insight, we perform here an ablation study of its terms. We train a CM with different combinations of the terms of its original loss (Eq. (7)). To highlight the influence of each term, we focus on the 5 Gaussians dataset (depicted in Fig. 1). Table 3 reports the clustering performance in terms of ARI.
The gap in ARI between the CM trained with the complete loss (first line) and any other combination of its terms confirms the coherence of the loss. The model trained without the reconstruction term \(E_0\) reports the worst score. The ablation of single terms indicates that the reconstruction term \(E_0\) is the most important, followed by the sparsity term \(E_1\). Although the removal of \(E_3\) alone has the least impact, it is the only term that, combined with \(E_0\), reports better performance than a loss function made solely of the reconstruction term \(E_0\).
Figure 3 illustrates the behavior of each term of the loss of the CM (\({\mathcal {L}}_{CM}\)) during training. Given the variations of each curve, the optimization seems to focus during the first 25 epochs on minimizing the reconstruction term, even if this implies an increase of the other terms. However, the complete loss curve does not flatten out until \(E_3\) reaches its minimum. From there, a second phase begins where the total loss and the clustering metrics slowly evolve in opposite directions. Interestingly, as the metrics increase and the clustering improves, \(E_3\) also increases, which is contrary to the expected behavior. On the other hand, \(E_2\), which is of the same magnitude as \(E_3\) and is also influenced by the sparsity of the assignments, continuously decreases without reaching a minimum. Overall, these curves suggest that both \(E_2\) and \(E_3\) could be used either to stop the training early or to select the best run.
5.2.4 CM: model selection
In an unsupervised scenario, true labels are not available. It is thus necessary to have an internal measure to avoid selecting a suboptimal solution.
There are two natural choices: select either the clustering associated with the lowest loss or the least ambiguous clustering. In the first case, the sparsity of the cluster responsibilities \(\varvec{\gamma }_i\) might be eclipsed by other aspects optimized by \({\mathcal {L}}_{\text {CM}}\), such as the reconstruction term. On the other hand, by selecting based on sparsity alone, we may end up always choosing the most degenerate clustering. Leaning on the analysis of Fig. 3, we propose to use \({\mathcal {L}}_{\text {sp}}=E_2+E_3\), which sums the two terms of the loss governing the sparsity of the \(\varvec{\gamma }_i\)s but also involves the norm of the \({\varvec{\mu }}_k\)s.
In Table 4, we report the ARI of the runs with the lowest \({\mathcal {L}}_{\text {sp}}\) for each dataset. For comparison, we also report the average and the largest ARI. Scores selected according to \({\mathcal {L}}_{\text {sp}}\) that were higher than the average are marked in boldface.
A model selection based on \({\mathcal {L}}_{\text {sp}}\) finds the best run only for R10K. Nevertheless, it selects runs with an ARI greater than the average in more than half of the cases. In the other cases, the difference to the average score remains below 1 ARI point, except for CM\({}^p\) on 20News. The average absolute difference with the best score is 1.60 for CM\({}^r\) and 3.10 for CM\({}^p\). Without 20News, on which CM\({}^p\) performs the worst, that average difference drops to 2.25 for CM\({}^p\). These satisfying results substantiate our heuristic that \({\mathcal {L}}_{\text {sp}}=E_2+E_3\) can be used as an internal metric for the CM.
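The selection protocol itself is generic: run the model several times, then keep the run minimizing an internal (label-free) criterion. The sketch below illustrates it with k-means, using the inertia as a stand-in for \({\mathcal {L}}_{\text {sp}}\) (which is internal to the CM and defined by Eq. (7) of the paper); the dataset and the choice of criterion are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=500, centers=10, random_state=0)

# 20 runs with different random initializations, as in the paper's protocol.
runs = [KMeans(n_clusters=10, n_init=1, init="random", random_state=s).fit(X)
        for s in range(20)]

# Model selection: keep the run with the lowest internal criterion
# (inertia here stands in for L_sp, which requires no true labels).
best = min(runs, key=lambda km: km.inertia_)

# True labels are used only a posteriori, to evaluate the selection.
aris = [adjusted_rand_score(y, km.labels_) for km in runs]
selected_ari = adjusted_rand_score(y, best.labels_)
print(f"selected: {selected_ari:.3f}, "
      f"average: {np.mean(aris):.3f}, best: {max(aris):.3f}")
```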
5.3 AECM: evaluation
In this section, we compare the clustering performance of our novel deep clustering model, the AECM, against a set of baselines. We study the robustness of the model with respect to the number of clusters, as well as a model selection scheme. We also evaluate the quality of the embeddings through k-means clusterings thereof. Finally, we review the generative capabilities of our model.
5.3.1 AECM: clustering performance
We now compare clustering performances and initialization schemes for representative deep clustering models. We report average ARI, NMI and ACC over 20 runs in Table 5. An extended table including standard deviations and best runs can be found in Table 15 of Appendix 3.
In their original papers, DEC, IDEC and DCN are pretrained (\({}^p\)). We report here slightly lower scores, which we ascribe to our implementation and slightly different architectures. Nonetheless, the take-home message is the consistency of their poor results when randomly initialized. This reflects an inability to produce clusters from scratch, regardless of the implementation. Note that, in that case, even k-means outperforms all of them on all the datasets (see Table 1). On the other hand, DKM does return competitive scores for both initialization schemes. Such results raise the question of how much clustering is actually performed by DEC, IDEC and DCN, and how much is due to the pretraining phase.
In practice, DKM proved sensitive to the choice of its \(\lambda\) hyperparameter and to the duration of the optimization. For example, we could not find a value able to cluster Pendigit. We conjecture that expanding the clustering term of DKM’s loss, as we did between Eqs. (1) and (3), would improve the robustness of the model.
On six of the datasets, at least one of the variants of the AECM reports the highest average or the best single run. In particular, AECM\({}^r\) produces competitive clusterings on all datasets except CIFAR10, despite its random initialization. This setback is expected, as clustering models are known to fail to cluster color images from the raw pixels (Jiang et al., 2016; Hu et al., 2017).
Moreover, our AECM with random initialization always surpasses AE+KM, except on R10K and CIFAR10, and even outperforms all the competitors by at least 20 ARI points on 20News. On the downside, AECM\({}^r\) is associated with large standard deviations, which implies a less predictable optimization (see Table 15 of Appendix 3). We therefore investigate an internal measure to select the best run.
5.3.2 AECM: runtimes
We compare here the runtimes of the different methods on MNIST. For a fair comparison, all the methods use the same batch size of 256 instances and no early-stopping criterion is used. Table 6 reports averages over 10 runs.
Note that our implementations of DEC, IDEC and DCN are based on that of the AECM; hence these are the most comparable. The advantage goes to the jointly optimized model, the AECM, which is more than 5 min faster. ClusterGAN is the slowest method; it also has the most complex architecture.
5.3.3 AECM: robustness to the number of clusters
In the previous experiments, we provided the true number of clusters to all algorithms for all datasets. In this experiment, we investigate the behavior of the AECM when it is set with a different number of clusters on four datasets: MNIST, USPS, R10K and Pendigit. Figure 4 shows the evolution of the ARI (left) and of the homogeneity score (Rosenberg & Hirschberg, 2007) (right). The latter measures the purity of each cluster in terms of true labels. The number of clusters varies from 5 to 20; the correct value for all datasets is 10.
The ARI curves (left plot) reach their maximum at 10 and then decrease. This behavior is expected, since this metric (as well as NMI and ACC) penalizes the number of clusters. On the other hand, the homogeneity curves (right plot) increase with K and stabilize for K larger than 10. The convergence of these curves indicates that the clustering performance of the AECM does not degrade if K is set larger than the ground truth. Such a result suggests that, when K is larger than the ground truth, the AECM finds solutions that are partitions of those found with smaller K. This phenomenon is illustrated in Appendix 3.
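The opposite behaviors of ARI and homogeneity with respect to over-partitioning can be reproduced on synthetic data. The sketch below uses k-means on ten well-separated blobs as a stand-in for the AECM (the grid of centers and the standard deviation are illustrative assumptions): ARI peaks at the true K and drops when clusters are split, whereas homogeneity keeps increasing because splits of pure clusters remain pure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Ten well-separated ground-truth clusters on a grid.
centers = [(4 * i, 4 * j) for i in range(5) for j in range(2)]
X, y = make_blobs(n_samples=1000, centers=centers, cluster_std=0.4,
                  random_state=0)

ari, hom = {}, {}
for k in (5, 10, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ari[k] = adjusted_rand_score(y, labels)   # penalizes merging AND splitting
    hom[k] = homogeneity_score(y, labels)     # only rewards purer clusters
print(ari)
print(hom)
```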
5.3.4 AECM: model selection
Similarly to the CM, we discuss here a model selection heuristic for the AECM. The rationale behind the use of a DAE is to have an encoding facilitating the objective of the clustering module. Hence, we propose to use the heuristic of the CM (Sect. 5.2.4). In Table 7, we report the ARI of the runs with the lowest \({\mathcal {L}}_{\text {sp}}\) for each dataset. For comparison, we also report the average and best ARI. Selected scores greater than the average are marked in boldface.
Again, scores associated with the lowest \({\mathcal {L}}_{\text {sp}}\) are better than the average more than half of the time. The criterion detects the best runs of AECM\({}^r\) on MNIST and CIFAR10, and of AECM\({}^p\) on MNIST. The average absolute difference to the highest score is 6.5 ARI points for AECM\({}^r\) and 6.6 for AECM\({}^p\). In summary, although \({\mathcal {L}}_{\text {sp}}\) does not necessarily select the best run, it filters out the worst runs.
5.3.5 AECM: embeddings for kmeans
All baselines, including our model, are nonlinear extensions of k-means and aim to improve over AE+KM. We audit the methods by running k-means on the embeddings produced by the 20 randomly initialized runs on MNIST computed for Table 5, and report the average ARI, NMI and ACC in Table 8.
First, KM reports the worst results. This confirms that applying k-means on a feature space learned by an autoencoder improves the quality of the clustering. Next, the results clearly show the superiority of the methods using a joint optimization, i.e., DKM\({}^r\)+KM and our AECM\({}^r\)+KM. Interestingly, the scores of DCN\({}^r\)+KM are here better than those of DCN\({}^r\). This discrepancy is certainly due to the moving average used to update the centroids.
We continue the analysis of the embeddings with UMAP (McInnes et al., 2018) projections. Figure 5 depicts the projections of different embeddings of the same 2000 data points. Figure 5a represents the UMAP projection from the input space. For (5b), we used the best run of AE+KM. For consistency, we used that embedding for the AECM\({}^p\); hence, Fig. 5c does not show the best run of the AECM\({}^p\). Finally, (5d) is based on the best run of the AECM\({}^r\).
The projection of MNIST from the input space (5a) has two pairs of classes entangled: (3, 5) and (4, 9). The end-to-end training of the DAE (5b) successfully isolates each class except for cluster 4 (dark pink) and cluster 9 (light blue), which stay grouped together although separable. The AECM\({}^p\) (5c) further contracts the clusters around the centroids found by AE+KM, but fails to separate 4 and 9. Remark that even the best run of the AECM\({}^p\) does not correctly split these data points. The centroids for 4 and 9 in Fig. 5b and c are in comparable positions: they align along the gap separating the true clusters. This suggests that the optimization of the AECM\({}^p\) did not move them much; this remark applies to the pretrained baselines as well. Lastly, the AECM\({}^r\) successfully produces homogeneous groups (5d). Remark that the original entanglements of the pairs (3, 5) and (4, 9) are suggested by the trails between the corresponding clusters.
The previous observations summarize into two insights on the behavior of the AECM. If the AECM starts with an embedding associated with a low reconstruction loss for the DAE, the optimization contracts the clusters, which yields higher ARI scores; however, it is unable to move the centroids to reach another local optimum. Although the AECM\({}^r\) separates (4, 9), it also produces clusters more spread out than those of the AECM\({}^p\). The improved performance of the latter over AE+KM indicates that the AECM\({}^r\) would benefit from tighter groups.
5.3.6 AECM: sample generation and interpolation
Thanks to the reversible feature maps learned by the DAE, both the AECM and its baselines (except DEC) are generative models. Figure 6 shows the decoding of the centroids of the best runs of AE+KM (ARI=67.7), IDEC\({}^p\) (ARI=77.2), DKM\({}^r\) (ARI=83.6) and AECM\({}^r\) (ARI=88.6). The centroids of AE+KM and IDEC\({}^p\) for the 4 and 9 both look like 9’s. With an ARI and an ACC larger than 80 and 90, respectively, DKM\({}^r\) and AECM\({}^r\) both clustered the data correctly and found correct centroids for each class. Both models produce clear images for each class, which align reasonably well with the washed-out average image of the respective classes (first row).
Being a generative model, the AECM can also be used to interpolate between classes. Figure 7 shows a path made of nine interpolations between the ten centroids of the AECM\({}^r\). We observe smooth transitions between all the pairs, which indicates that the model learned a smooth manifold from noise (random initialization).
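The interpolation mechanism is simple: take a linear path between two centroids in the feature space and decode each intermediate point with the DAE's decoder. The sketch below shows the latent-space part with hypothetical 10-dimensional centroids (the decoding step, omitted here, would map each point of the path back to an image).

```python
import numpy as np

def interpolate(z_a, z_b, steps=9):
    """Linear path in the latent space between two centroids; each point
    on the path would then be passed through the decoder to generate a sample."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - t) * z_a + t * z_b for t in ts])

# Hypothetical centroids mu_0 and mu_1 of two clusters (illustration only).
rng = np.random.default_rng(0)
mu_0, mu_1 = rng.normal(size=10), rng.normal(size=10)
path = interpolate(mu_0, mu_1)   # nine points, endpoints included
```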
6 Conclusion
We presented a novel clustering algorithm that is jointly optimized with the embedding of an autoencoder to allow for nonlinear and interpretable clusterings. As a key result, we first showed that the objective function of an isotropic GMM can be turned into a loss function for autoencoders. The clustering module (CM), defined as the smallest resulting network, was shown to perform similarly to its underlying GMM in extensive empirical evaluations.
Importantly, we showed how the clustering module can be straightforwardly incorporated into deep autoencoders to allow for nonlinear clusterings. The resulting clustering network, the AECM, empirically outperformed existing centroid-based deep clustering architectures and performed on par with representative contemporary state-of-the-art deep clustering strategies. Nevertheless, the AECM, and to a lesser extent the clustering module itself, presented a greater volatility when trained from a randomly initialized network. We expect that this point could be improved by an annealing strategy on the parameter, similar to what is done in DKM and VIB-GMM.
A future line of work consists of extending the panel of deep architectures into which the clustering module can be nested. To improve performance on image datasets, especially, it is necessary to involve convolutions. However, standard image-specific architectures are not structured as autoencoders. This raises the question of the robustness of our model with respect to the symmetry of the DAE, especially for applications where the computation of class representatives is not a must.
From a theoretical point of view, we believe that the derivations that led to the neural interpretation of Gaussian mixture models could benefit other mixture models, such as von Mises-Fisher mixture models (Hasnat et al., 2017) or hidden Markov models (HMM). The case of Gaussian-HMMs seems especially promising, as it allows a bridge with recurrent networks (Salaün et al., 2019).
Notes
Code available at: https://github.com/AhceneB/clusteringModule.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283).
Alemi, A. A., Fischer, I., Dillon, J. V., & Murphy, K. (2016). Deep variational information bottleneck. http://arxiv.org/abs/1612.00410
Alimoglu, F., & Alpaydin, E. (1996). Methods of combining multiple classifiers based on different representations for pen-based handwritten digit recognition. In Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium.
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (pp. 1027–1035).
Bezdek, J. C. (2013). Pattern recognition with fuzzy objective function algorithms. Berlin: Springer.
Bianchi, F. M., Grattarola, D., & Alippi, C. (2020). Spectral clustering with graph neural networks for graph pooling. In Proceedings of the 37th of the International Conference on Machine Learning (ICML) (pp. 874–883).
Bishop, C. M. (2006). Pattern recognition and machine learning (Information science and statistics). Berlin, Heidelberg: SpringerVerlag.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.
Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 132–149).
Chang, J., Wang, L., Meng, G., Xiang, S., & Pan, C. (2017). Deep adaptive image clustering. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 5879–5887).
Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04).
Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., & Shanahan, M. (2016). Deep unsupervised clustering with Gaussian mixture variational autoencoders. http://arxiv.org/abs/1611.02648
Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact wellseparated clusters. Journal of Cybernetics, 3(3), 32–57.
Estévez, P. A., Tesmer, M., Perez, C. A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Trans. Neural Netw., 20(2), 189–201.
Fard, M. M., Thonet, T., & Gaussier, E. (2020). Deep k-means: Jointly clustering with k-means and learning representations. Pattern Recogn. Lett., 138, 185–192.
Frandsen, P. B., Calcott, B., Mayer, C., & Lanfear, R. (2015). Automatic selection of partitioning schemes for phylogenetic analyses using iterative k-means clustering of site rates. BMC Evolut. Biol., 15(1), 1–17.
Gasch, A. P., & Eisen, M. B. (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol., 3(11), 1–22.
Dizaji, K. G., Herandi, A., Deng, C., Cai, W., & Huang, H. (2017). Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 5736–5745).
Goodfellow, I. J., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks. http://arxiv.org/abs/1406.2661
Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved Deep Embedded Clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI) (pp. 1753–1759).
Guo, X., Liu, X., Zhu, E., & Yin, J. (2017). Deep Clustering with Convolutional Autoencoders. In Neural Information Processing (pp. 373–382).
Guyon, I., Boser, B., & Vapnik, V. (1993). Automatic capacity tuning of very large VCdimension classifiers. In Advances in Neural Information Processing Systems (NIPS) (pp. 147–155).
Haeusser, P., Plapp, J., Golkov, V., Aljalbout, E., & Cremers, D. (2018). Associative deep clustering: Training a classification network with no labels. In German Conference on Pattern Recognition (pp. 18–32).
Hasnat, M. A., Bohné, J., Milgram, J., Gentric, S., & Chen, L. (2017). von Mises-Fisher mixture model-based deep learning: Application to face verification. http://arxiv.org/abs/1706.04264
Hennig, C., & Liao, T. F. (2013). How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. J. Royal Stat. Soc.: Ser. C (Appl. Stat.), 62(3), 309–369.
Hu, W., Miyato, T., Tokui, S., Matsumoto, E., & Sugiyama, M. (2017). Learning discrete representations via information maximizing selfaugmented training. In Proceedings of the 34th International Conference on Machine Learning (ICML) (pp. 1558–1567).
Hubert, L., & Arabie, P. (1985). Comparing partitions. J. Classif., 2(1), 193–218.
Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. http://arxiv.org/abs/1611.01144
Ji, X., Henriques, J. F., & Vedaldi, A. (2019). Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 9865–9874).
Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2017). Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI) (pp. 1965–1972).
Kampffmeyer, M., Løkse, S., Bianchi, F. M., Livi, L., Salberg, A. B., & Jenssen, R. (2019). Deep divergence-based approach to clustering. Neural Netw., 113, 91–101.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. http://arxiv.org/abs/1412.6980
Kingma, D. P., & Welling, M. (2013). Autoencoding variational bayes. http://arxiv.org/abs/1312.6114
Krishna, K., & Murty, M. N. (1999). Genetic k-means algorithm. IEEE Trans. Syst. Man Cybern. Part B, 29(3), 433–439.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical Report TR2009, University of Toronto, Toronto.
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Res. Logist. Quarterly, 2(1–2), 83–97.
Kulis, B., & Jordan, M. I. (2012). Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML) (pp. 1131–1138).
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4), 541–551.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Trans. Inform. Theory, 28(2), 129–137.
Lücke, J., & Forster, D. (2019). k-means as a variational EM approximation of Gaussian mixture models. Pattern Recogn. Lett., 125, 349–356.
Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res., 9, 2579–2605.
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015). Adversarial autoencoders. http://arxiv.org/abs/1511.05644
McConville, R., SantosRodriguez, R., Piechocki, R. J., & Craddock, I. (2021). N2D: (not too) deep clustering via clustering the local manifold of an autoencoded embedding. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 5145–5152).
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. http://arxiv.org/abs/1802.03426
Miklautz, L., Mautz, D., Altinigneli, M. C., Böhm, C., & Plant, C. (2020). Deep embedded nonredundant clustering. Proc. AAAI Conf. Artif. Intell., 34, 5174–5181.
Mukherjee, S., Asnani, H., Lin, E., & Kannan, S. (2019). Clustergan: Latent space clustering in generative adversarial networks. Proc. AAAI Conf. Artif. Intell., 33, 4610–4617.
Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: Review and suggestions for application. J. Marketing Res., 20(2), 134–148.
Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 410–420).
Salaün, A., Petetin, Y., & Desbouvries, F. (2019). Comparing the modeling powers of RNN and HMM. In 2019 18th IEEE International Conference On Machine Learning and Applications (ICMLA) (pp. 1496–1499).
Springenberg, J. T. (2015). Unsupervised and semi-supervised learning with categorical generative adversarial networks. http://arxiv.org/abs/1511.06390
Tian, F., Gao, B., Cui, Q., Chen, E., & Liu, T. Y. (2014). Learning deep representations for graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 1293–1299).
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. http://arxiv.org/abs/physics/0004057
Uğur, Y., Arvanitakis, G., & Zaidi, A. (2020). Variational information bottleneck for unsupervised clustering: Deep gaussian mixture embedding. Entropy, 22(2), 213.
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., Van Gool, L. (2020). Scan: Learning to classify images without labels. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 268–285).
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. http://arxiv.org/abs/1708.07747
Xie, J., Girshick, R., Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML) (pp. 478–487).
Yang, B., Fu, X., Sidiropoulos, N. D., & Hong, M. (2017). Towards K-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML) (pp. 3861–3870).
Yang, J., Parikh, D., Batra, D. (2016). Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5147–5156).
Yang, L., Cheung, N.M., Li, J., Fang, J. (2019). Deep clustering by gaussian mixture variational autoencoders with graph embedding. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 6440–6449).
Yang, X., Deng, C., Wei, K., Yan, J., & Liu, W. (2020). Adversarial Learning for Robust Deep Clustering. In Advances in Neural Information Processing Systems (NeurIPS) (pp. 9098–9108).
Zhang, T., Ji, P., Harandi, M., Huang, W., Li, H. (2019). Neural collaborative subspace clustering. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 7384–7393).
Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1), 1–12.
Funding
Open access funding provided by UiT The Arctic University of Norway (incl University Hospital of North Norway).
Editors: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.
Appendices
Appendix 1: additional experiments
1.1 UCI datasets
In this section, we report the clustering performance of iGMM, DEC, IDEC, DCN and DKM, as well as our models CM and AECM, on a selection of UCI datasets. We consider here only random initialization. Deep clustering models share the same architecture: \(d\)–\(2\times k\)–\(d\), where d is the dimension of the input and k the number of clusters. The k-means clusterings of DEC and IDEC are updated every 5 epochs and \(\gamma =0.1\). The \(\lambda\) hyperparameters of DCN and DKM are set to 1 and 0.001, respectively. For the CM and the AECM, the three hyperparameters \((\alpha ,\beta ,\lambda )\) were learned using Bayesian optimization. We report average ARI, NMI and ACC over 10 runs (Table 9).
1.2 Scikitlearn benchmark
Figure 8 reports clusterings of the scikit-learn toy datasets. We compare here iGMM, CM and AECM. The CM is optimized using SGD, while the AECM relies on Adam. Training is stopped if the clusterings of two successive iterations differ by less than \(0.1\%\). The models are run 10 times for up to 100 epochs with a fixed batch size of 20 instances. The average runtime is shown in the lower right corner of each plot (Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz). The parameters for each dataset are given in Table 10.
The standard way to split the two circles and the two moons datasets is to use a polynomial kernel of degree 2 and of degree at least 3, respectively.
To highlight the relationship between embeddings and feature maps, we choose for these datasets specific architectures for the deep autoencoder of the AECM. For the two moons dataset, we want the AECM to learn an embedding function approximating a polynomial of degree at least 3. Therefore, we use 3 layers with 20 units followed by a layer with a single unit. The decoder is the mirror of the encoder. For the two circles dataset, the encoder consists of a quadratic layer followed by a dense layer with a 3-dimensional output. That way, the function associated with the encoder is a feature map for a polynomial kernel of degree 2, and the training just has to find the proper weights of the quadratic function. As for the decoder, it consists of a single layer. The AECM successfully clustered the two circles and the two moons datasets, suggesting that it indeed learned embedding functions associated with polynomial kernels of degree 2 and of degree at least 3, respectively.
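To see why a degree-2 feature map suffices for the two circles, the sketch below maps each 2-d point to the three degree-2 monomials \((x_1^2, x_2^2, x_1 x_2)\). This is our illustrative choice of a 3-dimensional quadratic feature map (the exact quadratic layer used in the paper is defined in its equation, not reproduced here): in this space, the circle equation \(x_1^2 + x_2^2 = r^2\) becomes a plane, so a single linear threshold separates the two rings.

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=400, factor=0.4, noise=0.03, random_state=0)

def quadratic_features(X):
    """Degree-2 monomials (x1^2, x2^2, x1*x2): a 3-dimensional feature map
    of the polynomial kernel of degree 2 (illustrative choice)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 ** 2, x2 ** 2, x1 * x2], axis=1)

Z = quadratic_features(X)
# In this space the circles are linearly separable: thresholding
# z1 + z2 = x1^2 + x2^2 (the squared radius) splits the two rings.
radius2 = Z[:, 0] + Z[:, 1]
pred = (radius2 < radius2.mean()).astype(int)   # inner ring -> class 1
accuracy = max((pred == y).mean(), (pred != y).mean())
print(f"linear separation accuracy in feature space: {accuracy:.3f}")
```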
Finally, since the last dataset consists of a square filled with a single class, we chose an \(\alpha\) lower than 1 for both the CM and the AECM. Such a setting informs the model that the three clusters will be very imbalanced. The two models reacted differently: only the AECM was able to perfectly assign all the points to a single cluster.
Appendix 2: supplementary materials related to the clustering module
1.1 CM: hyperparameters
The clustering module depends on two hyperparameters: the size of the minibatches and the concentration of the Dirichlet prior. To visualize their influence on the clustering performance, we trained a CM on Pendigit with various batch sizes and concentrations. Figure 9 shows the variation of the final ARI score for both a random (left) and a k-means++ initialization (right). Both axes’ scales are logarithmic: exponential base for the x-axis and base 10 for the y-axis. Each combination is run once; the ARI of each dot is the average of the nine neighboring combinations. Black dots indicate an average ARI greater than the ones reported in Table 1.
There is a lower bound on \(\alpha\) below which the optimization of a randomly initialized model underperforms or fails. The k-means++ initialization removes this lower bound and spreads out the well-performing area. In both settings, the shape of the well-performing region suggests that the hyperparameters can be tuned by fitting a bivariate Gaussian distribution.
1.2 CM: asymmetric prior
So far we have considered only symmetric Dirichlet priors (\(\varvec{\alpha }= \alpha {\varvec{1}}_K, \, \alpha \in {\mathbb {R}}^+\)), regardless of the imbalance between the labels. Here, we repeat the previous experiment using the true label distribution as the prior, i.e., \(\varvec{\alpha }= \alpha {\varvec{f}}\), where \({\varvec{f}} = \langle f_k\rangle _K \in {\mathbb {S}}^K\) is the frequency of each label. In terms of implementation, \(E_4\) is computed by sorting both \({\varvec{\alpha }}\) and \(\varvec{\gamma }_i\). We evaluate the results on the \(10\times 73\)k and Pendigit datasets, which have unbalanced and balanced classes, respectively. We consider here only random initializations. Again, black dots indicate an average ARI greater than the one reported in Table 1.
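Constructing the asymmetric prior from the labels amounts to scaling the empirical class frequencies. The sketch below shows only this construction (the helper name is ours); the sorting of \({\varvec{\alpha }}\) and \(\varvec{\gamma }_i\) inside \(E_4\) is not reproduced here, as its exact form is defined in the main text.

```python
import numpy as np

def asymmetric_prior(labels, alpha):
    """Build the Dirichlet prior alpha * f from the empirical label
    frequencies f, where f lies on the simplex S^K."""
    counts = np.bincount(labels)
    f = counts / counts.sum()
    return alpha * f

y = np.array([0, 0, 0, 1, 1, 2])        # toy, unbalanced labels
prior = asymmetric_prior(y, alpha=6.0)
assert np.allclose(prior, [3.0, 2.0, 1.0])
```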
Figure 10b contains more black dots and a larger red area than Fig. 10a, and the changes are greater than those between Figs. 10c and 9b. This discrepancy between the datasets illustrates that unbalanced datasets benefit more from a custom prior. However, a higher concentration is needed to enforce the distribution: the lower bound on \(\alpha\) is higher in Fig. 10b and c than in Figs. 10a and 9a, respectively. Using the true class distribution, especially if the data is unbalanced, does ease the hyperparameter selection. Nevertheless, such information is not always known a priori.
1.3 CM: merging clusters with \(E_3\)
We claimed in Sect. 3.2 that \(E_3\) favors the merging of clusters. To illustrate this phenomenon, we train CM\({}^r\) on the 5 Gaussians dataset with twice the true number of clusters (i.e., \(K=10\)). We compare three variants of CM's loss function: without \(E_3\), with \(E_3\), and with \(E_3\) multiplied by 1.5. The final centroids and clusterings are depicted in Fig. 11. For legibility, overlapping centroids are slightly shifted using Gaussian noise.
A vanilla CM (Fig. 11b) correctly positions five pairs of centroids on top of the true cluster centroids. Without \(E_3\) (Fig. 11a), the model fails to merge the clusters properly: while five centroids lie close to the five true clusters, the remaining five gather around 0. Conversely, if \(E_3\) is weighted more strongly (Fig. 11c), the model becomes so prone to merging clusters that it partitions the left cloud using only two groups of centroids.
1.4 CM: clustering performance
Table 11 contains the full clustering results for CM, including standard deviation and the best run.
1.5 CM: empirical setting
For the experiments reported in Sect. 5.2, the clustering module is trained over 150 epochs using the Adam optimizer (learning rate 0.001). The concentration \(\alpha\) and batch size B used for each dataset are reported in Table 12. The hyperparameters were optimized using Bayesian optimization over 2000 iterations.
Appendix 3: supplementary materials for AECM
1.1 AECM: hyperparameters
Besides the architecture of the DAE, AECM has two more hyperparameters than CM: \(\beta\), weighting the reconstruction of the DAE, and the Lagrange coefficient \(\lambda\), enforcing the orthonormality of the centroids. To visualize their influence on the clustering performance, an AECM is trained on MNIST with various values of \(\beta\) and \(\lambda\). Each combination is repeated five times. Note that when \(\beta =0\), the setting is equivalent to training only the encoder of the DAE (akin to DEC), and when \(\lambda =0\), the orthonormality constraint is omitted. Figure 12 represents the average ARI scores for each combination and both initialization schemes as a heatmap.
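The two weighted terms can be sketched as follows. The concrete functional forms below (mean-squared reconstruction error and a squared Frobenius penalty on \(MM^\top - I\)) are assumptions for illustration; the exact terms are defined in the main text.

```python
import numpy as np

def aecm_penalties(X, X_rec, M, beta, lam):
    """Sketch of the two weighted terms (assumed forms):
    - beta scales the DAE's reconstruction error;
    - lam penalizes the deviation of the centroid matrix M (K x p)
      from row-orthonormality."""
    rec = beta * np.mean((X - X_rec) ** 2)
    K = M.shape[0]
    ortho = lam * np.sum((M @ M.T - np.eye(K)) ** 2)
    return rec, ortho

# Orthonormal centroid rows and perfect reconstruction give zero penalties.
M = np.eye(3)
rec, ortho = aecm_penalties(np.zeros((4, 3)), np.zeros((4, 3)), M, 1.0, 1.0)
assert rec == 0.0 and ortho == 0.0
```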
With a random initialization (Fig. 12a), the clustering fails if both \(\beta\) and \(\lambda\) are not large enough, except when \(\beta =1\). In that case, the model performs well for every value of \(\lambda\), even for \(\lambda = 0\), i.e., without the orthonormality constraint. Conversely, AECM\({}^r\) always fails if trained without the reconstruction of the DAE (\(\beta =0\)). As the order of magnitude of both parameters increases, the performance worsens.
The distribution for AECM\({}^p\) (Fig. 12b) presents similarities with the previous one. Overall, the average performance is better for each combination; when \(\beta\) is very small, the ARI exceeds 0.5. The performance also decreases as \(\beta\) and \(\lambda\) become larger. Most noticeably, the band around \(\beta =1\) is still there, but it is thicker. This is in line with the similar analysis on CM (Sect. 1): pretrained models are less sensitive to hyperparameters.
1.2 AECM: empirical setting
For the experiments reported in Sect. 5.3, AECM is trained over 150 epochs using the Adam optimizer (learning rate 0.001). Each layer of the DAE is activated with a leaky ReLU with a slope of 0.2, except for the last layers of the encoder and of the decoder. AECM depends on four hyperparameters: the weight \(\beta >0\), the concentration \({\varvec{\alpha }}\in {\mathbb {S}}^K\), the Lagrange multiplier \(\lambda >0\), and the batch size \(B \in {\mathbb {N}}^*\). These four hyperparameters, plus the dimension of the feature space p, were optimized using Bayesian optimization over 2000 iterations. The selected values are reported in Table 13.
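The activation scheme above can be sketched as a plain forward pass: every hidden layer uses a leaky ReLU with slope 0.2, while the output layer stays linear. The function names, layer sizes, and the omission of biases are illustrative assumptions.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    """Leaky ReLU with the slope used in the experiments (0.2)."""
    return np.where(z > 0, z, slope * z)

def encode(X, weights):
    """Encoder forward pass: leaky ReLU on every layer except the
    last one, which is linear (it outputs the p-dimensional embedding).
    Biases are omitted for brevity."""
    h = X
    for W in weights[:-1]:
        h = leaky_relu(h @ W)
    return h @ weights[-1]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 10))]
Z = encode(rng.normal(size=(5, 8)), weights)
assert Z.shape == (5, 10)
```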
The same architecture is used for the baselines, except for DKM, where the activations are all \(\mathrm {ReLU}\). DEC, IDEC and DCN update their clustering every u iterations; IDEC and DCN rely on a hyperparameter \(\gamma\), and DKM on a \(\lambda\). Regarding DKM, the annealed \({\text {softmax}}\) parameter is updated every 5 epochs. We report in Tables 13 and 14 the values used for each dataset.
1.3 AECM: clustering performance
Table 15 contains the full clustering results for AECM, including standard deviation and the best run.
Boubekki, A., Kampffmeyer, M., Brefeld, U. et al. Joint optimization of an autoencoder for clustering and embedding. Mach Learn 110, 1901–1937 (2021). https://doi.org/10.1007/s10994-021-06015-5