1 Introduction

The field of artificial intelligence (AI) advanced significantly in the previous decade due to developments in deep learning (LeCun et al., 2015). In the early years of this field, deep learning methods exhibited stellar supervised learning performance, where each data sample was coupled with a ground truth (labeled data), e.g., each image was associated with a category. Unfortunately, generating labeled datasets is time consuming and expensive, and there may not be enough experts to label the data at hand (e.g., medical images). For such large-scale problems, a straightforward alternative is clustering, which requires no labels.

In this work, we focus on representation learning for the unsupervised learning task of clustering images. Clustering is a ubiquitous task and has been actively used in many different scientific and practical pursuits (Frey & Dueck, 2007; Masulli & Schenone, 1999; Jain et al., 1999; Xu & Wunsch, 2005). Clustering algorithms by themselves, however, do not learn representations and are hence limited to data for which a good representation is already available.

Advancements in deep learning techniques have enabled the end-to-end learning of rich image representations for supervised learning. For the purposes of clustering, however, such features learned via supervised learning cannot be obtained due to the lack of available labels. Therefore, supervised learning approaches fall short of providing a solution. Self-supervised learning addresses the issue of learning representations without labeled data. Self-supervised learning is a subfield of unsupervised learning in which the main goal is to learn general-purpose representations by exploiting user-defined tasks (pretext tasks) Wu et al. (2018); Zhuang et al. (2019); He et al. (2020); Chen et al. (2020); Grill et al. (2020); Caron et al. (2020). Representation learning algorithms have been shown to achieve good results when evaluated using a linear evaluation protocol, semisupervised training on ImageNet, or transfer to downstream tasks. A straightforward solution to the clustering problem is to use the features obtained via self-supervised learning and apply an out-of-the-box clustering algorithm (such as k-means) to compute data clusters. However, the performance of these features for clustering (using an out-of-the-box clustering algorithm) is not known, and as seen in our results, these features may be improved for clustering purposes.

On the other hand, deep clustering involves simultaneously learning cluster assignments and features using deep neural networks. Learning the feature space jointly with a clustering objective may lead to degenerate solutions, which until recently limited end-to-end implementations of clustering with representation learning approaches (Caron et al., 2018a). Subsequently, several works have been developed (Xie et al., 2016a; Caron et al., 2018a; Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a; Tao et al., 2021). We provide details regarding some of these works in Sect. 1.1. Our previous work (Regatti et al., 2021) showed some encouraging results, and we extend it substantially here. We categorize the current clustering and representation learning works based on the consistency constraints that are used to define their objective functions. We define an additional notion of consistency, consensus consistency, which ensures that representations are learned to induce similar partitions for variations in the representation space, different clustering algorithms or different initializations of a clustering algorithm. We use consensus consistency and propose an end-to-end learning approach that outperforms other end-to-end learning methods for image clustering. We summarize our contributions as follows:

  1. We introduce different notions of consistency (exemplar, population and consensus) that are used in unsupervised representation learning.

  2. We propose a novel clustering algorithm that incorporates the above three consistency constraints and can be trained in an end-to-end way. An ensemble is generated in the consensus clustering objective by performing random transformations on the underlying embeddings. Combining these methods is not trivial, and this combination, together with our new consensus loss, is novel.

  3. We show that the proposed algorithm ConCURL (consensus clustering with unsupervised representation learning) outperforms baselines on popularly used computer vision datasets when evaluated with clustering metrics.

  4. We demonstrate the clustering abilities of trained models under a data shift and argue for the need for different evaluation metrics for deep clustering algorithms.

  5. We study the impacts of various hyperparameters, data augmentation methods, and image resolutions on the clustering ability of the proposed algorithm.

1.1 Related work

1.1.1 Self-supervised learning

Self-supervised learning is used to learn representations in an unsupervised way by defining some pretext tasks. There are many different flavors of self-supervised learning, such as instance discrimination (ID) tasks (Wu et al., 2018; Zhuang et al., 2019) and contrastive techniques (He et al., 2020; Chen et al., 2020). In ID tasks, each image is considered its own category so that the learned embeddings are well separated (Wu et al., 2018). Building on the ID task, Zhuang et al. (2019) proposed a local aggregation (LA) method based on a robust clustering objective (using multiple runs of k-means) to move statistically similar data points closer in the representation space and dissimilar data points further away. In contrastive techniques such as simple contrastive learning of visual representations (SimCLR) (Chen et al., 2020) and momentum contrast (MoCo) (He et al., 2019), representations are learned by maximizing the agreement between different augmented views of the same data example (known as positive pairs) and minimizing the agreement between the augmented views of different examples (known as negative pairs). Recent works, including Bootstrap Your Own Latent (BYOL) (Grill et al., 2020) and swapping assignments between multiple views (SwAV) (Caron et al., 2020), have achieved state-of-the-art results without requiring negative pairs. Although self-supervised learning methods exhibit impressive performance on a variety of problems, it is not clear whether the learned representations are good for clustering.

1.1.2 Clustering with representation learning

DEC (Xie et al., 2016a) is one of the first algorithms to show that deep learning can be used to effectively cluster images in an unsupervised manner; this approach uses features learned from an autoencoder to fine-tune the cluster assignments. DeepCluster (Caron et al., 2018a) shows that it is possible to train deep convolutional neural networks (DeCNNs) in an end-to-end manner with pseudolabels that are generated by a clustering algorithm. Subsequently, several works (Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a) have introduced end-to-end clustering-based objectives and achieved state-of-the-art clustering results. For example, in the Gaussian attention network for image clustering (GATCluster) (Niu et al., 2020a), training is performed in two distinct steps (similar to Caron et al. (2018a)), where the first step is to compute pseudotargets for a large batch of data and the second step is to train the model in a supervised way using these pseudotargets. Both DeepCluster and GATCluster use k-means to generate pseudolabels, which may not scale well. Wu et al. (2019a) proposed deep comprehensive correlation mining (DCCM), where discriminative features are learned by taking advantage of the correlations among the data using pseudolabel supervision and the triplet mutual information among the features. However, DCCM may be susceptible to trivial solutions (Niu et al., 2020a). Invariant information clustering (IIC) (Ji et al., 2019a) maximizes the mutual information between the class assignments of two different views of the same image (paired samples) to learn representations that preserve the commonalities between the views while discarding instance-specific details. It has been argued that the presence of an entropy term in mutual information plays an important role in avoiding degenerate solutions. However, a large batch size is needed for the computation of mutual information in IIC; this process may not be scalable for larger image sizes, which are common in popular datasets (Ji et al., 2019a; Niu et al., 2020a). Huang et al. (2020a) extended the celebrated maximal margin clustering idea to the deep learning paradigm by learning the most semantically plausible clusters through the minimization of a proposed partition uncertainty index. Their partition confidence maximization (PICA) algorithm uses a stochastic version of this index, thereby facilitating minibatch training. PICA fails to assign a sample to the correct cluster when that sample has either high foreground or background similarity to samples in other clusters. In a more recent approach, contrastive clustering (Li et al., 2021), a contrastive learning loss (as in SimCLR (Chen et al., 2020)) was adopted along with an entropy term to avoid degenerate solutions. Similarly, Tao et al. (2021) combined ID (Wu et al., 2018) with novel softmax-formulated decorrelation constraints for representation learning and clustering. Their approach outperforms state-of-the-art methods and improves upon the instance discrimination method. Our method also improves upon ID and outperforms the method of Tao et al. (2021) on all datasets considered. There are other non-end-to-end approaches, such as SCAN (Van Gansbeke et al., 2020), which use the learned representations from a pretext task to find the images that are semantically closest to the given image using the nearest neighbors algorithm.
Similarly, SPICE (Niu et al., 2021), another state-of-the-art non-end-to-end approach, divides the clustering network into two parts: one to measure instance-level similarity and one to identify cluster-level discrepancy.

2 Consensus clustering

One of the distinguishing factors between supervised learning and unsupervised learning is the existence of ground truth labels that construct a global constraint based on examples. In most self-supervised learning methods, the ground truth is replaced with some consistency constraint (Chen et al., 2020). Without a doubt, the performance of any self-supervised method is a function of the power of the consistency constraint used. We define two types of consistency constraints: exemplar consistency and population consistency.

Definition 1

Exemplar consistency: Representation learning algorithms that learn closer representations (in terms of some distance metric) for different augmentations of the same data point are said to follow exemplar consistency.

Examples of the usage of exemplar consistency include contrastive learning methods such as MoCo (He et al., 2019) and SimCLR (Chen et al., 2020). In these methods, a positive pair of images is defined as any two image augmentations of the same image, and a negative pair consists of any two different images.

Definition 2

Population consistency: Representation learning algorithms that ensure that two similar data points, or any augmentations of the same data point, belong to the same cluster (or population) are said to follow population consistency.

DeepCluster (Caron et al., 2018a) is a prominent self-supervised method that utilizes population consistency (Definition 2) by enforcing a clustering constraint on the input dataset; note that each cluster contains data points that are similar to each other. Similarly, SwAV (Caron et al., 2020) is another example of a population consistency method.

Definition 3

Consensus consistency: Representation learning algorithms that are able to learn representations that induce similar partitions for variations in the given representation space (subsets of features, random projections, etc.), different clustering algorithms (k-means, Gaussian mixture models (GMMs), etc.) or different initializations of clustering algorithms are said to follow consensus consistency.

Earlier works on consensus consistency did not consider representation learning and used the knowledge reuse framework (see Strehl and Ghosh (2002), Ghosh and Acharya (2011)), where the cluster partitions were available (the features were irrelevant) or the features of the data were fixed. For example, Fern and Brodley (2003) successfully applied random projections to consensus clustering by performing k-means clustering on multiple random projections of the fixed features of input data. In contrast, the notion of consensus consistency here deals with learning representations that achieve a consensus regarding the cluster assignments of multiple clustering algorithms. One example of a method that enforces consensus consistency is LA (Zhuang et al., 2019). LA builds on the ID task (Wu et al., 2018) and was proposed as a method based on a robust clustering objective (using multiple runs of k-means) to move statistically similar data points closer in the representation space and dissimilar data points further away. However, Zhuang et al. (2019) did not evaluate the method with clustering metrics and only focused on linear evaluation using the learned features. Subsequently, we conducted a study to evaluate the clustering performance of these features (see Appendix) and observed that LA performed poorly when evaluated for clustering accuracy. In Definition 3, we inherently assume that the clustering algorithms under consideration have been tuned properly. Unfortunately, the definition of consensus consistency is ill posed, and there can be arbitrarily many different partitions that can satisfy the given condition (see footnote 1). We show that when exemplar consistency is used as an inductive bias, the resulting objective function achieves impressive performance on challenging datasets. Combining the exemplar and population constraints with consensus consistency seamlessly and effectively for clustering is the basis of our proposed method.

2.1 Loss for consensus and population consistency

We focus on learning generic representations that satisfy Definition 3 for clustering. By using different clustering algorithms or different representation variations (such as projections), one can easily generate multiple different partitions of the same data. In unsupervised learning, it is not known which partitioning is correct. To tackle this problem, some additional assumptions are needed.

We assume that there is an underlying latent space \(\mathcal {Z}^*\) (possibly not unique) such that all clusterings (based on latent space, algorithm or initialization variations) that take input data from this latent space produce similar data partitions. Furthermore, every clustering algorithm that also takes the true number of clusters as input produces the partition that is closest to the hypothetical ground truth. Moreover, we assume that there exists a function \(h:\mathcal {X} \rightarrow \mathcal {Z}^*\), where \(\mathcal {X}\) represents the input space and \(\mathcal {Z}^*\) represents the underlying latent space. We call this assumption the principle of consensus. The open question is how one constructs an efficient loss that reflects the principle of consensus. We define one such way below.

Given an input batch of images \(\mathcal {X}_b\subset \mathcal {X}\), the goal is to partition these images into K clusters. We obtain p views of these images (by different image augmentation approaches) and define a loss such that cluster assignment of any of the p views matches the target estimated from any other view. Without loss of generality, we define a loss for \(p = 2\) views. The two views \(\mathcal {X}_b^1, \mathcal {X}_b^2\) are generated using two randomly chosen image augmentations.

We learn a representation space \(\mathcal {Z}_0\) at the end of every training iteration and obtain M variations of \(\mathcal {Z}_0\) as \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) (e.g., random projections). The goal is to build an efficient loss according to the principle of consensus among \(\mathcal {Z}_0\) and its M variations \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) such that we learn the latent space \(\mathcal {Z}^*\) at the end of training (i.e., the learned features lie in the latent space described above). For a given batch of images \(\mathcal {X}_b\) and a representation space \(\mathcal {Z}_m, \forall m \in [1,...,M]\), we denote the cluster assignment probability of image i and cluster j for view 1 as \(\textbf {p}_{i,j}^{1}(\mathcal {Z}_m)\) and that for view 2 as \(\textbf {p}_{i,j}^{2}(\mathcal {Z}_m)\). We concisely use \(\tilde{\textbf {p}}^{(1,m)},\tilde{\textbf {p}}^{(2,m)}\) when we talk about all the images and all the clusters. Here, we define a loss that incorporates “population consistency" and “consensus consistency". We assume that the target cluster assignment probabilities for the representation \(\mathcal {Z}_0\) are given (as in DeepCluster (Caron et al., 2018a)), and they are denoted as \(\textbf {q}_{i,j}^{1}\) for view 1 and \(\textbf {q}_{i,j}^{2}\) for view 2.

We define the loss for any representation space \(\mathcal {Z}\) and batch of images \(\mathcal {X}_b\) as

$$\begin{aligned} L_{\mathcal {Z}_m}^1 &= - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{2}_{ij} \log \textbf {p}^{1}_{ij}(\mathcal {Z}_m), \\ L_{\mathcal {Z}_m}^2 &= - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{1}_{ij} \log \textbf {p}^{2}_{ij}(\mathcal {Z}_m), \\ L_{\mathcal {Z}} &= \sum _{m = 1}^M \Big ( L_{\mathcal {Z}_m}^1 + L_{\mathcal {Z}_m}^2 \Big ). \end{aligned}$$
(1)

Note that here, consensus among the clustering results is enforced via the common targets \(\textbf {q}\). An overview of the procedure is shown in Fig. 1. The exact details regarding how to obtain variations of \(\mathcal {Z}_0\) and calculate the cluster assignment probabilities \(\textbf {p}\) and targets \(\textbf {q}\) are described in the next section.
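To make the loss concrete, the following is a minimal sketch (in PyTorch-style Python; names and shapes are ours) of how Eq. (1) could be computed, assuming the per-view assignment probabilities and the targets are already available as tensors.

```python
import torch

def consensus_loss(p_view1, p_view2, q_view1, q_view2):
    """Sketch of the consensus loss in Eq. (1).

    p_view1, p_view2: lists of M tensors of shape (B, K) with the predicted
        assignment probabilities p(Z_m) of views 1 and 2 for each of the M
        variations of the representation space.
    q_view1, q_view2: tensors of shape (B, K) with the target codes q
        computed for the un-transformed space Z_0.
    """
    B = q_view1.shape[0]
    total = torch.zeros(())
    for p1, p2 in zip(p_view1, p_view2):
        # Swapped prediction: the targets of one view supervise the other view.
        l1 = -(q_view2 * torch.log(p1 + 1e-8)).sum() / (2 * B)
        l2 = -(q_view1 * torch.log(p2 + 1e-8)).sum() / (2 * B)
        total = total + l1 + l2
    return total
```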

Fig. 1
figure 1

An illustration of the consensus loss part of ConCURL

2.2 End-to-End Stochastic Gradient Descent (SGD)-Based trainable consensus loss

In this section, we propose an end-to-end trainable algorithm and define a way to compute \(\textbf {p}\) and \(\textbf {q}\). When the cluster assignment probabilities \(\textbf {p}\) can take any values in the set [0, 1], we refer to the process as soft clustering, and when \(\textbf {p}\) is restricted to the set \(\{0,1\}\), we refer to the process as hard clustering.

Without loss of generality, in this paper, we focus on soft clustering, which makes it easier to define a loss function using the probabilities and update the parameters using the gradients to enable end-to-end learning. We follow the soft clustering framework presented in SwAV (Caron et al., 2020), which is a centroid-based technique that aims to maintain consistency between the clusterings of the augmented views \(\mathcal {X}_b^{1}\) and \(\mathcal {X}_b^{2}\). We store a set of randomly initialized prototypes \(C_0=\{ \textbf {c}_0^1,\cdots ,\textbf {c}_0^K \} \in \mathbb {R}^{d\times K}\), where K is the number of clusters and d is the dimensionality of the prototypes. These prototypes are used to represent clusters and define a “consensus consistency" loss. We compute M variations of \(C_0\) as \(C_1,...,C_M\) exactly as we compute the M variations of \(\mathcal {Z}_0\).

2.2.1 Cluster assignment probability \(\textbf {p}\)

We use a two-layer multilayer perceptron (MLP) g to project the features \(\textbf {f}^1 = f_\theta (\mathcal {X}_b^1)\) and \(\textbf {f}^2 = f_\theta (\mathcal {X}_b^2)\) to a lower-dimensional space \(\mathcal {Z}_0\) (of size d). The outputs of this MLP (referred to as cluster embeddings) are denoted as \({Z}_0^1 = \{\textbf {z}_0^{1,1}, \ldots , \textbf {z}_0^{1,B} \}\) and \({Z}_0^2 = \{\textbf {z}_0^{2,1}, \ldots , \textbf {z}_0^{2,B} \}\) for view 1 and view 2, respectively. Note that \(h: \mathcal {X} \rightarrow \mathcal {Z}\) defined in Sect. 2.1 is equivalent to the composite function of \(f: \mathcal {X} \rightarrow \Phi \) and \(g: \Phi \rightarrow \mathcal {Z}\).

For a latent space \(\mathcal {Z}\), we compute the probability of assigning a cluster j to image i using the normalized vectors \(\bar{\textbf {z}}^{1,i} = \frac{\textbf {z}^{1,i}}{\Vert \textbf {z}^{1,i}\Vert }\), \(\bar{\textbf {z}}^{2,i} = \frac{\textbf {z}^{2,i}}{\Vert \textbf {z}^{2,i}\Vert }\) and \(\bar{\textbf {c}}_j = \frac{{\textbf{c}}^j}{\Vert {\textbf{c}}^j\Vert }\) as

$$\begin{aligned} \textbf {p}_{i,j}^{1}(\mathcal {Z},C) = \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j'} \rangle \right) }, \qquad \textbf {p}_{i,j}^{2}(\mathcal {Z},C) = \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j'} \rangle \right) }. \end{aligned}$$
(2)

We concisely write \( \textbf {p}^1_{i}(\mathcal {Z}) = \{ \textbf {p}^1_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \) and \( \textbf {p}^2_{i} = \{ \textbf {p}^2_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \). Here, \(\tau \) is a temperature parameter, and we set its value to 0.1, similar to Caron et al. (2020). Note that we use \(\textbf {p}_{i}\) to denote the predicted cluster assignment probabilities for image i (when not referring to a particular view), and the shorthand notation \(\textbf {p}\) is used when i is clear from context.
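The assignment probabilities of Eq. (2) amount to a temperature-scaled softmax over cosine similarities between cluster embeddings and prototypes. A minimal sketch, assuming the embeddings and prototypes are given as tensors (names are ours), is shown below.

```python
import torch
import torch.nn.functional as F

def assignment_probs(z, c, tau=0.1):
    """Eq. (2): soft cluster-assignment probabilities for one view.

    z: (B, d) cluster embeddings of a batch (one augmented view).
    c: (K, d) prototype vectors.
    Returns a (B, K) matrix whose rows sum to one.
    """
    z = F.normalize(z, dim=1)        # z / ||z||
    c = F.normalize(c, dim=1)        # c / ||c||
    logits = z @ c.t() / tau         # cosine similarities scaled by 1/tau
    return F.softmax(logits, dim=1)
```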

2.2.2 Targets \(\textbf {q}\)

The idea of predicting the assignments \(\textbf {p}\) and then comparing them with the high-confidence estimates \(\textbf {q}\) (referred to as codes henceforth) of the predictions was proposed by Xie et al. (2016a). While Xie et al. (2016a) used pretrained features (from autoencoders) to compute the predicted assignments and the codes, the use of their approach in an end-to-end unsupervised manner might lead to degenerate solutions. Asano et al. (2019) avoided such degenerate solutions by enforcing an equipartition constraint (the prototypes equally partitioned the data) during code computation using the Sinkhorn-Knopp algorithm (Cuturi, 2013). Caron et al. (2020) followed a similar formulation but computed the codes for the two views separately in an online manner for each minibatch. The assignment codes are computed by solving the following optimization problem:

$$\begin{aligned} Q^1 &= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^1) + \epsilon H(Q), \\ Q^2 &= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^2) + \epsilon H(Q), \end{aligned}$$
(3)

where \( Q = \{\textbf {q}_1, \ldots , \textbf {q}_B \} \in \mathbb {R}_{+}^{K\times B}\), \(\mathcal {Q}\) is the transportation polytope defined by

$$\begin{aligned} \mathcal {Q} = \{\textbf {Q}\in \mathbb {R}^{K\times B}_{+}~\text {s.t}~ \textbf {Q}\textbf {1}_B = \frac{1}{K}\textbf {1}_K, \textbf {Q}^T\textbf {1}_K = \frac{1}{B}\textbf {1}_B \} \end{aligned}$$

Here, \(\textbf {1}_K\) is the K-dimensional vector of ones, and \( H(Q) = -\sum _{i,j}Q_{i,j}\log Q_{i,j} \) is the entropy of Q. The above optimization problem is solved using a fast version of the Sinkhorn-Knopp algorithm (Cuturi, 2013), as described by Caron et al. (2020).
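For illustration, the following is a sketch of a SwAV-style Sinkhorn-Knopp iteration for computing the codes; the regularization strength and iteration count shown are assumptions for illustration, not values prescribed by our method.

```python
import torch

@torch.no_grad()
def sinkhorn_codes(scores, eps=0.05, n_iters=3):
    """Sketch of the fast Sinkhorn-Knopp iteration for the codes Q of Eq. (3).

    scores: (K, B) matrix of prototype-embedding similarities (C_0^T Z_0).
    eps, n_iters: entropy regularization and iteration count (illustrative).
    Returns a (B, K) matrix of codes q.
    """
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()                              # joint probability matrix
    K, B = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True) / K   # rows sum to 1/K (equipartition)
        Q = Q / Q.sum(dim=0, keepdim=True) / B   # columns sum to 1/B
    return (Q * B).t()                           # each row is a distribution over clusters
```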

After computing the codes \(Q^1 \) and \(Q^2\), to maintain the consistency between the clustering results of the augmented views, the loss is computed using the probabilities \(\textbf {p}_{ij}\) and the assigned codes \(\textbf {q}_{ij}\) by comparing the probabilities of view 1 with the assigned codes of view 2 and vice versa, as in (1).

2.2.3 Defining variations of \(Z_0\) and \(C_0\)

To compute \(\{Z_1,...,Z_M \}\), we project the d-dimensional space \(Z_0\) to a D-dimensional space using a random projection matrix. We follow the same procedure to compute \(\{C_1,...,C_M \}\) from \(C_0\). At the beginning of the algorithm, we randomly initialize M such transformations and fix them throughout training. Suppose that by using a particular random transformation (a randomly generated matrix A), we obtain \(\tilde{\textbf {z}} = A\textbf {z},\; \tilde{\textbf {c}} = A\textbf {c}\). We then compute the softmax probabilities using the normalized vectors \(\tilde{\textbf {z}}/\Vert \tilde{\textbf {z}}\Vert \) and \(\tilde{\textbf {c}}/\Vert \tilde{\textbf {c}}\Vert \). Repeating this step for each of the M transformations results in M predicted cluster assignment probabilities for each view. When the network is untrained, the embeddings \(\textbf {z}\) are random, and applying the random transformations, followed by computing the predicted cluster assignments, leads to a diverse set of soft cluster assignments. The network weights are updated using the stochastic gradients of the loss.
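A minimal sketch of this step is given below; the matrix scaling convention and the shapes are our illustrative choices, and the transformations are drawn once and then kept frozen.

```python
import torch
import torch.nn.functional as F

def make_random_projections(M, d, D, seed=0):
    """Draw M fixed random projection matrices (d -> D) once at the start of
    training; they stay frozen afterwards (M, d, D are illustrative)."""
    g = torch.Generator().manual_seed(seed)
    return [torch.randn(D, d, generator=g) / D ** 0.5 for _ in range(M)]

def transformed_assignment_probs(z, c, projections, tau=0.1):
    """For each fixed projection A, map embeddings z (B, d) and prototypes
    c (K, d) to the transformed space, re-normalize, and compute the softmax
    cluster-assignment probabilities as in Eq. (2)."""
    probs = []
    for A in projections:
        z_m = F.normalize(z @ A.t(), dim=1)      # z_tilde / ||z_tilde||
        c_m = F.normalize(c @ A.t(), dim=1)      # c_tilde / ||c_tilde||
        probs.append(F.softmax(z_m @ c_m.t() / tau, dim=1))
    return probs                                  # M tensors of shape (B, K)
```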

2.2.4 Backbone loss


To better capture exemplar consistency, based on previous evidence of successful clustering with the ID approach (Tao et al., 2021), we use ID (Wu et al., 2018) as one of the losses, as in Tao et al. (2021). The exemplar objective of ID is to classify each image as its own class.

Given n images and a neural network \(f_{\theta }\) for calculating features, we first normalize the features \(\bar{f}_{\theta }(x) = \frac{f_{\theta }(x)}{\Vert f_{\theta }(x) \Vert }\). Then, ID defines the probability of an example x being recognized as the i-th example as

$$\begin{aligned} P(i \vert f_{\theta }(x)) = \frac{\exp \left( \langle \bar{f}_{\theta }(x_i), \bar{f}_\theta (x) \rangle / \tau \right) }{\sum _{j=1}^n \exp \left( \langle \bar{f}_{\theta }(x_j), \bar{f}_\theta (x) \rangle / \tau \right) }. \end{aligned}$$
(4)

ID then uses the uniform distribution as a noise distribution \(P_n = \frac{1}{n}\) to compute the probability that data example x comes from a data distribution \(P_d\) as opposed to the noise distribution \(P_n\) as \(h(i, f_{\theta }(x)) := \frac{P(i\vert f_{\theta }(x))}{P(i\vert f_{\theta }(x)) + m P_n(i)}\). Assuming that the noise samples are m times more frequent than actual data samples, the ID loss is defined as

$$\begin{aligned} L_{b} = - E_{P_d} \left[ \log h(i, x)\right] - m\, E_{P_n} \left[ \log (1 - h(i, x')) \right] , \end{aligned}$$
(5)

where \(x'\) is the feature from a randomly drawn image other than image x in a given dataset. We exactly follow the framework developed in Wu et al. (2018) to implement the ID loss.
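As a rough sketch of Eq. (4) and the posterior \(h\) defined above, assuming a precomputed feature bank and omitting the memory-bank bookkeeping of the full implementation in Wu et al. (2018):

```python
import torch
import torch.nn.functional as F

def id_probability(bank, x_feat, tau=0.1):
    """Eq. (4): probability of recognizing example x as each instance i,
    given a bank of n instance features (n, d) and the feature f(x) of x."""
    bank = F.normalize(bank, dim=1)
    x_feat = F.normalize(x_feat, dim=0)
    return F.softmax(bank @ x_feat / tau, dim=0)   # length-n vector P(i | f(x))

def nce_posterior(p_i, n, m):
    """h(i, f(x)): posterior that the pair (i, x) comes from the data
    distribution rather than the uniform noise P_n = 1/n, with noise samples
    assumed m times more frequent than data samples."""
    return p_i / (p_i + m / n)
```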

The final loss that we seek to minimize is the combination of the losses \(L_{\mathcal {Z}}\) (Eq. 1) and \(L_b\) (Eq. 5),

$$\begin{aligned} L_{\text {total}} = \alpha L_{\mathcal {Z}} + \beta L_b, \end{aligned}$$
(6)

where \(\alpha , \beta \) are nonnegative constants. Details of the algorithm are given in Algorithm 1, and we also provide a PyTorch-style pseudocode in Algorithm 2 in the Appendix.

Algorithm 1 (figure)

2.2.5 Computing the cluster metrics

In this section, we describe the approach used to compute the cluster assignments and the metrics chosen to evaluate their quality. Note that we assume that the number of true clusters (K) in the data is known.

There are two ways to compute the cluster assignments. The first way is to use the embeddings generated by the backbone; here, the embeddings are the outputs of the ID block \(f_{\theta }(x)\). The embeddings of all the images are computed, and then we perform k-means clustering.

The second method is to use the soft clustering block to compute the cluster assignments. It is sufficient to use the computed probability assignments \(\{\textbf {p}_i\}_{i=1}^N\) or the computed codes \(\{\textbf {q}_i\}_{i=1}^N\) and assign the cluster index as \(c_i = \arg \max _{k} \textbf {q}_{ik}\) for the \(i^{\text {th}}\) data point. Once the model is trained, in this second approach, cluster assignment can be performed online without requiring the computation of the embeddings of all the input data.

We evaluate the quality of the clusterings using metrics such as the cluster accuracy, normalized mutual information (NMI), and adjusted Rand index (ARI). To compute the clustering accuracy, we are required to solve an assignment problem (computed using a Hungarian match  (Kuhn, 1955, 1956)) between the true class labels and the cluster assignments. In our analysis, we observe that using k-means with the embeddings produced by the ID block achieves better clustering accuracy, and we use this method throughout the paper while evaluating our proposed algorithm.
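A minimal sketch of this evaluation, using off-the-shelf k-means and the Hungarian assignment from SciPy (function and variable names are ours), is given below.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def cluster_accuracy(labels, embeddings, K):
    """Cluster the backbone embeddings with k-means and score the clustering
    accuracy via a Hungarian match between cluster indices and class labels
    (labels are assumed to be integers in {0, ..., K-1})."""
    pred = KMeans(n_clusters=K, n_init=10).fit_predict(embeddings)
    # cost[i, j] = number of points in cluster i carrying true label j
    cost = np.zeros((K, K), dtype=np.int64)
    for p, t in zip(pred, labels):
        cost[p, t] += 1
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize matched counts
    return cost[row, col].sum() / len(labels)
```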

2.3 Generating multiple clustering results

Fred and Jain (2005) discussed different ways to generate cluster ensembles; these methods are tabulated in Table 1. In our proposed algorithm, we focus on choosing the appropriate data representation to generate cluster ensembles.

Table 1 Different ways to generate ensembles

By fixing a stable clustering algorithm, we can generate arbitrarily large ensembles by applying different transformations on the embeddings. Random projections were previously successfully used in consensus clustering (Fern and Brodley, 2003). By generating ensembles using random projections, we have control over the amount of diversity we can induce into the framework by varying the dimensionality of the random projection. In addition to random projections, we also use diagonal transformations (Hsu et al., 2018) where different components of the representation vector are scaled differently. Hsu et al. (2018) illustrated that such scaling enables a diverse set of clusterings, which is helpful for the meta learning task. We study ablations over the number of transformations needed and the dimensions of these transformations in Sect. 5.
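As an illustration of the diagonal-transformation variant mentioned above, the following sketch draws M random per-coordinate scalings; the uniform scaling range used here is our assumption for illustration only.

```python
import numpy as np

def diagonal_transforms(M, d, seed=0):
    """M random diagonal scalings in the spirit of Hsu et al. (2018): each
    transform rescales every coordinate of the embedding independently,
    giving a cheap source of clustering diversity."""
    rng = np.random.default_rng(seed)
    return [np.diag(rng.uniform(0.5, 1.5, size=d)) for _ in range(M)]
```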

3 Understanding the consensus objective

We investigate a potential hypothesis regarding “training driven by noisy cluster assignments” that can shed light on the success of ConCURL (see footnote 2). The hypothesis stems from the following intuition. Using different clustering algorithms, the generated cluster assignments are noisy versions of the hypothetical ground truth; as the training process progresses, the noise in the cluster assignments is reduced, and eventually all the different clustering algorithms considered generate similar cluster assignments.

We verify this hypothesis empirically with the help of the following experiments on the STL-10 dataset: (i) we observe the noisy clusterings generated by using random projections, and (ii) we verify that the noise in the cluster assignments is reduced as training progresses.

For the purpose of demonstrating noisy cluster assignments, we use synthetic data as follows. We generate three clusters in \(\mathbb {R}^2\), as shown in Fig. 2a, and compute the centroids of each cluster. Here, the centroids act as the prototypes. We then generate a Gaussian random projection matrix \(A \in \mathbb {R}^{2\times 2}\). We first normalize the embeddings (2-dimensional features) and the centroids (see Fig. 2b). Using the matrix A, we transform both the embeddings and the prototypes to the new space and normalize the resultant vectors.

Fig. 2
figure 2

a: Three-cluster synthetic dataset (embeddings \(\textbf {z}\)), b: Normalized data

We follow the soft clustering framework discussed earlier and compute the soft cluster assignments for the original and transformed data. We observe that the cluster assignment probabilities in the new space are noisy versions of the cluster assignment probabilities in the original space (see Table 2).
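A compact sketch of this synthetic experiment is given below; the cluster locations, spread, and temperature are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated clusters in R^2 (synthetic data as in Fig. 2a) and
# their centroids, which act as the prototypes.
centers = np.array([[0.0, 5.0], [5.0, 0.0], [-5.0, -5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
prototypes = np.vstack([X[i * 100:(i + 1) * 100].mean(axis=0) for i in range(3)])

def soft_assign(Z, C, tau=0.1):
    """Softmax cluster-assignment probabilities over normalized embeddings."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    logits = Zn @ Cn.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

A = rng.normal(size=(2, 2))                       # Gaussian random projection
p_orig = soft_assign(X, prototypes)               # assignments in the original space
p_proj = soft_assign(X @ A.T, prototypes @ A.T)   # noisy versions after the projection
```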

Table 2 Predicted cluster assignment probabilities and target probabilities obtained from the Sinkhorn algorithm for four data points

To verify that the noise in the cluster assignment probabilities is reduced as training progresses, we perform the following experiment. We measure the similarity among the cluster assignments at every epoch to observe the effect of consensus as training progresses. For each random projection used, we use the cluster assignment probabilities \(\tilde{\textbf {p}}\) and compute cluster assignments by taking an \(\mathop {\hbox {arg max}}\limits \) on \(\tilde{\textbf {p}}\) for each image. We obtain M such cluster assignments due to the M random projections. We then compute a pairwise NMI (similar to the analysis of Fern and Brodley (2003)) between every two cluster assignments and compute the average and standard deviation of the pairwise NMI values across the \(\frac{M(M-1)}{2}\) pairs. An NMI score of 1.0 signifies that the two clusters perfectly correlate with each other, and a score of 0.0 implies that the two clusters are uncorrelated. We observe from Fig. 3 that the pairwise NMI increases as training progresses and becomes closer to 1. At the beginning of training, the cluster assignments are very diverse (small NMI scores with a large standard deviation), and as training progresses, the diversity is reduced (large NMI scores with a smaller standard deviation). This observation leads us to conclude that for the applied clustering algorithms (defined using different random projections), we have learned an embedding space where the different cluster assignments concur. In other words, “consensus consistency" is achieved. Additionally, it is evident from our empirical results in Sect. 4 that we achieve an improved overall clustering accuracy.
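The pairwise NMI statistic can be computed with scikit-learn as in the following sketch, where each entry of `assignments` is the length-N hard cluster assignment obtained from one of the M random projections.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def pairwise_nmi(assignments):
    """Mean and standard deviation of the NMI over all M(M-1)/2 pairs of the
    M hard cluster assignments (each entry is a length-N label vector)."""
    scores = [normalized_mutual_info_score(a, b)
              for a, b in combinations(assignments, 2)]
    return np.mean(scores), np.std(scores)
```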

Fig. 3
figure 3

Pairwise NMI values as a way to measure the diversity in the ensemble; the results are obtained for the STL-10 dataset, and the pairwise NMI values for different random projection dimensions are shown (the original dimensionality of \(\textbf {z}\) is 256)

If noisy cluster assignments are the reason behind the improved performance, one might wonder if it is sufficient to simply add noise to the original cluster assignments rather than computing multiple cluster assignments. However, this may not be fruitful: if noise is added externally, one must define a scheduler to reduce the noise as training progresses. In contrast, in ConCURL, the end-to-end learning algorithm itself determines the rate of consensus or agreement between \(\textbf {p} \) and \(\tilde{\textbf {p}}\). In the next section, we provide empirical evidence of the effectiveness of our method.

4 Empirical evaluation

Evaluating clustering algorithms is a notoriously hard problem. The reference text Jain and Dubes (1988) states the following: The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.

In the literature on representation learning for clustering, e.g., Li et al. (2021); Huang et al. (2020b); Tao et al. (2021), the following methodology has been used to evaluate the performance of different algorithms: a set of models with different hyperparameters is trained, the models are sorted by their observed clustering performance, and the best model's results are reported. This methodology is called max performance in the remainder of the paper. We assess the quality of the learned embeddings by using five challenging image datasets for clustering and report their performance with the max-performance strategy. Although the max-performance procedure provides some insights into the performance of the method under consideration, we provide additional insights by significantly extending the evaluations. In practice, it is desirable that the learned models can be utilized for datasets other than the training dataset. However, the max-performance method may not be suitable for this purpose. To address this, we design two additional experiments that focus on the performance of cross-model features under a distribution shift. Furthermore, we also assess the quality of the learned embeddings in image retrieval tasks. Finally, we present a detailed ablation study to assess the impact of the loss terms, data augmentation methods, hyperparameters and architecture choices utilized to obtain a more complete picture.

4.1 Image clustering with the max-performance strategy

We evaluate our algorithm and compare it with existing methods on some popular image datasets, namely, ImageNet-10, ImageNet-Dogs, STL-10, CIFAR-10, and CIFAR100-20.

For CIFAR100-20, we use the 20 meta classes as the class labels while evaluating the clustering results. For STL-10, similar to the earlier PICA (Huang et al., 2020a) and GATCluster (Niu et al., 2020a) approaches, we use both training and testing splits for training and evaluation. Note that PICA also uses an unlabeled data split with 100k points in STL-10, which we do not use. ImageNet-10 and ImageNet-Dogs are subsets of ImageNet, and we use only the training splits for these two datasets (Deng et al., 2009). We use the same classes as Chang et al. (2017a) for evaluation on the ImageNet-10 and ImageNet-Dogs datasets. The dataset summary is given in Table 3 and methods compared are given in Table 4. We evaluate the cluster accuracy, NMI, and ARI of each computed cluster assignment (see the Appendix for details).

Table 3 Dataset summary
Table 4 Methods compared

4.1.1 Comparison with state-of-the-art baselines

In our comparison, we consider some state-of-the-art methods that were developed for image clustering problems and targeted for end-to-end training scenarios with random initialization. We should note that we do not consider baselines that use prior information, e.g., the nearest neighbors algorithm derived by using pretrained models. The implementation details of ConCURL are provided in the Appendix, and the results are presented in Table 5.

Table 5 Clustering with the max-performance strategy

We observe that ConCURL outperforms the baseline algorithms considered in terms of all three metrics for all the datasets except STL-10. ConCURL improves the state-of-the-art clustering accuracy by approximately \(17.5\%\) on ImageNet-Dogs, by \(12.7\%\) on CIFAR100-20 and by \(3.8\%\) on CIFAR-10. Although ConCURL improves upon the results of ID (Tao et al., 2021), please note that ID is the backbone used in this paper and is slightly worse than IDFD, as shown in Tao et al. (2021).

The proposed method achieves good clustering performance on popular computer vision datasets. Similar to all the algorithms considered, we assume that K, the number of clusters, is known. However, this may not hold true in practice in real-world applications. In such a case, we may assume an estimate for the upper bound on the number of clusters to use as the number of prototypes. Additionally, we also assume that the dataset is equally distributed among the K clusters. If this assumption (also common in the literature (Huang et al., 2020a; Niu et al., 2020a)) does not hold, the fast Sinkhorn-Knopp algorithm used to solve Eq. 3 may not be optimal.

4.1.2 Performance on the Test split

In the previous section, we studied the clustering performance on the training split used to train the algorithm. Here, we evaluate the clustering performance on a held-out test set. We use the standard test splits of CIFAR-10, CIFAR100-20, ImageNet-10, and ImageNet-Dogs. We use the trained models to extract the features for each test dataset and compute the clustering as above. We observe from Table 6 that the performance is not affected much on the test set. This shows that the algorithm extracts feature representations that cluster well on data not used for training, i.e., it generalizes well when the data are drawn from the same distribution.

Table 6 Clustering on the test dataset

4.1.3 Class-specific accuracy

We present the class-specific accuracies (percentage) and confusion matrices in Fig. 4. In each row i, the \(j^{th}\) entry in the matrix represents the percentage of samples from category i belonging to the cluster of category j. For better visualization, we round each percentage to the nearest integer; as a result, the row sums may not equal 100. For perfect clustering, all elements along the diagonal should be equal to 100. Here, we note some interesting observations. For ImageNet-10, the airliner category shows the worst performance, with 7% of airliner samples being confused with the airship category. Additionally, 4% of the samples from the orange category are categorized as soccer balls. In ImageNet-Dogs, there are three types of spaniels (Blenheim spaniels, Brittany spaniels and Welsh springer spaniels) that look very similar to each other. Nineteen percent of the samples from the Blenheim spaniel category are categorized as Brittany spaniels, and 5% are categorized as Welsh springer spaniels. Similarly, 16% of the samples from the Brittany spaniel category are categorized as Blenheim spaniels, and 16% more are categorized as Welsh springer spaniels. Forty-four percent of the samples from the Welsh springer spaniel category are categorized as Brittany spaniels, and 15% are categorized as Blenheim spaniels. Kelpies and Dobermans are also confused with each other, where 39% of the kelpie samples are categorized as Dobermans, and 29% of the Doberman samples are categorized as kelpies. For CIFAR-10, 33% of the samples from the dog category are categorized as cats, and 16% of the samples from the cat category are categorized as dogs.

Fig. 4
figure 4

Confusion matrices

For STL-10, none of the samples from the cat category are categorized correctly. Even though STL-10 and CIFAR-10 have the same list of categories, STL-10 seems harder to cluster than CIFAR-10. Note that STL-10 has only 13000 images to learn representations, while in CIFAR-10, 60000 images are used. For CIFAR100-20, none of the samples from aquatic mammals are categorized correctly. Thirty-seven percent of the samples from aquatic mammals are categorized as fish, and 19% are categorized as large omnivores and herbivores. In the case of reptiles, only 15% of the examples are categorized correctly, but 23% are categorized as fish, 16% as large carnivores and 13% as insects. Surprisingly, 28% of the samples from trees are categorized as aquatic mammals.

4.2 Out-of-distribution results

In this section, we evaluate ConCURL by performing clustering on datasets that are not used during training. We focus mainly on studying the clustering performance on datasets that may be similar to the training dataset and datasets that may have a different number of clusters than the training dataset.

4.2.1 Cross-model accuracy

Here, we calculate the clustering performance achieved when the model is trained on one dataset but evaluated on a different dataset that may have a different number of clusters. For example, the first row in Table 7 gives the performance of the model trained on ImageNet-10 and evaluated on both ImageNet-10 and ImageNet-Dogs. Similarly, the second row shows the performance of the model trained on ImageNet-Dogs. We find that performance on ImageNet-10 is decreased to 35.6% when the model trained on ImageNet-Dogs is used instead of the model trained on ImageNet-10. Similarly, the performance on ImageNet-Dogs is decreased to 17.7 % when the model trained on ImageNet-10 is used instead of the model trained on ImageNet-Dogs. Table 8 provides the same performance metrics for CIFAR-10 and CIFAR100-20.

For the cross-model performance to be high, the embedding function must generalize to the out-of-distribution dataset. It is important to observe that for each pair of datasets considered, the distributions of the datasets are very different because the classes are completely different in both cases (ImageNet-10 vs ImageNet-Dogs, and CIFAR-10 vs CIFAR100-20). However, since we are considering datasets with a small number of data points and a small number of classes (see Table 3), the representation power of the learned embeddings is limited, and this affects the cross-model accuracy. Moreover, the consensus loss \(L_{\mathcal {Z}}\) assumes knowledge of the number of clusters in the dataset. Therefore, the embeddings learned by optimizing the \(L_{\text {total}}\) loss on one dataset may be suboptimal for clustering a dataset with a different number of clusters.

It is clear that these performance drops are significant, and the generalization performance of the learned embeddings needs to be assessed by taking out-of-distribution datasets into account. However, since there are only 2 groups of different datasets, it is difficult to reach a definitive conclusion. Hence, in the following section, we propose a new evaluation methodology that sheds light on the out-of-distribution performance of the learned embeddings.

Table 7 ImageNet-10 vs. ImageNet-Dogs: cross-model performance
Table 8 CIFAR-10 vs. CIFAR100-20: cross-model performance

4.2.2 ImageNet random-10 and random-15 accuracies

Here, we compare the baseline model trained with ID (Wu et al., 2018) with our proposed method ConCURL. We randomly sample 10 and 15 classes from the 1000-class ImageNet data and evaluate the clustering accuracy obtained on the training split of the data using the model trained on the original ImageNet-10 and ImageNet-Dogs sets. We repeat the process 100 times for both the 10-class and 15-class datasets and call them the random-10 and random-15 datasets, respectively. Note that we do not retrain the model on the randomly sampled dataset; we only evaluate the model on this set. We show the histogram of the obtained accuracies for these 100 random datasets. In Fig. 5, we compare the accuracy of the ConCURL model and the baseline ID model trained on ImageNet-10 on both random-10 and random-15. Along with the histogram, we show a Gaussian distribution (along the red dotted line) with first and second moments equal to the average and standard deviation of all accuracies, respectively. Similarly, in Fig. 6, we show the accuracies obtained based on models trained on ImageNet-Dogs. Among the models trained on ImageNet-10, the baseline ID model performs slightly better than the proposed ConCURL model. The trend is reversed for the evaluation based on the model trained on ImageNet-Dogs, where ConCURL performs better than the baseline model. Even though the proposed method performs best with the max-performance strategy, it performs slightly worse on random-10. This result strengthens our argument regarding the need to go beyond the traditional reporting of maximum performance based on the ACC, NMI and ARI metrics.

Fig. 5
figure 5

Histogram of clustering accuracies for models trained on ImageNet-10: (a) ConCURL model evaluated on random-10, (b) ConCURL model evaluated on random-15, (c) Baseline (ID) model evaluated on random-10, (d) Baseline (ID) model evaluated on random-15

Fig. 6
figure 6

Histogram of clustering accuracies for models trained on ImageNet-Dogs: (a) ConCURL model evaluated on random-10, (b) ConCURL model evaluated on random-15, (c) Baseline (ID) model evaluated on random-10, (d) Baseline (ID) model evaluated on random-15

4.3 Cluster visualizations

In this section, we provide two visualizations of the ImageNet-10 dataset. In Fig. 7, each row presents randomly drawn images from each cluster. Images that have red-and-yellow borders are categorized incorrectly and should belong to different clusters. For example, the first image in the fourth row should be in the truck category, but it is categorized as an airliner. In the soccer ball category, there are two mistakes: the first image should be categorized as a truck, and the fourth image should be categorized as a dog. In the ninth row, the last image should be categorized as an airship, but it is categorized as a truck.

Fig. 7
figure 7

Images from the same cluster: In ImageNet-10, we randomly sample six images from all ten clusters and show them above. Each row presents one cluster and images that have red-and-yellow borders are categorized incorrectly and should belong to different clusters

To check whether the learned representations that are closest to each other belong to the same category, we use a retrieval task. In Fig. 8, we show the results. The first image in each row was used as a query (random samples from the dataset), and the five images nearest to the query image were retrieved using their representations. Ideally, one would expect all retrieved images to belong to the same category as the query image. For the example from the soccer ball category, the first image retrieved does not belong to this category; however, both images have water as their main feature. In the last row, the second image retrieved is a penguin and ideally should not be one of the closest matches to an image in the orange category.

Fig. 8
figure 8

Image retrieval: The first image in each row was used as a query (random samples from the dataset ImageNet-10), and the five images nearest to the query image were retrieved using their representations. Images with red-and-yellow borders are retrieved incorrectly

5 Ablation studies

Although the proposed method is trained in an end-to-end manner, each component of the method may have a different impact on the results. We conduct various controlled experiments to quantify the impact of the losses (Sect. 5.1), the data augmentation methods (Sect. 5.2), the image resolution (Sect. 5.3), the number of transformations (Sect. 5.4), the dimensionality of each transformation (Sect. 5.4) and the architecture choice (Sect. 5.5).

5.1 Effects of the loss terms

We consider the following scenarios: training with only the loss \(L_{\mathcal Z}\) (consensus loss), training with only the loss \(L_{b}\) (ID), and training with both losses \(L_{b} + L_{\mathcal Z}\) (ConCURL). We do this for the CIFAR-10 and CIFAR100-20 datasets. We then compare the clustering performance of all three scenarios and observe that training with both losses improves the performance. Additionally, we also compare the loss trajectories during training: we compare the \(L_b\) loss for the case when we train with only \(L_b\) and for the case when we train with only \(L_{\mathcal {Z}}\), and we repeat this for the loss \(L_{\mathcal {Z}}\). The results are summarized in Table 9, where we observe that training with both losses provides much better clustering performance than training with either loss individually.

Table 9 Ablation on the losses during training

The exemplar consistency loss trains with the objective of classifying each data point into its own class. The population and consensus consistency losses train without regard for discrimination among individual data points. An algorithm that is trained with only the consensus loss is therefore not effective in discriminating individual data points. From Fig. 9(a), we can observe that when we train with only \(L_{\mathcal {Z}}\), we do not observe any improvement in the loss \(L_{b}\); when we train with \(L_b+L_{\mathcal {Z}}\), we observe a trajectory similar to that of training with only \(L_b\). This shows that training with \(L_{\mathcal {Z}}\) does not conflict with the \(L_b\) loss.

On the other hand, it is possible that an algorithm that is trained to discriminate individual data points can show some improvement on the consensus loss \(L_{\mathcal {Z}}\). From Fig. 9(b), we can observe that \(L_{\mathcal {Z}}\) is lower when training with \(L_b+L_{\mathcal {Z}}\) than when training with only \(L_{\mathcal {Z}}\). We also observe a small decrease in the \(L_{\mathcal {Z}}\) value when training with only \(L_b\). This shows that optimizing \(L_b\) helps to some extent in achieving a better \(L_{\mathcal {Z}}\).

Fig. 9
figure 9

We perform an ablation over the losses used and compare the loss values for each of the cases for CIFAR100-20

From this discussion, we observe that the two losses contribute differently to the training without much interference or conflict. Indeed, they complement each other, as we observe improved clustering performance for ConCURL in Table 9 and Fig. 10.

Fig. 10
figure 10

We compare the clustering metrics for the ablation over the losses used for CIFAR100-20 and observe that training with \(L_b+L_{\mathcal {Z}}\) (ConCURL) outperforms training with only \(L_b\) or \(L_{\mathcal {Z}}\)

5.2 Effects of data augmentation methods

Augmenting the training data is a standard technique for training deep learning methods (Shorten & Khoshgoftaar, 2019). The backbones used in this study rely on the different views that are generated by applying different augmentations to the input image. Recently, Tian et al. (2020) investigated the impact of data augmentation on contrastive learning methods and shed some light on this topic. In our setting, we would like to quantify the impacts of several data augmentations on the consensus loss.

In Table 10, we show the maximum accuracy achieved when all data augmentation approaches are used and when we skip one data augmentation technique at a time. When all data augmentation methods are used, the maximum accuracies achieved are 0.8459 and 0.4798 on CIFAR-10 and CIFAR100-20, respectively. Dropping random resized cropping causes the largest drop in accuracy for both datasets, followed by color jitter. The other data augmentation techniques are important for obtaining the best possible accuracy but do not have as much of an effect as color jitter and random resized cropping. In Figs. 11 and 12, we show how the running mean of the accuracy progresses during training for each of the experiments in Table 10.

Table 10 Data augmentation details
Fig. 11
figure 11

Effect of data augmentation on CIFAR-10

Fig. 12
figure 12

Effect of data augmentation on CIFAR100-20

5.3 Effect of image resolution

Image resolution is often considered a free parameter (Niu et al., 2020a); however, its effect on clustering performance is not evaluated rigorously in most works. We try to quantify the effects of different resolutions to the greatest extent possible, given that some datasets are available only at specific resolutions. For STL-10, we use \(32\times 32\), \(64\times 64\) and \(96\times 96\) resolutions. For ImageNet-10 and ImageNet-Dogs, we use \(96\times 96\), \(160\times 160\) and \(224\times 224\) resolutions. The results are given in Table 11.

Table 11 Effects of different resolutions for STL-10, ImageNet-10 and ImageNet-Dogs

The best performance for ImageNet-10 and ImageNet-Dogs is obtained at a resolution of 160, and for STL-10, the best performance is obtained at a resolution of 96. It is not clear why ImageNet-10 and ImageNet-Dogs do not yield the best performance at high resolutions, and further investigation is needed; we keep this as an open problem.

5.4 Distribution of accuracies across the set of hyperparameters

Table 12 Hyperparameters and the range values used for the experiments
Table 13 Hyperparameters for obtaining maximum performance
Fig. 13
figure 13

Components of the consensus loss; ablation of STL-10 and CIFAR100-20

The proposed consensus loss has two parameters. The first is the number of transformations used, and the second is the dimensionality of the projection space. To understand the proposed loss, we conduct a detailed experimental study on STL-10 and CIFAR100-20 (see footnote 3). The hyperparameters used are given in Table 12.

Due to the sheer number of conducted experiments, we supply the summary statistics obtained on a random set. We report the empirical mean and standard deviation of the marginal distribution of the quantity under investigation. Let \(P_{\tau ,\eta ,d,l}\) be the joint distribution over the hyperparameters \(\tau \) (temperature parameter), l (learning rate), \(\eta \) (natural log of the number of transformations) and d (dimensionality of the projection space). We consider \(n_h\) as the number of distinct values used in the experiment for each hyperparameter \(h \in \{ \tau ,\eta ,d,l \}\) based on Table 12. We denote the accuracy from each experiment based on the hyperparameters used as \(a_{\tau ,\eta ,d,l}\). Let \(P_{h_i \vert h_j}\) be the conditional marginal distribution of hyperparameter \(h_i\) given \(h_j\) and the conditional empirical mean of \(P_{h_i \vert h_j}\) be \(m(P_{h_i \vert h_j})\). In this case, the conditional empirical mean \(m(P_{h_i \vert h_j})\) when \(h_i = d\) and \(h_j=\tau \) can be calculated using \(m(P_{d \vert \tau }) = \frac{1}{n_{\eta } \times n_{l}} \sum _{\eta } \sum _{l} a_{\tau ,\eta ,d,l}\). The conditional empirical means and standard deviations of other hyperparameters are calculated in the same way. In Fig. 13, we show each conditional empirical mean with a blue dot, and each red line around a dot represents one standard deviation. For both STL-10 and CIFAR100-20, we see a trend regarding the number of projections. For STL-10, the smaller the number of random projections, the better the results are, and for CIFAR100-20, increasing the number of random projections is helpful for improving the clustering accuracy up to some point. Note that when the number of random projections is equal to zero, our setting is equivalent to the baseline ID model, and our approach always performs better than ID. This means that the optimal number of random projections is greater than or equal to one. There is no such clear trend for the dimensionality of the random projections.
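A small sketch of this marginalization over a hypothetical grid of accuracies indexed as (\(\tau , \eta , d, l\)) is shown below; the grid values and sizes are illustrative, not our experimental results.

```python
import numpy as np

def conditional_mean(acc, i_axis, j_axis):
    """m(P_{h_i | h_j}): empirical mean of the accuracy over the two
    hyperparameters other than h_i and h_j; acc is a 4-D grid indexed as
    (tau, eta, d, l)."""
    other = tuple(ax for ax in range(acc.ndim) if ax not in (i_axis, j_axis))
    return acc.mean(axis=other)

# Example with a hypothetical accuracy grid over the ranges in Table 12:
acc = np.random.rand(4, 3, 3, 2)                            # (n_tau, n_eta, n_d, n_l), illustrative
m_d_given_tau = conditional_mean(acc, i_axis=2, j_axis=0)   # shape (n_tau, n_d)
```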

Fig. 14
figure 14

The dotted red lines show the accuracy of the baseline, i.e., ID Tao et al. (2021), on the corresponding dataset, and DE is the density estimate of the empirical distribution: (a) Empirical accuracy distribution for STL-10, (b) Empirical accuracy distribution for CIFAR100-20

The max-performance procedure provides some insights into the performance of the algorithms at hand, although it does not provide the whole picture because it does not consider the robustness of the performance differences. In Table 13, we give the hyperparameters that yield the max performance. Finding a hyperparameter set that yields better performance than the baseline is the core idea behind the max-performance procedure. We therefore ask the following question: given a hyperparameter grid, how likely is our method to achieve better accuracy than the baseline? In Fig. 14, we report the empirical accuracy distributions on STL-10 and CIFAR100-20 for all hyperparameters given in Table 12. The red dotted lines show the corresponding baseline accuracy for each dataset. For STL-10, only approximately \(12.5\%\) of the hyperparameter sets yield better results than the baseline. On the other hand, for CIFAR100-20, approximately \(90\%\) of the hyperparameter sets yield better results than the baseline. In other words, it does not require a significant amount of computational power to find a better model than the state-of-the-art models for CIFAR100-20; however, the situation is the opposite for STL-10. The results given in Fig. 14 suggest that when comparing models, multiple metrics need to be considered, not only the max-performance procedure.

5.5 Effect of architecture choice

Fig. 15
figure 15

Empirical distribution of the performance difference between Residual Network (ResNet)-50 and ResNet-18. DE is the density estimate of the empirical distribution

In this work, we use ResNet-18 and ResNet-50 as network architectures. For both ResNet-18 and ResNet-50, we sweep over the same set of hyperparameter choices, i.e., the temperature, number of projections and projection dimensionality, and report the results for the ImageNet-10 dataset with an image resolution of 160\(\times 160\). Figure 15 shows the distribution of \(\Delta _{acc}\), which is defined as the accuracy difference between ResNet-50 and ResNet-18. Figure 15 indicates that ResNet-50 slightly outperforms ResNet-18, i.e., the mean difference is approximately \(0.5\%\).

5.6 Runtime comparison

Fig. 16
figure 16

Comparison of the per-epoch runtimes of ID and ConCURL on the CIFAR-10 dataset. We vary the number of random transformations used in the computation of the consensus loss (shown in parentheses)

To study the runtime of the proposed method, we compare the time taken per epoch for the baseline ID algorithm and the proposed algorithm. Due to the additional loss computation, the time taken to run the proposed algorithm is higher, as can be observed from Fig. 16. The additional time is mainly due to computing the consensus loss for the different transformations. The current implementation computes the forward passes for the different transformations sequentially, thus increasing the runtime. However, a more time-efficient implementation, in which the forward passes for the different random transformations are computed in parallel, would make the runtime more comparable to that of the baseline ID algorithm.

6 Conclusion

In this work, we introduce different notions of the consistency constraints that are enforced in different unsupervised/self-supervised learning algorithms. We propose a novel clustering algorithm that seamlessly incorporates all three consistency constraints (exemplar, population and consensus) and achieves state-of-the-art clustering results for four out of five popular and challenging computer vision datasets. Our work on consensus clustering is significantly different from earlier consensus clustering works that do not learn representations. Moreover, we initiate a discussion on the adequacy of the currently used methods for evaluating clustering algorithms. We significantly extend the evaluation procedure for clustering algorithms, thereby reflecting the challenges of applying clustering to real-world tasks. We provide evaluation results for ConCURL and other state-of-the-art clustering algorithms based on max-performance criteria, according to which ConCURL outperforms other algorithms on most datasets. However, its average performance according to out-of-distribution criteria highlights the need to use the proposed evaluation methods for deep clustering algorithms.