1 Introduction

The field of artificial intelligence (AI) advanced significantly in the previous decade due to developments in deep learning (LeCun et al., 2015). In the early years of this field, deep learning methods exhibited stellar supervised learning performance, where each data sample was coupled with a ground truth (labeled data), e.g., each image was associated with a category. Unfortunately, generating labeled datasets is time consuming and expensive, and there may not be enough experts to label the data at hand (e.g., medical images). For such large-scale problems, a straightforward alternative is clustering, which requires no labels.

In this work, we focus on representation learning for the unsupervised learning task of clustering images. Clustering is a ubiquitous task and has been actively used in many different scientific and practical pursuits (Frey & Dueck, 2007; Masulli & Schenone, 1999; Jain et al., 1999; Xu & Wunsch, 2005). Clustering algorithms by themselves, however, do not learn representations and are hence limited to data for which a good representation is already available.

Advancements in deep learning techniques have enabled the end-to-end learning of rich image representations for supervised learning. For the purposes of clustering, however, such features learned via supervised learning cannot be obtained due to the lack of available labels. Therefore, supervised learning approaches fall short of providing a solution. Self-supervised learning addresses the issue of learning representations without labeled data. Self-supervised learning is a subfield of unsupervised learning in which the main goal is to learn general-purpose representations by exploiting user-defined tasks (pretext tasks) Wu et al. (2018); Zhuang et al. (2019); He et al. (2020); Chen et al. (2020); Grill et al. (2020); Caron et al. (2020). Representation learning algorithms have been shown to achieve good results when evaluated using a linear evaluation protocol, semisupervised training on ImageNet, or transfer to downstream tasks. A straightforward solution to the clustering problem is to use the features obtained via self-supervised learning and apply an out-of-the-box clustering algorithm (such as k-means) to compute data clusters. However, the performance of these features for clustering (using an out-of-the-box clustering algorithm) is not known, and as seen in our results, these features may be improved for clustering purposes.

On the other hand, deep clustering involves simultaneously learning cluster assignments and features using deep neural networks. Learning the feature space jointly with a clustering objective may lead to degenerate solutions, which until recently limited end-to-end implementations of clustering with representation learning approaches (Caron et al., 2018a). Subsequently, several works have been developed (Xie et al., 2016a; Caron et al., 2018a; Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a; Tao et al., 2021). We provide details regarding some of these works in Sect. 1.1. Our previous work (Regatti et al., 2021) showed some encouraging results, and we extend it substantially here. We categorize the current clustering and representation learning works based on the consistency constraints that are used to define their objective functions. We define an additional notion of consistency, consensus consistency, which ensures that representations are learned to induce similar partitions for variations in the representation space, different clustering algorithms or different initializations of a clustering algorithm. We use consensus consistency and propose an end-to-end learning approach that outperforms other end-to-end learning methods for image clustering. We summarize our contributions as follows:

  1. We introduce different notions of consistency (exemplar, population and consensus) that are used in unsupervised representation learning.

  2. We propose a novel clustering algorithm that incorporates the above three consistency constraints and can be trained in an end-to-end way. An ensemble is generated in the consensus clustering objective by performing random transformations on the underlying embeddings. Combining these methods is not trivial, and this combination, together with our new consensus loss, is novel.

  3. We show that the proposed algorithm ConCURL (consensus clustering with unsupervised representation learning) outperforms baselines on popularly used computer vision datasets when evaluated with clustering metrics.

  4. We demonstrate the clustering abilities of trained models under a data shift and argue for the need for different evaluation metrics for deep clustering algorithms.

  5. We study the impacts of various hyperparameters, data augmentation methods, and image resolutions on the clustering ability of the proposed algorithm.

1.1 Related work

1.1.1 Self-supervised learning

Self-supervised learning is used to learn representations in an unsupervised way by defining some pretext tasks. There are many different flavors of self-supervised learning, such as instance discrimination (ID) tasks (Wu et al., 2018; Zhuang et al., 2019) and contrastive techniques (He et al., 2020; Chen et al., 2020). In ID tasks, each image is considered its own category so that the learned embeddings are well separated (Wu et al., 2018). Building on the ID task, Zhuang et al. (2019) proposed a local aggregation (LA) method based on a robust clustering objective (using multiple runs of k-means) to move statistically similar data points closer in the representation space and dissimilar data points further away. In contrastive techniques such as simple contrastive learning of visual representations (SimCLR) (Chen et al., 2020) and momentum contrast (MoCo) (He et al., 2019), representations are learned by maximizing the agreement between different augmented views of the same data example (known as positive pairs) and minimizing the agreement between the augmented views of different examples (known as negative pairs). Recent works, including Bootstrap Your Own Latent (BYOL) (Grill et al., 2020) and swapping assignments between multiple views (SwAV) (Caron et al., 2020), have achieved state-of-the-art results without requiring negative pairs. Although self-supervised learning methods exhibit impressive performance on a variety of problems, it is not clear whether the learned representations are good for clustering.

1.1.2 Clustering with representation learning

DEC (Xie et al., 2016a) is one of the first algorithms to show that deep learning can be used to effectively cluster images in an unsupervised manner; this approach uses features learned from an autoencoder to fine-tune the cluster assignments. DeepCluster (Caron et al., 2018a) shows that it is possible to train deep convolutional neural networks (DeCNNs) in an end-to-end manner with pseudolabels that are generated by a clustering algorithm. Subsequently, several works (Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a) have introduced end-to-end clustering-based objectives and achieved state-of-the-art clustering results. For example, in the Gaussian attention network for image clustering (GATCluster) (Niu et al., 2020a), training is performed in two distinct steps (similar to Caron et al. (2018a)), where the first step is to compute pseudotargets for a large batch of data and the second step is to train the model in a supervised way using these pseudotargets. Both DeepCluster and GATCluster use k-means to generate pseudolabels, which may not scale well. Wu et al. (2019a) proposed deep comprehensive correlation mining (DCCM), where discriminative features are learned by taking advantage of the correlations among the data using pseudolabel supervision and the triplet mutual information among the features. However, DCCM may be susceptible to trivial solutions (Niu et al., 2020a). Invariant information clustering (IIC) (Ji et al., 2019a) maximizes the mutual information between the class assignments of two different views of the same image (paired samples) to learn representations that preserve the commonalities between the views while discarding instance-specific details. It has been argued that the presence of an entropy term in mutual information plays an important role in avoiding degenerate solutions. However, a large batch size is needed for the computation of mutual information in IIC; this process may not be scalable for larger image sizes, which are common in popular datasets (Ji et al., 2019a; Niu et al., 2020a). Huang et al. (2020a) extended the celebrated maximal margin clustering idea to the deep learning paradigm by learning the most semantically plausible clusters through the minimization of a proposed partition uncertainty index. Their partition confidence maximization (PICA) algorithm uses a stochastic version of this index, thereby facilitating minibatch training. PICA fails to assign a sample to the correct cluster when that sample has either high foreground or background similarity to samples in other clusters. In a more recent approach, contrastive clustering (Li et al., 2021), a contrastive learning loss (as in SimCLR (Chen et al., 2020)) was adopted along with an entropy term to avoid degenerate solutions. Similarly, Tao et al. (2021) combined ID (Wu et al., 2018) with novel softmax-formulated decorrelation constraints for representation learning and clustering. Their approach outperforms state-of-the-art methods and improves upon the instance discrimination method. Our method also improves upon ID and outperforms the method of Tao et al. (2021) on all datasets considered. There are other non-end-to-end approaches, such as SCAN (Van Gansbeke et al., 2020), which use the learned representations from a pretext task to find the images that are semantically closest to the given image using the nearest neighbors algorithm.
Similarly, SPICE (Niu et al., 2021), another state-of-the-art non-end-to-end approach, divides the clustering network into two parts: one to measure instance-level similarity and one to identify cluster-level discrepancy.

2 Consensus clustering

One of the distinguishing factors between supervised learning and unsupervised learning is the existence of ground truth labels that construct a global constraint based on examples. In most self-supervised learning methods, the ground truth is replaced with some consistency constraint (Chen et al., 2020). Without a doubt, the performance of any self-supervised method is a function of the power of the consistency constraint used. We define two types of consistency constraints: exemplar consistency and population consistency.

Definition 1

Exemplar consistency: Representation learning algorithms that learn closer representations (in terms of some distance metric) for different augmentations of the same data point are said to follow exemplar consistency.

Examples of the usage of exemplar consistency include contrastive learning methods such as MoCo (He et al., 2019) and SimCLR (Chen et al., 2020). In these methods, a positive pair of images is defined as any two image augmentations of the same image, and a negative pair consists of any two different images.

Definition 2

Population consistency: Representation learning algorithms that ensure that two similar data points, or any augmentations of the same data point, belong to the same cluster (or population) are said to follow population consistency.

DeepCluster (Caron et al., 2018a) is a prominent self-supervised method that utilizes population consistency (Definition 2) by enforcing a clustering constraint on the input dataset; note that each cluster contains data points that are similar to each other. Similarly, SwAV (Caron et al., 2020) is another example of a population consistency method.

Definition 3

Consensus consistency: Representation learning algorithms that are able to learn representations that induce similar partitions for variations in the given representation space (subsets of features, random projections, etc.), different clustering algorithms (k-means, Gaussian mixture models (GMMs), etc.) or different initializations of clustering algorithms are said to follow consensus consistency.

Earlier works on consensus consistency did not consider representation learning and used the knowledge reuse framework (see Strehl and Ghosh (2002), Ghosh and Acharya (2011)), where the cluster partitions were available (the features were irrelevant) or the features of the data were fixed. For example, Fern and Brodley (2003) successfully applied random projections to consensus clustering by performing k-means clustering on multiple random projections of the fixed features of input data. In contrast, the notion of consensus consistency here deals with learning representations that achieve a consensus regarding the cluster assignments of multiple clustering algorithms. One example of a method that enforces consensus consistency is LA (Zhuang et al., 2019). LA builds on the ID task (Wu et al., 2018) and was proposed as a method based on a robust clustering objective (using multiple runs of k-means) to move statistically similar data points closer in the representation space and dissimilar data points further away. However, Zhuang et al. (2019) did not evaluate the method with clustering metrics and only focused on linear evaluation using the learned features. Subsequently, we conducted a study to evaluate the clustering performance of these features (see Appendix) and observed that LA performed poorly when evaluated for clustering accuracy. In Definition 3, we inherently assume that the clustering algorithms under consideration have been tuned properly. Unfortunately, the definition of consensus consistency is ill posed, and there can be arbitrarily many different partitions that can satisfy the given condition (see footnote 1). We show that when exemplar consistency is used as an inductive bias, the resulting objective function achieves impressive performance on challenging datasets. Combining the exemplar and population constraints with consensus consistency seamlessly and effectively for clustering is the basis of our proposed method.

2.1 Loss for consensus and population consistency

We focus on learning generic representations that satisfy Definition 3 for clustering. By using different clustering algorithms or different representation variations (such as projections), one can easily generate multiple different partitions of the same data. In unsupervised learning, it is not known which partitioning is correct. To tackle this problem, some additional assumptions are needed.

We assume that there is an underlying latent space \(\mathcal {Z}^*\) (possibly not unique) such that all clusterings (based on latent space, algorithm or initialization variations) that take input data from this latent space produce similar data partitions. Furthermore, every clustering algorithm that also takes the true number of clusters as input produces the partition that is closest to the hypothetical ground truth. Moreover, we assume that there exists a function \(h:\mathcal {X} \rightarrow \mathcal {Z}^*\), where \(\mathcal {X}\) represents the input space and \(\mathcal {Z}^*\) represents the underlying latent space. We call this assumption the principle of consensus. The open question is how one constructs an efficient loss that reflects the principle of consensus. We define one such way below.

Given an input batch of images \(\mathcal {X}_b\subset \mathcal {X}\), the goal is to partition these images into K clusters. We obtain p views of these images (by different image augmentation approaches) and define a loss such that cluster assignment of any of the p views matches the target estimated from any other view. Without loss of generality, we define a loss for \(p = 2\) views. The two views \(\mathcal {X}_b^1, \mathcal {X}_b^2\) are generated using two randomly chosen image augmentations.

We learn a representation space \(\mathcal {Z}_0\) at the end of every training iteration and obtain M variations of \(\mathcal {Z}_0\) as \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) (e.g., random projections). The goal is to build an efficient loss according to the principle of consensus among \(\mathcal {Z}_0\) and its M variations \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) such that we learn the latent space \(\mathcal {Z}^*\) at the end of training (i.e., the learned features lie in the latent space described above). For a given batch of images \(\mathcal {X}_b\) and a representation space \(\mathcal {Z}_m, \forall m \in [1,...,M]\), we denote the cluster assignment probability of image i and cluster j for view 1 as \(\textbf {p}_{i,j}^{1}(\mathcal {Z}_m)\) and that for view 2 as \(\textbf {p}_{i,j}^{2}(\mathcal {Z}_m)\). We concisely use \(\tilde{\textbf {p}}^{(1,m)},\tilde{\textbf {p}}^{(2,m)}\) when we talk about all the images and all the clusters. Here, we define a loss that incorporates “population consistency" and “consensus consistency". We assume that the target cluster assignment probabilities for the representation \(\mathcal {Z}_0\) are given (as in DeepCluster (Caron et al., 2018a)), and they are denoted as \(\textbf {q}_{i,j}^{1}\) for view 1 and \(\textbf {q}_{i,j}^{2}\) for view 2.

We define the loss for any representation space \(\mathcal {Z}\) and batch of images \(\mathcal {X}_b\) as

$$\begin{aligned} L_{\mathcal {Z}_m}^1 &= - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{2}_{ij} \log \textbf {p}^{1}_{ij}(\mathcal {Z}_m), \\ L_{\mathcal {Z}_m}^2 &= - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{1}_{ij} \log \textbf {p}^{2}_{ij}(\mathcal {Z}_m), \\ L_{\mathcal {Z}} &= \sum _{m = 1}^M \Big ( L_{\mathcal {Z}_m}^1 + L_{\mathcal {Z}_m}^2 \Big ). \end{aligned}$$
(1)

Note that here, consensus among the clustering results is enforced via the common targets \(\textbf {q}\). An overview of the procedure is shown in Fig. 1. The exact details regarding how to obtain variations of \(\mathcal {Z}_0\) and calculate the cluster assignment probabilities \(\textbf {p}\) and targets \(\textbf {q}\) are described in the next section.
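To make the loss concrete, the following is a minimal sketch (in PyTorch-style Python; names and shapes are ours) of how Eq. (1) could be computed, assuming the per-view assignment probabilities and the targets are already available as tensors.

```python
import torch

def consensus_loss(p_view1, p_view2, q_view1, q_view2):
    """Sketch of the consensus loss in Eq. (1).

    p_view1, p_view2: lists of M tensors of shape (B, K) with the predicted
        assignment probabilities p(Z_m) of views 1 and 2 for each of the M
        variations of the representation space.
    q_view1, q_view2: tensors of shape (B, K) with the target codes q
        computed for the un-transformed space Z_0.
    """
    B = q_view1.shape[0]
    total = torch.zeros(())
    for p1, p2 in zip(p_view1, p_view2):
        # Swapped prediction: the targets of one view supervise the other view.
        l1 = -(q_view2 * torch.log(p1 + 1e-8)).sum() / (2 * B)
        l2 = -(q_view1 * torch.log(p2 + 1e-8)).sum() / (2 * B)
        total = total + l1 + l2
    return total
```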

Fig. 1
figure 1

An illustration of the consensus loss part of ConCURL

2.2 End-to-End Stochastic Gradient Descent (SGD)-Based trainable consensus loss

In this section, we propose an end-to-end trainable algorithm and define a way to compute \(\textbf {p}\) and \(\textbf {q}\). When the cluster assignment probabilities \(\textbf {p}\) can take any values in the set [0, 1], we refer to the process as soft clustering, and when \(\textbf {p}\) is restricted to the set \(\{0,1\}\), we refer to the process as hard clustering.

Without loss of generality, in this paper, we focus on soft clustering, which makes it easier to define a loss function using the probabilities and update the parameters using the gradients to enable end-to-end learning. We follow the soft clustering framework presented in SwAV (Caron et al., 2020), which is a centroid-based technique that aims to maintain consistency between the clusterings of the augmented views \(\mathcal {X}_b^{1}\) and \(\mathcal {X}_b^{2}\). We store a set of randomly initialized prototypes \(C_0=\{ \textbf {c}_0^1,\cdots ,\textbf {c}_0^K \} \in \mathbb {R}^{d\times K}\), where K is the number of clusters and d is the dimensionality of the prototypes. These prototypes are used to represent clusters and define a “consensus consistency" loss. We compute M variations of \(C_0\) as \(C_1,...,C_M\) exactly as we compute the M variations of \(\mathcal {Z}_0\).

2.2.1 Cluster assignment probability \(\textbf {p}\)

We use a two-layer multilayer perceptron (MLP) g to project the features \(\textbf {f}^1 = f_\theta (\mathcal {X}_b^1)\) and \(\textbf {f}^2 = f_\theta (\mathcal {X}_b^2)\) to a lower-dimensional space \(\mathcal {Z}_0\) (of size d). The outputs of this MLP (referred to as cluster embeddings) are denoted as \({Z}_0^1 = \{\textbf {z}_0^{1,1}, \ldots , \textbf {z}_0^{1,B} \}\) and \({Z}_0^2 = \{\textbf {z}_0^{2,1}, \ldots , \textbf {z}_0^{2,B} \}\) for view 1 and view 2, respectively. Note that \(h: \mathcal {X} \rightarrow \mathcal {Z}\) defined in Sect. 2.1 is equivalent to the composite function of \(f: \mathcal {X} \rightarrow \Phi \) and \(g: \Phi \rightarrow \mathcal {Z}\).

For a latent space \(\mathcal {Z}\), we compute the probability of assigning a cluster j to image i using the normalized vectors \(\bar{\textbf {z}}^{1,i} = \frac{\textbf {z}^{1,i}}{\Vert \textbf {z}^{1,i}\Vert }\), \(\bar{\textbf {z}}^{2,i} = \frac{\textbf {z}^{2,i}}{\Vert \textbf {z}^{2,i}\Vert }\) and \(\bar{\textbf {c}}_j = \frac{{\textbf{c}}^j}{\Vert {\textbf{c}}^j\Vert }\) as

$$\begin{aligned} \textbf {p}_{i,j}^{1}(\mathcal {Z},C) = \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j'} \rangle \right) }, \qquad \textbf {p}_{i,j}^{2}(\mathcal {Z},C) = \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j'} \rangle \right) }. \end{aligned}$$
(2)

We concisely write \( \textbf {p}^1_{i}(\mathcal {Z}) = \{ \textbf {p}^1_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \) and \( \textbf {p}^2_{i} = \{ \textbf {p}^2_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \). Here, \(\tau \) is a temperature parameter, and we set its value to 0.1, similar to Caron et al. (2020). Note that we use \(\textbf {p}_{i}\) to denote the predicted cluster assignment probabilities for image i (when not referring to a particular view), and the shorthand notation \(\textbf {p}\) is used when i is clear from context.
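The assignment probabilities of Eq. (2) amount to a temperature-scaled softmax over cosine similarities between cluster embeddings and prototypes. A minimal sketch, assuming the embeddings and prototypes are given as tensors (names are ours), is shown below.

```python
import torch
import torch.nn.functional as F

def assignment_probs(z, c, tau=0.1):
    """Eq. (2): soft cluster-assignment probabilities for one view.

    z: (B, d) cluster embeddings of a batch (one augmented view).
    c: (K, d) prototype vectors.
    Returns a (B, K) matrix whose rows sum to one.
    """
    z = F.normalize(z, dim=1)        # z / ||z||
    c = F.normalize(c, dim=1)        # c / ||c||
    logits = z @ c.t() / tau         # cosine similarities scaled by 1/tau
    return F.softmax(logits, dim=1)
```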

2.2.2 Targets \(\textbf {q}\)

The idea of predicting the assignments \(\textbf {p}\) and then comparing them with the high-confidence estimates \(\textbf {q}\) (referred to as codes henceforth) of the predictions was proposed by Xie et al. (2016a). While Xie et al. (2016a) used pretrained features (from autoencoders) to compute the predicted assignments and the codes, the use of their approach in an end-to-end unsupervised manner might lead to degenerate solutions. Asano et al. (2019) avoided such degenerate solutions by enforcing an equipartition constraint (the prototypes equally partitioned the data) during code computation using the Sinkhorn-Knopp algorithm (Cuturi, 2013). Caron et al. (2020) followed a similar formulation but computed the codes for the two views separately in an online manner for each minibatch. The assignment codes are computed by solving the following optimization problem:

$$\begin{aligned} Q^1 &= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^1) + \epsilon H(Q), \\ Q^2 &= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^2) + \epsilon H(Q), \end{aligned}$$
(3)

where \( Q = \{\textbf {q}_1, \ldots , \textbf {q}_B \} \in \mathbb {R}_{+}^{K\times B}\), \(\mathcal {Q}\) is the transportation polytope defined by

$$\begin{aligned} \mathcal {Q} = \{\textbf {Q}\in \mathbb {R}^{K\times B}_{+}~\text {s.t}~ \textbf {Q}\textbf {1}_B = \frac{1}{K}\textbf {1}_K, \textbf {Q}^T\textbf {1}_K = \frac{1}{B}\textbf {1}_B \} \end{aligned}$$

Here, \(\textbf {1}_K\) is the K-dimensional vector of ones, and \( H(Q) = -\sum _{i,j}Q_{i,j}\log Q_{i,j} \) is the entropy of Q. The above optimization problem is solved using a fast version of the Sinkhorn-Knopp algorithm (Cuturi, 2013), as described by Caron et al. (2020).
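For illustration, the following is a sketch of a SwAV-style Sinkhorn-Knopp iteration for computing the codes; the regularization strength and iteration count shown are assumptions for illustration, not values prescribed by our method.

```python
import torch

@torch.no_grad()
def sinkhorn_codes(scores, eps=0.05, n_iters=3):
    """Sketch of the fast Sinkhorn-Knopp iteration for the codes Q of Eq. (3).

    scores: (K, B) matrix of prototype-embedding similarities (C_0^T Z_0).
    eps, n_iters: entropy regularization and iteration count (illustrative).
    Returns a (B, K) matrix of codes q.
    """
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()                              # joint probability matrix
    K, B = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True) / K   # rows sum to 1/K (equipartition)
        Q = Q / Q.sum(dim=0, keepdim=True) / B   # columns sum to 1/B
    return (Q * B).t()                           # each row is a distribution over clusters
```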

After computing the codes \(Q^1 \) and \(Q^2\), to maintain the consistency between the clustering results of the augmented views, the loss is computed using the probabilities \(\textbf {p}_{ij}\) and the assigned codes \(\textbf {q}_{ij}\) by comparing the probabilities of view 1 with the assigned codes of view 2 and vice versa, as in (1).

2.2.3 Defining variations of \(Z_0\) and \(C_0\)

To compute \(\{Z_1,...,Z_M \}\), we project the d-dimensional space \(Z_0\) to a D-dimensional space using a random projection matrix. We follow the same procedure to compute \(\{C_1,...,C_M \}\) from \(C_0\). At the beginning of the algorithm, we randomly initialize M such transformations and fix them throughout training. Suppose that by using a particular random transformation (a randomly generated matrix A), we obtain \(\tilde{\textbf {z}} = A\textbf {z},\; \tilde{\textbf {c}} = A\textbf {c}\). We then compute the softmax probabilities using the normalized vectors \(\tilde{\textbf {z}}/\Vert \tilde{\textbf {z}}\Vert \) and \(\tilde{\textbf {c}}/\Vert \tilde{\textbf {c}}\Vert \). Repeating this step for each of the M transformations results in M predicted cluster assignment probabilities for each view. When the network is untrained, the embeddings \(\textbf {z}\) are random, and applying the random transformations, followed by computing the predicted cluster assignments, leads to a diverse set of soft cluster assignments. The network weights are updated using the stochastic gradients of the loss.
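A minimal sketch of this step is given below; the matrix scaling convention and the shapes are our illustrative choices, and the transformations are drawn once and then kept frozen.

```python
import torch
import torch.nn.functional as F

def make_random_projections(M, d, D, seed=0):
    """Draw M fixed random projection matrices (d -> D) once at the start of
    training; they stay frozen afterwards (M, d, D are illustrative)."""
    g = torch.Generator().manual_seed(seed)
    return [torch.randn(D, d, generator=g) / D ** 0.5 for _ in range(M)]

def transformed_assignment_probs(z, c, projections, tau=0.1):
    """For each fixed projection A, map embeddings z (B, d) and prototypes
    c (K, d) to the transformed space, re-normalize, and compute the softmax
    cluster-assignment probabilities as in Eq. (2)."""
    probs = []
    for A in projections:
        z_m = F.normalize(z @ A.t(), dim=1)      # z_tilde / ||z_tilde||
        c_m = F.normalize(c @ A.t(), dim=1)      # c_tilde / ||c_tilde||
        probs.append(F.softmax(z_m @ c_m.t() / tau, dim=1))
    return probs                                  # M tensors of shape (B, K)
```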

2.2.4 Backbone loss


To better capture exemplar consistency, based on previous evidence of successful clustering with the ID approach (Tao et al., 2021), we use ID (Wu et al., 2018) as one of the losses, as in Tao et al. (2021). The exemplar objective of ID is to classify each image as its own class.

Given n images and a neural network \(f_{\theta }\) for calculating features, we first normalize the features \(\bar{f}_{\theta }(x) = \frac{f_{\theta }(x)}{\Vert f_{\theta }(x) \Vert }\). Then, ID defines the probability of an example x being recognized as the i-th example as

$$\begin{aligned} P(i \vert f_{\theta }(x)) = \frac{\exp \left( \langle \bar{f}_{\theta }(x_i), \bar{f}_\theta (x) \rangle / \tau \right) }{\sum _{j=1}^n \exp \left( \langle \bar{f}_{\theta }(x_j), \bar{f}_\theta (x) \rangle / \tau \right) }. \end{aligned}$$
(4)

ID then uses the uniform distribution as a noise distribution \(P_n = \frac{1}{n}\) to compute the probability that data example x comes from a data distribution \(P_d\) as opposed to the noise distribution \(P_n\) as \(h(i, f_{\theta }(x)) := \frac{P(i\vert f_{\theta }(x))}{P(i\vert f_{\theta }(x)) + m P_n(i)}\). Assuming that the noise samples are m times more frequent than actual data samples, the ID loss is defined as

$$\begin{aligned} L_{b} = - E_{P_d} \left[ \log h(i, x)\right] - m\, E_{P_n} \left[ \log (1 - h(i, x')) \right] , \end{aligned}$$
(5)

where \(x'\) is the feature from a randomly drawn image other than image x in a given dataset. We exactly follow the framework developed in Wu et al. (2018) to implement the ID loss.
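As a rough sketch of Eq. (4) and the posterior \(h\) defined above, assuming a precomputed feature bank and omitting the memory-bank bookkeeping of the full implementation in Wu et al. (2018):

```python
import torch
import torch.nn.functional as F

def id_probability(bank, x_feat, tau=0.1):
    """Eq. (4): probability of recognizing example x as each instance i,
    given a bank of n instance features (n, d) and the feature f(x) of x."""
    bank = F.normalize(bank, dim=1)
    x_feat = F.normalize(x_feat, dim=0)
    return F.softmax(bank @ x_feat / tau, dim=0)   # length-n vector P(i | f(x))

def nce_posterior(p_i, n, m):
    """h(i, f(x)): posterior that the pair (i, x) comes from the data
    distribution rather than the uniform noise P_n = 1/n, with noise samples
    assumed m times more frequent than data samples."""
    return p_i / (p_i + m / n)
```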

The final loss that we seek to minimize is the combination of the losses \(L_{\mathcal {Z}}\) (Eq. 1) and \(L_b\) (Eq. 5),

$$\begin{aligned} L_{\text {total}} = \alpha L_{\mathcal {Z}} + \beta L_b, \end{aligned}$$
(6)

where \(\alpha , \beta \) are nonnegative constants. Details of the algorithm are given in Algorithm 1, and we also provide a PyTorch-style pseudocode in Algorithm 2 in the Appendix.

Algorithm 1 (figure)

2.2.5 Computing the cluster metrics

In this section, we describe the approach used to compute the cluster assignments and the metrics chosen to evaluate their quality. Note that we assume that the number of true clusters (K) in the data is known.

There are two ways to compute the cluster assignments. The first way is to use the embeddings generated by the backbone; here, the embeddings are the outputs of the ID block \(f_{\theta }(x)\). The embeddings of all the images are computed, and then we perform k-means clustering.

The second method is to use the soft clustering block to compute the cluster assignments. It is sufficient to use the computed probability assignments \(\{\textbf {p}_i\}_{i=1}^N\) or the computed codes \(\{\textbf {q}_i\}_{i=1}^N\) and assign the cluster index as \(c_i = \arg \max _{k} \textbf {q}_{ik}\) for the \(i^{\text {th}}\) data point. Once the model is trained, in this second approach, cluster assignment can be performed online without requiring the computation of the embeddings of all the input data.

We evaluate the quality of the clusterings using metrics such as the cluster accuracy, normalized mutual information (NMI), and adjusted Rand index (ARI). To compute the clustering accuracy, we are required to solve an assignment problem (computed using a Hungarian match  (Kuhn, 1955, 1956)) between the true class labels and the cluster assignments. In our analysis, we observe that using k-means with the embeddings produced by the ID block achieves better clustering accuracy, and we use this method throughout the paper while evaluating our proposed algorithm.
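A minimal sketch of this evaluation, using off-the-shelf k-means and the Hungarian assignment from SciPy (function and variable names are ours), is given below.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def cluster_accuracy(labels, embeddings, K):
    """Cluster the backbone embeddings with k-means and score the clustering
    accuracy via a Hungarian match between cluster indices and class labels
    (labels are assumed to be integers in {0, ..., K-1})."""
    pred = KMeans(n_clusters=K, n_init=10).fit_predict(embeddings)
    # cost[i, j] = number of points in cluster i carrying true label j
    cost = np.zeros((K, K), dtype=np.int64)
    for p, t in zip(pred, labels):
        cost[p, t] += 1
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize matched counts
    return cost[row, col].sum() / len(labels)
```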

2.3 Generating multiple clustering results

Fred and Jain (2005) discussed different ways to generate cluster ensembles; these methods are tabulated in Table 1. In our proposed algorithm, we focus on choosing the appropriate data representation to generate cluster ensembles.

Table 1 Different ways to generate ensembles

By fixing a stable clustering algorithm, we can generate arbitrarily large ensembles by applying different transformations on the embeddings. Random projections were previously successfully used in consensus clustering (Fern and Brodley, 2003). By generating ensembles using random projections, we have control over the amount of diversity we can induce into the framework by varying the dimensionality of the random projection. In addition to random projections, we also use diagonal transformations (Hsu et al., 2018) where different components of the representation vector are scaled differently. Hsu et al. (2018) illustrated that such scaling enables a diverse set of clusterings, which is helpful for the meta learning task. We study ablations over the number of transformations needed and the dimensions of these transformations in Sect. 5.
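As an illustration of the diagonal-transformation variant mentioned above, the following sketch draws M random per-coordinate scalings; the uniform scaling range used here is our assumption for illustration only.

```python
import numpy as np

def diagonal_transforms(M, d, seed=0):
    """M random diagonal scalings in the spirit of Hsu et al. (2018): each
    transform rescales every coordinate of the embedding independently,
    giving a cheap source of clustering diversity."""
    rng = np.random.default_rng(seed)
    return [np.diag(rng.uniform(0.5, 1.5, size=d)) for _ in range(M)]
```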

3 Understanding the consensus objective

We investigate a potential hypothesis regarding “training driven by noisy cluster assignments” that can shed light on the success of ConCURL (see footnote 2). The hypothesis stems from the following intuition. Using different clustering algorithms, the generated cluster assignments are noisy versions of the hypothetical ground truth; as the training process progresses, the noise in the cluster assignments is reduced, and eventually all the different clustering algorithms considered generate similar cluster assignments.

We verify this hypothesis empirically with the help of the following experiments on the STL-10 dataset: (i) we observe the noisy clusterings generated by using random projections, and (ii) we verify that the noise in the cluster assignments is reduced as training progresses.

For the purpose of demonstrating noisy cluster assignments, we use synthetic data as follows. We generate three clusters in \(\mathbb {R}^2\), as shown in Fig. 2a, and compute the centroids of each cluster. Here, the centroids act as the prototypes. We then generate a Gaussian random projection matrix \(A \in \mathbb {R}^{2\times 2}\). We first normalize the embeddings (2-dimensional features) and the centroids (see Fig. 2b). Using the matrix A, we transform both the embeddings and the prototypes to the new space and normalize the resultant vectors.

Fig. 2
figure 2

a: Three-cluster synthetic dataset (embeddings \(\textbf {z}\)), b: Normalized data

We follow the soft clustering framework discussed earlier and compute the soft cluster assignments for the original and transformed data. We observe that the cluster assignment probabilities in the new space are noisy versions of the cluster assignment probabilities in the original space (see Table 2).
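A compact sketch of this synthetic experiment is given below; the cluster locations, spread, and temperature are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated clusters in R^2 (synthetic data as in Fig. 2a) and
# their centroids, which act as the prototypes.
centers = np.array([[0.0, 5.0], [5.0, 0.0], [-5.0, -5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
prototypes = np.vstack([X[i * 100:(i + 1) * 100].mean(axis=0) for i in range(3)])

def soft_assign(Z, C, tau=0.1):
    """Softmax cluster-assignment probabilities over normalized embeddings."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    logits = Zn @ Cn.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

A = rng.normal(size=(2, 2))                       # Gaussian random projection
p_orig = soft_assign(X, prototypes)               # assignments in the original space
p_proj = soft_assign(X @ A.T, prototypes @ A.T)   # noisy versions after the projection
```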

Table 2 Predicted cluster assignment probabilities and target probabilities obtained from the Sinkhorn algorithm for four data points

To verify that the noise in the cluster assignment probabilities is reduced as training progresses, we perform the following experiment. We measure the similarity among the cluster assignments at every epoch to observe the effect of consensus as training progresses. For each random projection used, we use the cluster assignment probabilities \(\tilde{\textbf {p}}\) and compute cluster assignments by taking an \(\mathop {\hbox {arg max}}\limits \) on \(\tilde{\textbf {p}}\) for each image. We obtain M such cluster assignments due to the M random projections. We then compute a pairwise NMI (similar to the analysis of Fern and Brodley (2003)) between every two cluster assignments and compute the average and standard deviation of the pairwise NMI values across the \(\frac{M(M-1)}{2}\) pairs. An NMI score of 1.0 signifies that the two clusters perfectly correlate with each other, and a score of 0.0 implies that the two clusters are uncorrelated. We observe from Fig. 3 that the pairwise NMI increases as training progresses and becomes closer to 1. At the beginning of training, the cluster assignments are very diverse (small NMI scores with a large standard deviation), and as training progresses, the diversity is reduced (large NMI scores with a smaller standard deviation). This observation leads us to conclude that for the applied clustering algorithms (defined using different random projections), we have learned an embedding space where the different cluster assignments concur. In other words, “consensus consistency" is achieved. Additionally, it is evident from our empirical results in Sect. 4 that we achieve an improved overall clustering accuracy.
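The pairwise NMI statistic can be computed with scikit-learn as in the following sketch, where each entry of `assignments` is the length-N hard cluster assignment obtained from one of the M random projections.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def pairwise_nmi(assignments):
    """Mean and standard deviation of the NMI over all M(M-1)/2 pairs of the
    M hard cluster assignments (each entry is a length-N label vector)."""
    scores = [normalized_mutual_info_score(a, b)
              for a, b in combinations(assignments, 2)]
    return np.mean(scores), np.std(scores)
```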

Fig. 3
figure 3

Pairwise NMI values as a way to measure the diversity in the ensemble; the results are obtained for the STL-10 dataset, and the pairwise NMI values for different random projection dimensions are shown (the original dimensionality of \(\textbf {z}\) is 256)

If noisy cluster assignments are the reason behind the improved performance, one might wonder if it is sufficient to simply add noise to the original cluster assignments rather than computing multiple cluster assignments. However, this may not be fruitful: if noise is added externally, one must define a scheduler to reduce the noise as training progresses. In contrast, in ConCURL, the end-to-end learning algorithm itself determines the rate of consensus or agreement between \(\textbf {p} \) and \(\tilde{\textbf {p}}\). In the next section, we provide empirical evidence of the effectiveness of our method.

4 Empirical evaluation

Evaluating clustering algorithms is a notoriously hard problem. The reference text Jain and Dubes (1988) states the following: The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.

In the literature on representation learning for clustering, e.g., Li et al. (2021); Huang et al. (2020b); Tao et al. (2021), the following methodology has been used to evaluate the performance of different algorithms: a set of models with different hyperparameters is trained, the models are sorted by their observed clustering performance, and the best model's results are reported. This methodology is called max performance in the remainder of the paper. We assess the quality of the learned embeddings by using five challenging image datasets for clustering and report their performance with the max-performance strategy. Although the max-performance procedure provides some insights into the performance of the method under consideration, we provide additional insights by significantly extending the evaluations. In practice, it is desirable that the learned models can be utilized for datasets other than the training dataset. However, the max-performance method may not be suitable for this purpose. To address this, we design two additional experiments that focus on the performance of cross-model features under a distribution shift. Furthermore, we also assess the quality of the learned embeddings in image retrieval tasks. Finally, we present a detailed ablation study to assess the impact of the loss terms, data augmentation methods, hyperparameters and architecture choices utilized to obtain a more complete picture.

4.1 Image clustering with the max-performance strategy

We evaluate our algorithm and compare it with existing methods on some popular image datasets, namely, ImageNet-10, ImageNet-Dogs, STL-10, CIFAR-10, and CIFAR100-20.

For CIFAR100-20, we use the 20 meta classes as the class labels while evaluating the clustering results. For STL-10, similar to the earlier PICA (Huang et al., 2020a) and GATCluster (Niu et al., 2020a) approaches, we use both training and testing splits for training and evaluation. Note that PICA also uses an unlabeled data split with 100k points in STL-10, which we do not use. ImageNet-10 and ImageNet-Dogs are subsets of ImageNet, and we use only the training splits for these two datasets (Deng et al., 2009). We use the same classes as Chang et al. (2017a) for evaluation on the ImageNet-10 and ImageNet-Dogs datasets. The dataset summary is given in Table 3 and methods compared are given in Table 4. We evaluate the cluster accuracy, NMI, and ARI of each computed cluster assignment (see the Appendix for details).

Table 3 Dataset summary
Table 4 Methods compared

4.1.1 Comparison with state-of-the-art baselines

In our comparison, we consider some state-of-the-art methods that were developed for image clustering problems and targeted for end-to-end training scenarios with random initialization. We should note that we do not consider baselines that use prior information, e.g., the nearest neighbors algorithm derived by using pretrained models. The implementation details of ConCURL are provided in the Appendix, and the results are presented in Table 5.

Table 5 Clustering with the max-performance strategy

We observe that ConCURL outperforms the baseline algorithms considered in terms of all three metrics for all the datasets except STL-10. ConCURL improves the state-of-the-art clustering accuracy by approximately \(17.5\%\) on ImageNet-Dogs, by \(12.7\%\) on CIFAR100-20 and by \(3.8\%\) on CIFAR-10. Although ConCURL improves upon the results of ID (Tao et al., 2021), please note that ID is the backbone used in this paper and is slightly worse than IDFD, as shown in Tao et al. (2021).

The proposed method achieves good clustering performance on popular computer vision datasets. Similar to all the algorithms considered, we assume that K, the number of clusters, is known. However, this may not hold true in practice in real-world applications. In such a case, we may assume an estimate for the upper bound on the number of clusters to use as the number of prototypes. Additionally, we also assume that the dataset is equally distributed among the K clusters. If this assumption (also common in the literature (Huang et al., 2020a; Niu et al., 2020a)) does not hold, the fast Sinkhorn-Knopp algorithm used to solve Eq. 3 may not be optimal.

4.1.2 Performance on the Test split

In the previous section, we studied the clustering performance on the training split used to train the algorithm. Here, we evaluate the clustering performance on a held-out test set. We use the standard test splits of CIFAR-10, CIFAR100-20, ImageNet-10, and ImageNet-Dogs. We use the trained models to extract the features for each test dataset and compute the clustering as above. We observe from Table 6 that the performance is not affected much on the test set. This shows that the algorithm extracts feature representations that cluster well on data not used for training, i.e., it generalizes well when the data are drawn from the same distribution.

Table 6 Clustering on the test dataset

4.1.3 Class-specific accuracy

We present the class-specific accuracies (percentage) and confusion matrices in Fig. 4. In each row i, the \(j^{th}\) entry in the matrix represents the percentage of samples from category i belonging to the cluster of category j. For better visualization, we round each percentage to the nearest integer; as a result, the row sums may not equal 100. For perfect clustering, all elements along the diagonal should be equal to 100. Here, we note some interesting observations. For ImageNet-10, the airliner category shows the worst performance, with 7% of airliner samples being confused with the airship category. Additionally, 4% of the samples from the orange category are categorized as soccer balls. In ImageNet-Dogs, there are three types of spaniels (Blenheim spaniels, Brittany spaniels and Welsh springer spaniels) that look very similar to each other. Nineteen percent of the samples from the Blenheim spaniel category are categorized as Brittany spaniels, and 5% are categorized as Welsh springer spaniels. Similarly, 16% of the samples from the Brittany spaniel category are categorized as Blenheim spaniels, and 16% more are categorized as Welsh springer spaniels. Forty-four percent of the samples from the Welsh springer spaniel category are categorized as Brittany spaniels, and 15% are categorized as Blenheim spaniels. Kelpies and Dobermans are also confused with each other, where 39% of the kelpie samples are categorized as Dobermans, and 29% of the Doberman samples are categorized as kelpies. For CIFAR-10, 33% of the samples from the dog category are categorized as cats, and 16% of the samples from the cat category are categorized as dogs.

Fig. 4
figure 4

Confusion matrices

For STL-10, none of the samples from the cat category are categorized correctly. Even though STL-10 and CIFAR-10 have the same list of categories, STL-10 seems harder to cluster than CIFAR-10. Note that STL-10 has only 13000 images to learn representations, while in CIFAR-10, 60000 images are used. For CIFAR100-20, none of the samples from aquatic mammals are categorized correctly. Thirty-seven percent of the samples from aquatic mammals are categorized as fish, and 19% are categorized as large omnivores and herbivores. In the case of reptiles, only 15% of the examples are categorized correctly, but 23% are categorized as fish, 16% as large carnivores and 13% as insects. Surprisingly, 28% of the samples from trees are categorized as aquatic mammals.

4.2 Out-of-distribution results

In this section, we evaluate ConCURL by performing clustering on datasets that are not used during training. We focus mainly on studying the clustering performance on datasets that may be similar to the training dataset and datasets that may have a different number of clusters than the training dataset.

4.2.1 Cross-model accuracy

Here, we calculate the clustering performance achieved when the model is trained on one dataset but evaluated on a different dataset that may have a different number of clusters. For example, the first row in Table 7 gives the performance of the model trained on ImageNet-10 and evaluated on both ImageNet-10 and ImageNet-Dogs. Similarly, the second row shows the performance of the model trained on ImageNet-Dogs. We find that performance on ImageNet-10 is decreased to 35.6% when the model trained on ImageNet-Dogs is used instead of the model trained on ImageNet-10. Similarly, the performance on ImageNet-Dogs is decreased to 17.7 % when the model trained on ImageNet-10 is used instead of the model trained on ImageNet-Dogs. Table 8 provides the same performance metrics for CIFAR-10 and CIFAR100-20.

For the cross-model performance to be high, the embedding function must generalize to the out-of-distribution dataset. It is important to observe that for each pair of datasets considered, the distributions of the datasets are very different because the classes are completely different in both cases (ImageNet-10 vs ImageNet-Dogs, and CIFAR-10 vs CIFAR100-20). However, since we are considering datasets with a small number of data points and a small number of classes (see Table 3), the representation power of the learned embeddings is limited, and this affects the cross-model accuracy. Moreover, the consensus loss \(L_{\mathcal {Z}}\) assumes knowledge of the number of clusters in the dataset. Therefore, the embeddings learned by optimizing the \(L_{\text {total}}\) loss on one dataset may be suboptimal for clustering a dataset with a different number of clusters.

It is clear that these performance drops are significant, and the generalization performance of the learned embeddings needs to be assessed by taking out-of-distribution datasets into account. However, since there are only 2 groups of different datasets, it is difficult to reach a definitive conclusion. Hence, in the following section, we propose a new evaluation methodology that sheds light on the out-of-distribution performance of the learned embeddings.

Table 7 ImageNet-10 vs. ImageNet-Dogs: cross-model performance
Table 8 CIFAR-10 vs. CIFAR100-20: cross-model performance

4.2.2 ImageNet random-10 and random-15 accuracies

Here, we compare the baseline model trained with ID (Wu et al., 2018) with our proposed method ConCURL. We randomly sample 10 and 15 classes from the 1000-class ImageNet data and evaluate the clustering accuracy obtained on the training split of the data using the model trained on the original ImageNet-10 and ImageNet-Dogs sets. We repeat the process 100 times for both the 10-class and 15-class datasets and call them the random-10 and random-15 datasets, respectively. Note that we do not retrain the model on the randomly sampled dataset; we only evaluate the model on this set. We show the histogram of the obtained accuracies for these 100 random datasets. In Fig. 5, we compare the accuracy of the ConCURL model and the baseline ID model trained on ImageNet-10 on both random-10 and random-15. Along with the histogram, we show a Gaussian distribution (along the red dotted line) with first and second moments equal to the average and standard deviation of all accuracies, respectively. Similarly, in Fig. 6, we show the accuracies obtained based on models trained on ImageNet-Dogs. Among the models trained on ImageNet-10, the baseline ID model performs slightly better than the proposed ConCURL model. The trend is reversed for the evaluation based on the model trained on ImageNet-Dogs, where ConCURL performs better than the baseline model. Even though the proposed method performs best with the max-performance strategy, it performs slightly worse on random-10. This result strengthens our argument regarding the need to go beyond the traditional reporting of maximum performance based on the ACC, NMI and ARI metrics.

Fig. 5
figure 5

Histogram of clustering accuracies for models trained on ImageNet-10: (a) ConCURL model evaluated on random-10, (b) ConCURL model evaluated on random-15, (c) Baseline (ID) model evaluated on random-10, (d) Baseline (ID) model evaluated on random-15

Fig. 6
figure 6

Histogram of clustering accuracies for models trained on ImageNet-Dogs: (a) ConCURL model evaluated on random-10, (b) ConCURL model evaluated on random-15, (c) Baseline (ID) model evaluated on random-10, (d) Baseline (ID) model evaluated on random-15

4.3 Cluster visualizations

In this section, we provide two visualizations of the ImageNet-10 dataset. In Fig. 7, each row presents randomly drawn images from each cluster. Images that have red-and-yellow borders are categorized incorrectly and should belong to different clusters. For example, the first image in the fourth row should be in the truck category, but it is categorized as an airliner. In the soccer ball category, there are two mistakes: the first image should be categorized as a truck, and the fourth image should be categorized as a dog. In the ninth row, the last image should be categorized as an airship, but it is categorized as a truck.

Fig. 7
figure 7

Images from the same cluster: In ImageNet-10, we randomly sample six images from all ten clusters and show them above. Each row presents one cluster and images that have red-and-yellow borders are categorized incorrectly and should belong to different clusters

To check whether the learned representations that are closest to each other belong to the same category, we use a retrieval task. In Fig. 8, we show the results. The first image in each row was used as a query (random samples from the dataset), and the five images nearest to the query image were retrieved using their representations. Ideally, one would expect all retrieved images to belong to the same category as the query image. For the example from the soccer ball category, the first image retrieved does not belong to this category; however, both images have water as their main feature. In the last row, the second image retrieved is a penguin and ideally should not be one of the closest matches to an image in the orange category.

Fig. 8
figure 8

Image retrieval: The first image in each row was used as a query (random samples from the dataset ImageNet-10), and the five images nearest to the query image were retrieved using their representations. Images with red-and-yellow borders are retrieved incorrectly

5 Ablation studies

Although the proposed method is trained in an end-to-end manner, each component of the method may have a different impact on the results. We conduct various controlled experiments to quantify the impact of the losses (Sect. 5.1), the data augmentation methods (Sect. 5.2), the image resolution (Sect. 5.3), the number of transformations (Sect. 5.4), the dimensionality of each transformation (Sect. 5.4) and the architecture choice (Sect. 5.5).

5.1 Effects of the loss terms

We consider the following scenarios: training with only the loss \(L_{\mathcal Z}\) (consensus loss), training with only the loss \(L_{b}\) (ID), and training with both losses \(L_{b} + L_{\mathcal Z}\) (ConCURL). We do this for the CIFAR-10 and CIFAR100-20 datasets. We then compare the clustering performance of all three scenarios and observe that training with both losses improves the performance. Additionally, we also compare the loss trajectories during training: we compare the \(L_b\) loss for the case when we train with only \(L_b\) and for the case when we train with only \(L_{\mathcal {Z}}\), and we repeat this for the loss \(L_{\mathcal {Z}}\). The results are summarized in Table 9, where we observe that training with both losses provides much better clustering performance than training with either loss individually.

Table 9 Ablation on the losses during training

The exemplar consistency loss trains with the objective of classifying each data point into its own class. The population and consensus consistency losses train without regard for discrimination among individual data points. An algorithm that is trained with only the consensus loss is therefore not effective in discriminating individual data points. From Fig. 9(a), we can observe that when we train with only \(L_{\mathcal {Z}}\), we do not observe any improvement in the loss \(L_{b}\); when we train with \(L_b+L_{\mathcal {Z}}\), we observe a trajectory similar to that of training with only \(L_b\). This shows that training with \(L_{\mathcal {Z}}\) does not conflict with the \(L_b\) loss.

On the other hand, it is possible that an algorithm that is trained to discriminate individual data points can show some improvement on the consensus loss \(L_{\mathcal {Z}}\). From Fig. 9(b), we can observe that \(L_{\mathcal {Z}}\) is lower when training with \(L_b+L_{\mathcal {Z}}\) than when training with only \(L_{\mathcal {Z}}\). We also observe a small decrease in the \(L_{\mathcal {Z}}\) value when training with only \(L_b\). This shows that optimizing \(L_b\) helps to some extent in achieving a better \(L_{\mathcal {Z}}\).

Fig. 9
figure 9

We perform an ablation over the losses used and compare the loss values for each of the cases for CIFAR100-20

From this discussion, we observe that the two losses contribute differently to the training without much interference or conflict. Indeed, they complement each other, as we observe improved clustering performance for ConCURL in Table 9 and Fig. 10.

Fig. 10
figure 10

We compare the clustering metrics for the ablation over the losses used for CIFAR100-20 and observe that training with \(L_b+L_{\mathcal {Z}}\) (ConCURL) outperforms training with only \(L_b\) or \(L_{\mathcal {Z}}\)

5.2 Effects of data augmentation methods

Augmenting the training data is a standard technique for training deep learning methods (Shorten & Khoshgoftaar, 2019). The backbones used in this study rely on the different views that are generated by applying different augmentations to the input image. Recently, Tian et al. (2020) investigated the impact of data augmentation on contrastive learning methods and shed some light on this topic. In our setting, we would like to quantify the impacts of several data augmentations on the consensus loss.

In Table 10, we show the maximum accuracy achieved when all data augmentation approaches are used and when we skip one data augmentation technique at a time. When all data augmentation methods are used, the maximum accuracies achieved are 0.8459 and 0.4798 on CIFAR-10 and CIFAR100-20, respectively. Dropping random resized cropping causes the largest drop in accuracy for both datasets, followed by color jitter. The other data augmentation techniques are important for obtaining the best possible accuracy but do not have as much of an effect as color jitter and random resized cropping. In Figs. 11 and 12, we show how the running mean of the accuracy progresses during training for each of the experiments in Table 10.

Table 10 Data augmentation details
Fig. 11
figure 11

Effect of data augmentation on CIFAR-10

Fig. 12
figure 12

Effect of data augmentation on CIFAR100-20

5.3 Effect of image resolution

Image resolution is often considered a free parameter (Niu et al., 2020a); however, its effect on clustering performance is not evaluated rigorously in most works. We try to quantify the effects of different resolutions to the greatest extent possible, given that some datasets are available only at specific resolutions. For STL-10, we use \(32\times 32\), \(64\times 64\) and \(96\times 96\) resolutions. For ImageNet-10 and ImageNet-Dogs, we use \(96\times 96\), \(160\times 160\) and \(224\times 224\) resolutions. The results are given in Table 11.

Table 11 Effects of different resolutions for STL-10, ImageNet-10 and ImageNet-Dogs

The best performance for ImageNet-10 and ImageNet-Dogs is obtained at a resolution of 160, and for STL-10, the best performance is obtained at a resolution of 96. It is not clear why ImageNet-10 and ImageNet-Dogs do not yield the best performance at high resolutions, and further investigation is needed; we keep this as an open problem.

5.4 Distribution of accuracies across the set of hyperparameters

Table 12 Hyperparameters and the range values used for the experiments
Table 13 Hyperparameters for obtaining maximum performance
Fig. 13
figure 13

Components of the consensus loss; ablation of STL-10 and CIFAR100-20

The proposed consensus loss has two parameters. The first is the number of transformations used, and the second is the dimensionality of the projection space. To understand the proposed loss, we conduct a detailed experimental study on STL-10 and CIFAR100-20 (see footnote 3). The hyperparameters used are given in Table 12.

Due to the sheer number of conducted experiments, we supply the summary statistics obtained on a random set. We report the empirical mean and standard deviation of the marginal distribution of the quantity under investigation. Let \(P_{\tau ,\eta ,d,l}\) be the joint distribution over the hyperparameters \(\tau \) (temperature parameter), l (learning rate), \(\eta \) (natural log of the number of transformations) and d (dimensionality of the projection space). We consider \(n_h\) as the number of distinct values used in the experiment for each hyperparameter \(h \in \{ \tau ,\eta ,d,l \}\) based on Table 12. We denote the accuracy from each experiment based on the hyperparameters used as \(a_{\tau ,\eta ,d,l}\). Let \(P_{h_i \vert h_j}\) be the conditional marginal distribution of hyperparameter \(h_i\) given \(h_j\) and the conditional empirical mean of \(P_{h_i \vert h_j}\) be \(m(P_{h_i \vert h_j})\). In this case, the conditional empirical mean \(m(P_{h_i \vert h_j})\) when \(h_i = d\) and \(h_j=\tau \) can be calculated using \(m(P_{d \vert \tau }) = \frac{1}{n_{\eta } \times n_{l}} \sum _{\eta } \sum _{l} a_{\tau ,\eta ,d,l}\). The conditional empirical means and standard deviations of other hyperparameters are calculated in the same way. In Fig. 13, we show each conditional empirical mean with a blue dot, and each red line around a dot represents one standard deviation. For both STL-10 and CIFAR100-20, we see a trend regarding the number of projections. For STL-10, the smaller the number of random projections, the better the results are, and for CIFAR100-20, increasing the number of random projections is helpful for improving the clustering accuracy up to some point. Note that when the number of random projections is equal to zero, our setting is equivalent to the baseline ID model, and our approach always performs better than ID. This means that the optimal number of random projections is greater than or equal to one. There is no such clear trend for the dimensionality of the random projections.
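A small sketch of this marginalization over a hypothetical grid of accuracies indexed as (\(\tau , \eta , d, l\)) is shown below; the grid values and sizes are illustrative, not our experimental results.

```python
import numpy as np

def conditional_mean(acc, i_axis, j_axis):
    """m(P_{h_i | h_j}): empirical mean of the accuracy over the two
    hyperparameters other than h_i and h_j; acc is a 4-D grid indexed as
    (tau, eta, d, l)."""
    other = tuple(ax for ax in range(acc.ndim) if ax not in (i_axis, j_axis))
    return acc.mean(axis=other)

# Example with a hypothetical accuracy grid over the ranges in Table 12:
acc = np.random.rand(4, 3, 3, 2)                            # (n_tau, n_eta, n_d, n_l), illustrative
m_d_given_tau = conditional_mean(acc, i_axis=2, j_axis=0)   # shape (n_tau, n_d)
```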

Fig. 14
figure 14

The dotted red lines show the accuracy of the baseline, i.e., ID Tao et al. (2021), on the corresponding dataset, and DE is the density estimate of the empirical distribution: (a) Empirical accuracy distribution for STL-10, (b) Empirical accuracy distribution for CIFAR100-20

The max-performance procedure provides some insights into the performance of the algorithms at hand, although it does not provide the whole picture because it does not consider the robustness of the performance differences. In Table 13, we give the hyperparameters that yield the max performance. Finding a hyperparameter set that yields better performance than the baseline is the core idea behind the max-performance procedure. We therefore ask the following question: given a hyperparameter grid, how likely is our method to achieve better accuracy than the baseline? In Fig. 14, we report the empirical accuracy distributions on STL-10 and CIFAR100-20 for all hyperparameters given in Table 12. The red dotted lines show the corresponding baseline accuracy for each dataset. For STL-10, only approximately \(12.5\%\) of the hyperparameter sets yield better results than the baseline. On the other hand, for CIFAR100-20, approximately \(90\%\) of the hyperparameter sets yield better results than the baseline. In other words, it does not require a significant amount of computational power to find a better model than the state-of-the-art models for CIFAR100-20; however, the situation is the opposite for STL-10. The results given in Fig. 14 suggest that when comparing models, multiple metrics need to be considered, not only the max-performance procedure.

5.5 Effect of architecture choice

Fig. 15
figure 15

Empirical distribution of the performance difference between Residual Network (ResNet)-50 and ResNet-18. DE is the density estimate of the empirical distribution

In this work, we use ResNet-18 and ResNet-50 as network architectures. For both ResNet-18 and ResNet-50, we sweep over the same set of hyperparameter choices, i.e., the temperature, number of projections and projection dimensionality, and report the results for the ImageNet-10 dataset with an image resolution of 160\(\times 160\). Figure 15 shows the distribution of \(\Delta _{acc}\), which is defined as the accuracy difference between ResNet-50 and ResNet-18. Figure 15 indicates that ResNet-50 slightly outperforms ResNet-18, i.e., the mean difference is approximately \(0.5\%\).

5.6 Runtime comparison

Fig. 16
figure 16

Comparison of the per-epoch runtimes of ID and ConCURL on the CIFAR-10 dataset. We vary the number of random transformations used in the computation of the consensus loss (shown in parentheses)

To study the runtime of the proposed method, we compare the time taken per epoch for the baseline ID algorithm and the proposed algorithm. Due to the additional loss computation, the time taken to run the proposed algorithm is higher, as can be observed from Fig. 16. The additional time is mainly due to computing the consensus loss for the different transformations. The current implementation computes the forward passes for the different transformations sequentially, thus increasing the runtime. However, a more time-efficient implementation, in which the forward passes for the different random transformations are computed in parallel, would make the runtime more comparable to that of the baseline ID algorithm.

6 Conclusion

In this work, we introduce different notions of the consistency constraints that are enforced in different unsupervised/self-supervised learning algorithms. We propose a novel clustering algorithm that seamlessly incorporates all three consistency constraints (exemplar, population and consensus) and achieves state-of-the-art clustering results for four out of five popular and challenging computer vision datasets. Our work on consensus clustering is significantly different from earlier consensus clustering works that do not learn representations. Moreover, we initiate a discussion on the adequacy of the currently used methods for evaluating clustering algorithms. We significantly extend the evaluation procedure for clustering algorithms, thereby reflecting the challenges of applying clustering to real-world tasks. We provide evaluation results for ConCURL and other state-of-the-art clustering algorithms based on max-performance criteria, according to which ConCURL outperforms other algorithms on most datasets. However, its average performance according to out-of-distribution criteria highlights the need to use the proposed evaluation methods for deep clustering algorithms.