Sanitized Clustering against Confounding Bias

Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. Therein, the clustering factor of interest and the confounding factor are coarsely considered in the raw feature space, where the correlation between the data and the confounding factor is ideally assumed to be linear for convenient solutions. These approaches are thus limited in scope, as data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by a Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that our SCAB achieves a significant gain in clustering performance by removing the confounding bias. The code is available at \url{https://github.com/EvaFlower/SCAB}.


Introduction
Clustering is an essential technique for unsupervised data analysis, whose objective is to partition samples into groups such that samples in the same group are similar while those from different groups are significantly different (Jain et al, 1999). Standard clustering methods (Cheng, 1995; Modha and Spangler, 2003; Xie et al, 2016) are capable of capturing the desired semantic structure embedded in clean raw data. However, biases are inherently present in real-world datasets, as they emerge from data collected across diverse times, scenarios, or platforms (Listgarten et al, 2010; Jacob et al, 2016; Li et al, 2020). These biases may introduce confounding factors that bring spurious correlations (Wu et al, 2023), obscuring the true underlying clustering structure (Listgarten et al, 2010); we term them confounding biases in this paper. Despite the inevitable presence of data biases, we argue that the bias information can be identified by domain experts (Chierichetti et al, 2017; Benito et al, 2004) and is easily accessible (e.g., the data source is usually denoted in metadata). In this study, we perform clustering while removing the negative effect of the bias.
Previous methods (Jacob et al, 2016; Gagnon-Bartsch and Speed, 2012) simply project the raw data onto the subspace orthogonal to the space spanned by the confounding factor, under a linear assumption, before clustering. Specifically, they decompose the data into a linear combination of the desired clustering factor and the confounding factor. In this linear space, they remove the bias information by simply subtracting the confounding covariate from the data. In parallel, Benito et al (2004) applied a modified SVM that finds a linear hyperplane separating two classes (i.e., a binary confounding factor indicating the data source) in a supervised manner and then projects the raw data onto this hyperplane. Such a method cannot scale to scenarios with multiple classes beyond a binary factor, as argued in Johnson et al (2007). In summary, former approaches are limited to the raw feature space, which may not capture high-level, representative features describing the clustering factor of interest as well as the confounding factor. In addition, they only consider linear dependence between the data and the confounding factor, which oversimplifies real situations. These two flaws prevent the methods from applying to complex real-world data, where both the confounding factor caused by biases and the clustering factor of interest are non-linearly embedded in the raw data.
In this paper, we introduce a new clustering framework (Fig. 2), Sanitized Clustering Against confounding Bias (SCAB), applicable to high-dimensional complex biased data. SCAB is equipped with a deep representation learning module and a non-linear dependence measure to effectively eliminate bias information for superior clustering in the latent space. Specifically, SCAB learns a clustering-favorable representation invariant to the biases within the VAE architecture (Kingma and Welling, 2014). The removal of bias information is achieved by minimizing the mutual information between the latent representation and the confounding factor induced by biases (also interpreted as disentanglement between the representation and the confounding factor later in the paper). A tailor-designed clustering module is incorporated into the VAE to cluster over the invariant representation. Benefiting from the non-linear dependence measure, SCAB can obtain a precise clustering structure in the latent space of complex data that is robust to the biases. We summarize our contributions as follows:

• We propose SCAB, the first deep clustering framework for clustering complex data contaminated by confounding biases. Unlike existing related studies, SCAB performs semantic clustering in the latent space while minimizing the non-linear dependence between the latent representation and the biases.

• Our theoretical analyses reveal that in SCAB, (1) the clustering loss maximizes a lower bound of the mutual information between the data representation and the desired clustering structure; (2) the bias-removal loss minimizes an upper bound of the mutual information between the data representation and the confounding factor induced by biases.

• We conduct extensive experiments on seven biased datasets. Empirical results demonstrate the superiority of sanitized clustering with the removal of confounding biases: our SCAB consistently achieves better results than existing baselines.

Problem statement and related work
We first introduce standard clustering, which neglects data biases. Then, we motivate our problem setting, where the data contains confounding biases, and discuss the deficiencies of existing work. Last, we compare our setting with two related clustering branches and discuss the issues that arise when their methodologies are applied to our setting.

Standard clustering
Consider a dataset X = [x_1, x_2, …, x_N]^T ∈ R^{N×D} consisting of N samples with D features. Standard clustering partitions the dataset X into K groups by maximizing intra-cluster similarity and minimizing inter-cluster similarity:

min_{S_x ∈ S_{K,x}} F(S_x), (1)

where S_{K,x} denotes all feasible K-partitions of X, S_x is a K-partition in the raw feature space, and F is the clustering objective, whose minimization optimizes the quality of the clustering.
For instance, the k-means clustering objective is F(S_x) = Σ_{n=1}^{N} Σ_{k=1}^{K} s_{nk} ∥x_n − e_k∥², where e_k is the k-th cluster centroid and s_{nk} ∈ {0, 1} denotes the cluster assignment, which equals 1 if x_n is assigned to the k-th cluster and 0 otherwise.
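As a concrete reference, a minimal NumPy sketch of Lloyd's algorithm for this objective (alternating the hard assignments s_nk and the centroid updates e_k; the function name and the naive initialization are illustrative, not the paper's implementation):

```python
import numpy as np

def kmeans(X, K, n_iter=50):
    """Minimal Lloyd's algorithm for the k-means objective
    F = sum_n sum_k s_nk * ||x_n - e_k||^2 (illustrative sketch)."""
    E = X[:K].copy()  # naive init: first K samples (k-means++ is common in practice)
    for _ in range(n_iter):
        d = ((X[:, None, :] - E[None, :, :]) ** 2).sum(-1)  # N x K squared distances
        s = d.argmin(1)                                     # hard assignments s_nk
        for k in range(K):                                  # centroid update e_k
            if (s == k).any():
                E[k] = X[s == k].mean(0)
    return s, E
```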
While classical approaches (Cheng, 1995) conduct clustering in the raw feature space, recent deep clustering methods (Xie et al, 2016; Guo et al, 2017; Huang et al, 2020; Niu et al, 2022) explore clustering-favorable latent representations for better structure discovery. However, when the data exhibits obvious variances resulting from biases, all standard clustering methods are unavoidably distracted by the confounding biases and their clustering performance degenerates (see Tables 3 and 4).

Clustering data contaminated by confounding biases
When the data is collected from multiple sources or under different conditions, each source may carry its own biases. These biases can mask genuine similarities or differences between data points, distorting the desired clustering results (Jacob et al, 2016). In this case, the data source acts as a confounding factor that obscures the accurate clustering structure. Confounding factors that bias the clustering results in other scenarios can also be identified by domain experts. For instance, in facial recognition, whether people are wearing glasses can impair identity recognition (Sharif et al, 2016).
In order to deliver a precise clustering structure, we consider removing the influence of these confounding biases. We suppose such bias information can always be described by a label indicator, which is an effective encoding for the confounding factor (e.g., a source indicator denoting whether the data is from source 1, 2, etc.). Given the complete instance-wise confounding factor, we define our problem setting as follows.

Definition 1 (Sanitized clustering with the removal of confounding bias). Let C ∈ {0, 1}^{N×G} be the corresponding labels with regard to a certain confounding factor c, where C_{i,j} = 1 if x_i belongs to class j and C_{i,j} = 0 otherwise; G is the number of categories. Our goal is to find a partition S_x ∈ S_{K,x} such that S_x is uninformative of c. The objective is:

min_{S_x ∈ S_{K,x}} F(S_x), s.t. S_x ⊥ c, (2)

where ⊥ denotes that two variables are independent.
Existing work. Some works (Jacob et al, 2016; Listgarten et al, 2010; Gagnon-Bartsch and Speed, 2012) targeting the problem (Definition 1) are built on a linear model that assumes the confounding factor is linearly correlated with the data. Mathematically, let A ∈ {0, 1}^{N×K} denote a group assignment matrix, and let each row of α ∈ R^{K×D} denote a cluster centroid. Suppose C ∈ {0, 1}^{N×G} represents the class matrix converted from the confounding factor c, and each row of β ∈ R^{G×D} denotes the centroid of the corresponding category. The linear model is then formulated as:

X = Aα + Cβ + ε, (3)

where ε denotes some prior noise. β can be estimated via a regression model by setting Aα = 0 (Jacob et al, 2016). By subtracting the bias term Cβ, a purified dataset X̃ is obtained:

X̃ = X − Cβ. (4)

Then, a regular clustering method like k-means is conducted on X̃ to obtain a partition S_x (i.e., A and α). Under the linear assumption, the obtained partition thus satisfies the independence constraint, namely, S_x ⊥ c.

Deficiencies that make existing approaches impractical for high-dimensional complex data. (1) They are developed in the raw feature space, which is insufficient to discover the underlying structures in terms of the factor of interest as well as the confounding factor, i.e., α and β in Eq. (3). (2) Only linear dependence is explored. The removal of the confounding factor is simply via a linear projection, i.e., Eq. (4), which fails when the data has a non-linear dependence on the confounding factor.
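The subtraction in Eq. (4) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming β is estimated by ordinary least squares under Aα = 0 as described above; the function name is ours, not RUV's API:

```python
import numpy as np

def remove_linear_bias(X, C):
    """Estimate the per-class bias centroids beta by regressing X on the
    one-hot confounder matrix C (i.e., fitting Eq. (3) with A@alpha = 0),
    then return the purified data X - C@beta as in Eq. (4)."""
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)
    return X - C @ beta
```

With a one-hot C, the least-squares β is simply the per-source mean, so the purification centers every source at the same point, which is exactly the linear bias removal the paragraph describes.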

Related clustering branches
Alternative clustering (Wu et al, 2018) finds an alternative structure based on an existing clustering result to unveil a new perspective of the dataset. Niu et al (2013) and Wu et al (2019) pursued a novel clustering while minimizing its dependence on the given clustering structure. In particular, the relevance is measured by a specific kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). Given a dataset X ∈ R^{N×D}, let Y = [y_1, y_2, …, y_N]^T ∈ {0, 1}^{N×K_0} be an existing clustering result over X, where K_0 is the cluster number; y_{i,j} = 1 if x_i belongs to the j-th cluster and y_{i,j} = 0 otherwise. The aim is to obtain an alternative structure U ∈ R^{N×K} with K clusters on a subspace of lower dimension Q ≪ D.
The objective is defined as:

max_{U, W} HSIC(XW, U) − λ HSIC(XW, Y), (5)

where W ∈ R^{D×Q} denotes the projection matrix and λ trades off the quality of the new structure against its independence from Y. The solution of Eq. (5) can be found in Niu et al (2013) and Wu et al (2018).
Alternative clustering vs. our setting (Def. 1). Although starting from a different motivation, Eq. (5) can serve as a practical implementation of Eq. (2) by replacing the given clustering structure with the confounding factor. However, obtaining the subspace irrelevant to the confounding factor by a linear projection is not suitable for high-dimensional complex datasets where the factor is a high-level semantic feature. Meanwhile, such a technique requires storing a full batch of data for clustering, which incurs a heavy memory complexity of O(N²).
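For reference, the (biased) empirical HSIC estimator used as the dependence measure above can be computed directly from two kernel matrices. A small sketch, using linear kernels for simplicity (kernel choice and names are illustrative):

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC estimator: trace(K H L H) / (n-1)^2,
    where H = I - 11^T/n is the centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def linear_kernel(X):
    """Linear kernel Gram matrix K_ij = <x_i, x_j>."""
    return X @ X.T
```

HSIC is zero (in expectation) for independent variables and grows with statistical dependence, which is why minimizing HSIC(XW, Y) pushes the projected data away from the existing clustering.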
Fair clustering extends group fairness (Feldman et al, 2015) to clustering: it explores the clustering structure while ensuring a balanced proportion within each cluster with regard to some specified sensitive attribute (Chierichetti et al, 2017). With a slight abuse of notation, suppose X can be represented as the disjoint union of H protected subgroups in terms of some sensitive attribute a, i.e., X = X_1 ∪ X_2 ∪ … ∪ X_H with X_h ∩ X_{h'} = ∅ for h ≠ h'. For a clustering result S_x ∈ S_{K,x}, the balance of each cluster S_k and of the whole clustering result S_x can be respectively defined as:

balance(S_k) = min_{h ≠ h'} |X_h ∩ S_k| / |X_{h'} ∩ S_k|, balance(S_x) = min_k balance(S_k). (6)

The higher the balance of each cluster, the fairer the clustering result. A (T, K)-fair clustering is defined as:

min_{S_x ∈ S_{K,x}} F(S_x), s.t. balance(S_x) ≥ T, (7)

where T controls the degree of fairness for clustering. Eq. (7) pursues a partition where each cluster approximately maintains the same ratio over the sensitive attribute as that in the whole dataset (Chierichetti et al, 2017; Kleindessner et al, 2019).

Fair clustering vs. our setting (Def. 1). Both fair clustering and our problem setting require information about some specific attribute (factor) before conducting clustering. However, fair clustering aims to deliver a clustering structure that meets fairness criteria over a certain sensitive attribute, and clustering performance degrades when imposing such an extra fairness constraint (Chierichetti et al, 2017). In contrast, our target is to improve clustering by eliminating the effect of the confounding factor that distracts the clustering results. Therefore, fair clustering methods (Eq. (7)) cannot be applied to our setting, except for a recent deep fair clustering (DFC) method (Li et al, 2020). DFC learns fair representations for clustering and adopts fairness criteria claimed to be stronger than the balance criterion (Eq. (6)). It introduces an adversarial training paradigm in the context of deep standard clustering to encourage clustering structures to be independent of the sensitive attribute. This form of fair clustering objective is the same as ours (Eq. (2)) when the sensitive attribute is designated as the confounding factor. However, the adversarial training increases the difficulty of model training and requires an extra complex constraint to maintain the clustering structure.
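The balance criterion in Eq. (6) is straightforward to compute from cluster assignments and attribute labels; a minimal sketch with illustrative names:

```python
import numpy as np

def balance(assign, attr):
    """Balance of a clustering (Eq. (6)): for each cluster S_k, take the
    minimum ratio |X_h ∩ S_k| / |X_h' ∩ S_k| over pairs of attribute
    groups (equivalently min count / max count), then take the minimum
    over clusters. Higher means fairer; 1.0 is perfectly balanced."""
    scores = []
    for k in np.unique(assign):
        counts = np.array([np.sum((assign == k) & (attr == h))
                           for h in np.unique(attr)])
        scores.append(counts.min() / counts.max() if counts.max() > 0 else 0.0)
    return min(scores)
```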

Sanitized clustering against confounding bias
This section presents a new framework SCAB to deliver desired clustering structures on complex datasets contaminated by confounding biases.

Deep semantic clustering in the latent space
We perform clustering in the latent space to capture the semantic structure of complex data. Consider a general task (e.g., data reconstruction) that involves encoding the data x into its representation z via the posterior q(z | x). The objective of deep semantic clustering combines the objective L for representation learning and the objective F for clustering on the representations (Xie et al, 2016; Boubekki et al, 2021). Namely,

min_{q, S_z ∈ S_{K,z}} L(q, x) + ηF(S_z), (8)

where S_z denotes a partition in the space where z resides, S_{K,z} is defined similarly to S_{K,x} in Eq. (1), and η is a trade-off parameter that balances representation learning and clustering.
In particular, we choose the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) to compute L(q, x), because VAE explicitly models q(z | x), and VAE-based clustering obtains good clustering-favorable representations and is effective on various complex datasets (Jiang et al, 2017).

Clustering on representations invariant to confounding factor
Eq. (8) conducts semantic clustering without considering the existence of the confounding bias. To eliminate the negative impact of the bias on the target clustering structure S_z, we propose deep semantic clustering independent of the confounding factor c. Recalling Eq. (2), our objective is formulated as:

min_{q, S_z ∈ S_{K,z}} L(q, x) + ηF(S_z), s.t. S_z ⊥ c. (9)

Since a partition S_z is defined over the whole dataset while c is collected per sample, directly implementing S_z ⊥ c is complex and incurs large computational costs. Instead, we impose an alternative independence constraint between the sample representation z and the confounding factor c, i.e., z ⊥ c, both of which are defined at the sample level.

Proposition 1. Let Z be the representation space, and Z = {z_1, z_2, …, z_N} ∈ Z be the representation set of the dataset X. Suppose the clustering algorithm A takes Z as input and returns a partition S_z of Z, namely, A : Z → S_z. If z ⊥ c, then we naturally have S_z ⊥ c.

Proposition 1 shows that clustering over representations z that are invariant to the confounding factor c derives a clustering structure S_z that is uninformative of c. Thus, our objective can be reformulated as:

min_{q, S_z ∈ S_{K,z}} L(q, x) + ηF(S_z), s.t. z ⊥ c. (10)

The independence constraint z ⊥ c is still a strong condition and is difficult to optimize directly. We approximate it by minimizing the mutual information I(z, c) (Moyer et al, 2018). Adding the term I(z, c), the objective Eq. (10) becomes:

min_{q, S_z ∈ S_{K,z}} L(q, x) + η_1 I(z, c) + η_2 F(S_z), (11)

where η_1 and η_2 are hyper-parameters that balance the three losses. In Eq. (11), the clustering factor of interest, which is embedded in the representation z, and the confounding factor c can be semantically described in the latent space (Xie et al, 2016; Vincent et al, 2010). Meanwhile, these two factors are disentangled in the latent space. By optimizing Eq. (11), we obtain a semantic clustering structure S_z that is irrelevant to the confounding factor c.

The overall clustering framework: SCAB
To summarize, our framework jointly trains three modules. First, the VAE structure is adopted as the feature extraction module for learning semantic features. Second, we introduce a disentangling module over the latent space derived by the VAE to disentangle the confounding factor c from the other salient information z encoded in the data (i.e., z ⊥ c). Last, a clustering module based on soft k-means is incorporated within the VAE structure to perform clustering on the factor of interest (embedded in z) only.

Variational autoencoder
Accordingly, we can formulate the statistical (non-linear) dependence between x and c in the latent space, i.e., p(x, z, c) = p(z, c)p(x | z, c) where z is the latent variable of x.
Similar to VAE (Kingma and Welling, 2014), a variational lower bound for the expectation of the conditional log-likelihood E_{(x,c)}[log p(x | c)] can be derived as follows:

E_{(x,c)}[log p(x | c)] ≥ E_{(x,c)}[E_{q(z|x)}[log p(x | z, c)] − KL(q(z | x) ∥ p(z))]. (12)

The conditional decoder p(x | z, c) takes both z and c as input. We simplify the distribution of z to depend solely on the input x, optimized by the encoder q(z | x). p(z) is the prior distribution, defined as a standard Gaussian. We parameterize the approximate posterior q(z | x) with an encoder f_ϕ that encodes a data sample x into its latent embedding z, and parameterize the likelihood p(x | z, c) with a conditional decoder g_θ that produces a data sample conditioned on both the latent embedding z and the observed confounding factor c. Usually, one particle z_n is sampled from q(z | x_n) for reconstructing x_n (Kingma and Welling, 2014). Then, the loss function (to be minimized) based on the Monte Carlo estimation of the variational lower bound in Eq. (12) is defined as:

L_VAE = Σ_{n=1}^{N} [ℓ_r(x_n, g_θ(z_n, c_n)) + KL(q(z | x_n) ∥ p(z))], (13)

where ℓ_r denotes the reconstruction loss, which can be instantiated with mean squared loss or cross-entropy loss. L_VAE is used to calculate the first term L(q, x) in Eq. (11).
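A per-sample version of Eq. (13) can be written in closed form under the standard choices of a diagonal Gaussian posterior and a mean-squared reconstruction loss. This is a sketch under those assumptions, with illustrative names:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Per-sample negative ELBO (Eq. (13)): squared-error reconstruction
    loss l_r plus the closed-form KL between q(z|x) = N(mu, diag(exp(logvar)))
    and the standard Gaussian prior p(z) = N(0, I)."""
    rec = ((x - x_recon) ** 2).sum()
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum()
    return rec + kl
```

The KL term is zero exactly when the posterior matches the prior (mu = 0, unit variance), so the loss reduces to pure reconstruction in that case.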

Disentanglement by minimizing mutual information
By minimizing the mutual information between the latent variable z and the confounding factor c, the bias information is disentangled from other salient information in the latent space.
Lemma 1 (MI upper bound (Moyer et al, 2018)). The mutual information between the latent representation z and the confounding factor c, i.e., I(z, c), admits an upper bound:

I(z, c) ≤ −H(x | c) + E_{(x,c)}[−E_{q(z|x)}[log p(x | z, c)]] + E_x[KL(q(z | x) ∥ q(z))]. (14)

As I(z, c) is not directly computable, we use its upper bound (Eq. (14)). The constant H(x | c) can be ignored. The second term is a reconstruction loss (Eq. (13)). The third term on the right of Eq. (14) is intractable to compute and is approximated by pairwise distances as follows (Moyer et al, 2018):

E_x[KL(q(z | x) ∥ q(z))] ≈ (1/N²) Σ_{n=1}^{N} Σ_{n'=1}^{N} KL(q(z | x_n) ∥ q(z | x_{n'})). (15)

The loss function is finally defined as:

L_MI = Σ_{n=1}^{N} ℓ_r(x_n, g_θ(z_n, c_n)) + (1/N) Σ_{n=1}^{N} Σ_{n'=1}^{N} KL(q(z | x_n) ∥ q(z | x_{n'})). (16)

The minimization of I(z, c), the second term in Eq. (11), is thus replaced by the minimization of its upper bound, i.e., L_MI.
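For diagonal Gaussian posteriors, the pairwise term in Eq. (15) reduces to closed-form Gaussian KLs. A direct O(N²) sketch of that surrogate (names are illustrative; a practical implementation would vectorize and work on mini-batches):

```python
import numpy as np

def pairwise_kl(mu, logvar):
    """Average KL(q(z|x_i) || q(z|x_j)) over all sample pairs, i.e., the
    pairwise surrogate of Moyer et al. (2018) for the intractable term
    E_x[KL(q(z|x) || q(z))]. Each posterior q(z|x_n) is a diagonal
    Gaussian N(mu_n, diag(exp(logvar_n)))."""
    n, d = mu.shape
    total = 0.0
    for i in range(n):
        for j in range(n):
            var_i, var_j = np.exp(logvar[i]), np.exp(logvar[j])
            total += 0.5 * ((var_i / var_j).sum()
                            + (((mu[j] - mu[i]) ** 2) / var_j).sum()
                            - d + (logvar[j] - logvar[i]).sum())
    return total / (n * n)
```

When all posteriors coincide the surrogate is zero, and it grows as the per-sample posteriors spread apart, which is exactly what the minimization of L_MI penalizes.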

Clustering over the c-invariant embedding
Eq. (15) helps filter the information of the confounding factor c out of the latent code z. For the sake of efficiency, we apply the k-means algorithm to conduct clustering on the c-invariant embedding z. In particular, the k-means clustering loss is defined as:

L_cluster = Σ_{n=1}^{N} Σ_{k=1}^{K} s_{nk} ∥z_n − e_k∥²,

where e = {e_1, e_2, …, e_K} is the collection of K centroids and s_{nk} is the group assignment that assigns the latent embedding z_n to its closest clustering centroid. Namely,

s_{nk} = exp(−τ∥z_n − e_k∥²) / Σ_{k'=1}^{K} exp(−τ∥z_n − e_{k'}∥²), (17)

where k = 1, 2, …, K, and τ is the temperature, set to 5 in the experiments. L_cluster is used to compute the third term F(S_z) in Eq. (11).
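The soft assignment of Eq. (17) is a temperature-sharpened softmax over negative squared distances; a small sketch (the exact scaling by τ is an assumption about the parameterization):

```python
import numpy as np

def soft_assignment(Z, E, tau=5.0):
    """Soft cluster assignments s_nk as in Eq. (17): a softmax over negative
    squared distances to the centroids, sharpened by the temperature tau
    (set to 5 in the experiments). Large tau pushes s_n toward a one-hot
    vector on the closest centroid."""
    d = ((Z[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # N x K squared distances
    logits = -tau * d
    logits -= logits.max(1, keepdims=True)               # numerical stability
    s = np.exp(logits)
    return s / s.sum(1, keepdims=True)
```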
Due to the reconstruction loss in VAE (Eq. (13)), the latent representations would contain many sample-specific details, which is detrimental to clustering. We follow Pan and Tsang (2021) and introduce the following skip-connection formulation to unify the reconstruction goal and the clustering goal. Namely,

ẑ_n = h_ψ(z_n, z̄_n), where z̄_n = Σ_{k=1}^{K} s_{nk} e_k. (18)

Note that z̄_n is (close to) one of the K clustering centroids, as s_n is a (nearly) one-hot assignment. h_ψ constructs a new latent representation ẑ_n that incorporates not only the original c-invariant embedding z_n but also its assigned clustering centroid z̄_n as the input of the decoder. h_ψ is implemented as a linear layer.
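A minimal sketch of this skip-connection head, assuming h_ψ fuses z_n and z̄_n by concatenation followed by the linear layer (the concatenation and the parameters W, b are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def skip_connection(Z, E, S, W, b):
    """Skip-connection head h_psi (Eq. (18), following Pan and Tsang, 2021):
    fuse each embedding z_n with its assigned centroid
    z_bar_n = sum_k s_nk * e_k through a single linear layer."""
    Z_bar = S @ E                                  # soft/one-hot centroid lookup
    return np.concatenate([Z, Z_bar], axis=1) @ W + b
```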

Objective and optimization of SCAB
Integrating all three modules yields our new framework, Sanitized Clustering Against confounding Bias (SCAB) (Fig. 2). Its final objective is formulated as:

min_{Θ, e} L_VAE + η_1 L_MI + η_2 L_cluster, (19)

where Θ = {θ, ϕ, ψ} denotes the network parameters, e denotes the clustering parameters, and η_1 and η_2 are the trade-off parameters.
In Eq. (19), two types of parameters, i.e., the network parameters Θ and the clustering parameters e, are coupled, which hinders joint optimization. We adopt coordinate descent to alternately optimize Θ and e.
To make SCAB scalable to large-scale problems, we adopt stochastic gradient updates for all parameters. However, such an update for the clustering centroids e would be unstable, because the centroids estimated from different mini-batches may differ greatly. To overcome this issue, we apply an exponential moving average (EMA) update to the centroids, since the EMA update yields good stability (Van Den Oord et al, 2017). Specifically, each centroid e_k is updated online using the assigned representations in the mini-batch {z_b}_{b=1}^{B}:

e_k^{(t)} := γ e_k^{(t−1)} + (1 − γ) · (1 / B_k) Σ_{b: s_b = k} z_b, (20)

where B_k is the number of mini-batch samples assigned to cluster k, γ ∈ [0, 1] is a decay parameter (set to 0.995 by default), and t is the iteration index.
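A minimal sketch of this EMA centroid update, in the spirit of Van Den Oord et al (2017); the exact form of the running average (a single decayed average of batch means, with unseen clusters left unchanged) is an assumption:

```python
import numpy as np

def ema_update(E, Z_batch, assign, gamma=0.995):
    """EMA centroid update: e_k <- gamma * e_k + (1 - gamma) * mean of the
    mini-batch representations assigned to cluster k. Clusters with no
    assigned samples in the batch keep their previous centroid."""
    E = E.copy()
    for k in range(len(E)):
        mask = assign == k
        if mask.any():
            E[k] = gamma * E[k] + (1.0 - gamma) * Z_batch[mask].mean(0)
    return E
```

With γ close to 1, each mini-batch nudges the centroids only slightly, which damps the batch-to-batch discrepancy described above.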

Theoretical analysis
In this section, we theoretically show that optimizing the network parameters Θ of SCAB in Eq. (19) is equivalent to (1) maximizing a lower bound of the mutual information between the representation and the interested clustering structure, i.e., max_z I(z, s), while (2) minimizing an upper bound of the mutual information between the representation and the confounding factor, i.e., min_z I(z, c).

Theorem 2. Assume a fixed clustering structure, i.e., the clustering centroids e = {e_1, e_2, …, e_K} and the cluster assignments {s_n}_{n=1}^{N}, where s_n is a K-dimensional one-hot vector and s_{nk} is defined in Eq. (17). The minimization of our clustering objective L_cluster is equivalent to maximizing a lower bound of the mutual information between the representation z and the interested clustering structure, represented by the group assignment s, i.e., I(z, s), given the clustering centroids e.

Proof sketch. Since p(s | z) = ∫∫ p(x, c | z) p(s | x, c) dx dc is intractable, we introduce an auxiliary distribution q(s | z) as an approximation to p(s | z) (Alemi et al, 2017). This step is valid since p(z, s) = ∫∫ p(x, c, z, s) dx dc = ∫∫ p(x, c) p(z | x, c) p(s | x, c) dx dc.
The auxiliary distribution q(s | z) can be naturally defined by our k-means clustering module (Section 3.3.3); accordingly, we set q(s_{nk} = 1 | z_n) = s_{nk} as in Eq. (17). Note that the posterior p(z | x, c) is approximated by the VAE encoder q(z | x), constrained with the minimization of I(z, c), and usually one particle z_n is sampled from q(z | x) to reconstruct x_n (Kingma and Welling, 2014). Together with the given cluster assignment s_n ∼ p(s | x, c), the remaining step is valid because the value of q(s_{nk} = 1 | z_n) approaches zero for all k except the one corresponding to the smallest distance (Kulis and Jordan, 2012). Finally, H(s) can be ignored since it is a constant. This completes the proof.
From Corollary 1, we conclude that the optimization for Θ given e is to learn a clustering-favorable representation, which is invariant to the confounding factor c.
Remark 1 (Continuous/incomplete confounding factor). (1) Our method and theoretical analysis are applicable to continuous confounding factors as well, as they do not specify the exact form of the confounding factor. We conduct experiments demonstrating the efficacy of SCAB on a continuous confounding factor in Section 4. (2) For a known confounding factor without ready-to-use annotations, we additionally collect a small amount of supervision for it to avoid excessive manual cost. Then, we can solve the problem in a semi-supervised manner, which is explored in Section 4.4.

Experiments
Datasets. We conduct experiments on six image datasets (UCI-Face, Rotated Fashion, MNIST-USPS, Office-31, CIFAR10-C, Rotated Fashion-Con) and one signal-vector dataset (HAR) containing confounding factors that would bias the clustering results (see Table 1). In particular, Rotated Fashion is constructed by introducing a rotation factor into the Fashion-MNIST dataset. Specifically, for simplicity we select images from six cloth categories, i.e., "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat" and "Shirt". We first randomly sample 1,000 images from each of the six classes (zero degrees). Then, each image is augmented with four views rotated by 72, 144, 216, and 288 degrees, respectively. For Office-31, we select samples from Amazon and Webcam as training data following Li et al (2020). Rotated Fashion-Con is constructed similarly, but the rotation angle is drawn from a continuous range of 0 to 60 degrees. For CIFAR10-C, we consider one corruption from each main category, namely, frost, Gaussian blur, impulse noise, and elastic transform, for simplicity.
Implementations. We employ the AE architecture described in Xie et al (2016) for all datasets. The encoder is a fully connected multi-layer perceptron (MLP) with dimensions D-500-500-2000-d, where D is the input dimension and d is the dimension of the centroids, set to 10 for all datasets. All layers use ReLU activation except the last. The decoder mirrors the encoder. Compared with AE-based clustering methods (Xie et al, 2016; Guo et al, 2017), our SCAB introduces only one extra linear layer for Eq. (18), which brings negligible overhead in network parameters. We apply SCAB to raw data for UCI-Face, Rotated Fashion, MNIST-USPS, HAR and Rotated Fashion-Con, considering their simplicity. Inspired by recent state-of-the-art (SOTA) clustering methods (Tsai et al, 2021; Niu et al, 2022), which rely on structured representations to achieve superior performance on complex datasets, we apply SCAB to extracted features for Office-31 and CIFAR10-C, considering their complexity. We use an ImageNet-pretrained ResNet50 to extract features for Office-31, following the SOTA clustering method on Office-31 (Li et al, 2020), and MoCo (He et al, 2020) to extract features for CIFAR10-C, following the SOTA clustering method on CIFAR10-C (Niu et al, 2022). Note that these feature extractors do not utilize any supervision regarding the datasets. We adopt the Adam optimizer. The default learning rate, number of training epochs, and batch size are 5e-4, 1,000, and 256, respectively.

Baselines. The method that removes the confounding factor in the raw space via linear projection, i.e., RUV (Jacob et al, 2016) (Eq. (3), Eq. (4)), is included as our first baseline. Further, we extend RUV to eliminate the confounding factor in the latent space. In particular, we first train an AE to obtain the latent representations for UCI-Face, Rotated Fashion, MNIST-USPS and HAR, and use the extracted features described above as the representations for Office-31 and CIFAR10-C. Then, we apply RUV to remove the bias information from the representations. We name these two baselines RUV_x and RUV_z, respectively. We also consider the Iterative Spectral Method (ISM) (Wu et al, 2019) and Deep Fair Clustering (DFC) (Li et al, 2020) as baselines, since these two methods can be deemed to share the same objective as ours (Eq. (2)). We do not compare with other fair clustering methods since they have goals different from our setting (see Section 2.3). For a fair comparison, we take raw images of UCI-Face, Rotated Fashion and MNIST-USPS and extracted features of Office-31 and CIFAR10-C as input for all the baselines except RUV_x, which takes raw data as input. ISM, DFC and RUV are designed for a discrete confounding factor and cannot be applied to a continuous one, so they are not run on Rotated Fashion-Con.
Metrics. We evaluate the clustering methods with three widely-used clustering metrics: accuracy (ACC), normalized mutual information (NMI) and Adjusted Rand Index (ARI). All three metrics range between 0 and 1, and higher values indicate better performance.
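As a reference for ACC: predicted cluster indices are arbitrary, so accuracy is computed under the best one-to-one relabeling of clusters. A brute-force sketch over permutations (fine for small K; practical implementations use the Hungarian algorithm instead):

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Clustering ACC: accuracy under the best one-to-one mapping from
    predicted cluster indices to ground-truth labels. Brute force over
    label permutations; the Hungarian algorithm scales better."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        acc = np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)])
        best = max(best, acc)
    return best
```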

Performance comparison
Quantitative results of our SCAB and the baselines that can remove the confounding factor are summarized in Table 2. It shows that: (1) SCAB obtains superior results on all datasets, because it adopts an effective non-linear dependence measure. (3) The latent space is better than the raw space, and non-linear correlation is better than linear correlation. RUV_z achieves better performance than RUV_x, which shows that removing the confounding factor in the latent space is more effective than in the raw space. RUV_z obtains worse results than our SCAB on four datasets, since RUV_z simply adopts a linear projection and relies heavily on representations extracted beforehand, and so cannot deal with complex datasets where the desired clustering factor and the confounding factor are coupled non-trivially in the latent space. (4) DFC, originally designed for two categories, degenerates on datasets with more categories (i.e., UCI-Face, Rotated Fashion, and CIFAR10-C). On one hand, more categories may increase the difficulty of adversarial training, making it unable to effectively remove the confounding factor. On the other hand, its constraint requires training a DEC (Xie et al, 2016) for each category of data; for example, it needs to train a DEC on around 93 images for UCI-Face, which suffers from insufficient training samples. (5) ISM cannot be executed on large-scale datasets, i.e., Rotated Fashion, MNIST-USPS and CIFAR10-C. ISM requires a memory complexity of O(N²) and needs to store a data matrix larger than 10k × 10k for these datasets, which is beyond our computing capacity.

Efficacy of removing the confounding factor for clustering
To demonstrate the gain from removing the confounding factor, we compare with standard clustering methods, i.e., k-means (Bishop, 2006), IDEC (Guo et al, 2017), PICA (Huang et al, 2020) and SPICE (Niu et al, 2022), in Table 3 and Table 4. We apply PICA and SPICE only on Office-31 and CIFAR10-C, considering that they were proposed for complex image datasets. For a fair comparison, we take raw images of UCI-Face, Rotated Fashion, MNIST-USPS, HAR and Rotated Fashion-Con and extracted features of Office-31 and CIFAR10-C as input for all methods except PICA. PICA takes raw images of all datasets as input, since it needs to conduct image augmentations for partition confidence maximization (Huang et al, 2020).

Improved by removing the confounding factor
Table 3 and Table 4 show that, compared with standard clustering methods, our SCAB achieves superior performance on all datasets. This verifies the claim that SCAB, which explicitly removes the influence of the confounding factor, performs better than standard clustering methods. Note that PICA obtains poor results since it conducts clustering on raw features (k-means on MoCo-extracted features achieves better results than PICA on raw features, as also reported in Tsai et al (2021)). SPICE performs worse than IDEC because it applies a discriminative model for clustering, which is more vulnerable to the confounding factor than the AE-based IDEC.

Invariant representations
To further illustrate the effectiveness of removing the confounding factor, we visualize the latent representations and the clustering centroids of our SCAB and IDEC (i.e., standard clustering that ignores the confounding factor) on Rotated Fashion, respectively. From the t-SNE visualization of our SCAB (the first row of Fig. 3), we can see

Disentangled centroid reconstruction
We can reconstruct the centroids conditioned on the confounding factor for SCAB. Fig. 4 shows that: (1) the latent embedding z and the confounding factor c are well disentangled; in particular, the information of the confounding factor is well captured by c. (2) The centroids capture clear structures, i.e., the rotation angles for Rotated Fashion, the pose angle for UCI-Face, and the digit type for MNIST-USPS, respectively. We do not reconstruct the centroids on Office-31 and CIFAR10-C since the extracted features are used as model input on these datasets.

Ablation study
We study the effectiveness of each module by excluding it from our SCAB (Fig. 2). Table 5 shows that: (1) our full SCAB gets the best results, which justifies the necessity of each module. (2) Without the disentanglement module that removes the confounding factor via mutual information, the clustering performance drops significantly since the confounding factor distracts the desired clustering. (3) A poor clustering structure is obtained without the clustering module because the model fails to derive clustering-friendly representations. (4) The clustering performance degrades further when both the clustering module and the disentanglement module are excluded.
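The disentanglement module rests on measuring non-linear dependence between the representation and the confounding factor. SCAB itself minimizes mutual information inside a VAE; as a minimal, self-contained illustration of what a kernel-based non-linear dependence score looks like, the sketch below computes an empirical HSIC (a different measure than the one SCAB uses, shown only to make the idea concrete; the toy data and bandwidth are assumptions):

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    # Pairwise squared distances, then a Gaussian (RBF) kernel matrix.
    sq = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: larger values indicate stronger dependence."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    k, l = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

rng = np.random.default_rng(0)
c = rng.standard_normal((200, 1))                 # toy confounding factor
# A "biased" representation that non-linearly encodes c, vs. a sanitized one.
z_biased = np.hstack([np.sin(3 * c), rng.standard_normal((200, 1))])
z_clean = rng.standard_normal((200, 2))           # independent of c

print(hsic(z_biased, c), hsic(z_clean, c))
```

A disentanglement objective would drive the first score down toward the second; note the dependence on sin(3c) is invisible to a purely linear correlation measure, which motivates non-linear measures for complex data.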

Extension to the incomplete confounding factor
We explore the performance of SCAB given different amounts of labeled data w.r.t. the confounding factor on Rotated Fashion. To apply SCAB in this semi-supervised setting, we first train a classifier on the labeled data and use it to predict labels for the remaining unlabeled data. SCAB is then applied to the resulting fully-labeled data as usual. In particular, we employ a convolutional neural network classifier. IDEC, which does not remove the confounding factor, is adopted as the baseline under the same setting as SCAB.
We plot the test accuracy of the classifier (calculated on the remaining unlabeled data) and the clustering performance (ACC and NMI) of SCAB in Fig. 6, with the percentage of labeled data ranging from 0.1% to 100%. It shows that: (1) compared to IDEC, which ignores the confounding factor, our SCAB improves the clustering performance even with a very small amount of labeled data.
(2) When less than 0.5% of the data is labeled, the test accuracy of the classifier is low (below 0.5). Accordingly, the results of SCAB are relatively poor since more than 50% of the samples are assigned wrong labels. (3) When more than 1% of the data is labeled, more than 50% of the samples are assigned true labels. Though the percentage of label noise is still very high, SCAB performs well since the correct labels dominate and the structured representations are robust to label noise. In conclusion, our SCAB works well even given a small amount of labeled data regarding the confounding factor.
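The label-then-cluster pipeline above can be sketched in a few lines. The snippet below uses synthetic data and a nearest-centroid classifier as a minimal stand-in (the paper uses a convolutional neural network on Rotated Fashion; the data, separation, and labeled fraction here are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the confounding factor: two "sources" with shifted features.
n = 2000
labels = rng.integers(0, 2, n)                           # true confounder labels
x = rng.standard_normal((n, 5)) + 4.0 * labels[:, None]  # well-separated sources

# Step 1: only a small fraction of confounder labels is available.
frac = 0.01
idx = rng.choice(n, int(frac * n), replace=False)

# Step 2: fit a simple classifier on the labeled subset
# (nearest-centroid here; SCAB uses a CNN classifier).
centroids = np.stack([x[idx][labels[idx] == k].mean(0) for k in (0, 1)])

# Step 3: predict confounder labels for all samples, so SCAB can then be
# run on fully (pseudo-)labeled data as in the fully supervised case.
pred = np.argmin(((x[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
acc = (pred == labels).mean()
print(f"pseudo-label accuracy: {acc:.2f}")
```

As in the experiments, the downstream quality hinges on this pseudo-label accuracy: once correct labels dominate, residual label noise is tolerable.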

Conclusion
We have introduced SCAB, a general framework for a new stream of clustering that aims to deliver clustering results invariant to a pre-designated confounding factor. SCAB is the first deep clustering framework that can eliminate the confounding factor in the semantic latent space of complex data via a non-linear dependence measure, with theoretical guarantees. We have demonstrated the efficacy of SCAB on various datasets using label indicators of the confounding factor. In the future, we can extend SCAB to more types of data, e.g., text or time-series data. In addition, while this study focuses on sanitized clustering given a known confounding factor with (partially) labeled supervision, it would be interesting to explore clustering with unindicated confounding factors. Last, theoretical analysis of a confounding factor that is not fully observed is also a potential direction.

Fig. 1
Fig. 1 Motivation: raw images contain two important factors: gender and glasses. Suppose the clusters with green solid lines are the desired clustering results, where the partitions are based on the gender factor only. Standard clustering algorithms that neglect the unwanted factor obtain clusters distracted by the glasses factor, denoted by red dashed lines. Notably, the face with the warning sign (a man wearing glasses) is incorrectly grouped.

Proof.
Based on the definition of mutual information, we have
$$ I(z, s) = \int p(z, s) \log \frac{p(z, s)}{p(z)\,p(s)} \, dz \, ds = \int p(z, s) \log \frac{p(s \mid z)}{p(s)} \, dz \, ds. $$
Assume $p(x, c, z, s) = p(x, c)\, p(z \mid x, c)\, p(s \mid x, c, z) = p(x, c)\, p(z \mid x, c)\, p(s \mid x, c)$, where $p(s \mid x, c, z) = p(s \mid x, c)$ follows from the conditional independence. Since
$$ p(s \mid z) = \int p(x, c, s \mid z) \, dx \, dc = \int \frac{p(z \mid x, c)\, p(x, c)}{p(z)}\, p(s \mid x, c) \, dx \, dc, $$

Fig. 3
Fig. 3 t-SNE on latent representations and clustering centroids from SCAB (1st row) and IDEC (2nd row) on Rotated Fashion, respectively. The big grey dots are the centroids. The small dots are the representations, whose colors denote the ground-truth category labels.
Fig. 4
Fig. 4 Centroids' reconstructions of SCAB on Rotated Fashion, UCI-Face, and MNIST-USPS, respectively. Each column is conditioned on the same clustering centroid. Each row is conditioned on different labels of the cloth-category factor, the identity factor, and the digit-source factor, respectively.

Fig. 5
Fig. 5 shows that: (1) IDEC does not have the ability to disentangle the confounding factor c from the latent space. (2) Its centroids do not capture all rotation angles in the dataset, as they are distracted by the cloth categories. For example, e_1 and e_2 represent a shirt and a trouser with the same angle, respectively.

Fig. 6
Fig. 6 Clustering performance of SCAB given partial labels w.r.t. the confounding factor on Rotated Fashion. "Classifier ACC" is the test accuracy of the classifier. The x-axis is the ratio of labeled data.

Table 1
Statistics of the datasets. K denotes the number of clusters. G denotes the number of categories or the range of values.

Table 2
SCAB compared with baselines that can remove the confounding factor w.r.t. ACC (↑), NMI (↑), and ARI (↑). The best results are highlighted in bold. The second-best results are underlined.

Table 3
SCAB compared with standard clustering methods w.r.t. ACC (↑), NMI (↑), and ARI (↑) on four simple image datasets and one signal-vector dataset.

Table 5
Ablation study of SCAB on Rotated Fashion."Clu" means the clustering module."Dis" means the disentanglement module.