1 Introduction

Cluster Analysis can be defined as a set of techniques for finding homogeneous subsets (clusters) in a given dataset. The clusters should be homogeneous, in the sense that the elements within each subset should be very similar to each other, whereas elements belonging to different clusters should be clearly different. In other words, within-cluster similarity should be high and between-cluster similarity should be low.

Different induction principles lead to a variety of clustering techniques. According to Fraley and Raftery (1998), clustering techniques can be classified into hierarchical and partitioning methods, whereas Han et al. (2012) classify them into density-based, model-based and grid-based methods.

There is an extensive literature on this subject. Among the most frequently used algorithms for cluster analysis, we can mention the CURE and ROCK algorithms (Guha et al. 2000, 2001), the K-Modes algorithm (Huang 1997a, b, 1998, 2009), the K-Prototypes algorithm (Huang 2005; Ji et al. 2020), the K-Means algorithm (McQueen 1967), the DBSCAN algorithm (Pietrzykowski 2017; Zhu et al. 2013) and the IDCUP algorithm (Altaf et al. 2020).

A data set can be considered as a matrix where the rows are the observations, individuals or elements, and the columns are the features, attributes or traits associated with these elements. Many well-known clustering algorithms, such as the popular K-Means, only work with numerical datasets, where all the attributes are numerically measured. K-Means (Forgy 1965; McQueen 1967) represents each cluster by its average value (center of gravity) and assigns every element to its nearest cluster; the algorithm then recalculates the centers of gravity and reallocates the elements to the new clusters. These steps are repeated until no more changes are observed or a maximum number of iterations is reached.

In some datasets, however, we find categorical data, with non-numerical attributes, and K-Means no longer works. The widely used K-Modes algorithm (Huang 1997a, b, 1998) is based on similar ideas and is adapted to work with categorical data. Instead of centers of gravity and Euclidean distances, K-Modes uses “centroids” defined from the modes of the categorical attributes, and measures of “dissimilarity” to quantify the distances between them.
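
To make the contrast with K-Means concrete, the following minimal Python sketch (our illustration, not Huang's reference implementation; all names are ours) shows the two ingredients K-Modes relies on, the simple-matching dissimilarity and the mode-based centroid, embedded in the usual assign-and-update loop described above:

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    # Simple-matching dissimilarity: number of attributes on which x and y differ
    return sum(a != b for a, b in zip(x, y))

def mode_centroid(cluster):
    # Mode-based "centroid": the most frequent value of each attribute in the cluster
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(rows, k, max_iter=100, seed=0):
    # Toy K-Modes-style loop: assign each object to the nearest mode, recompute modes, repeat
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)                      # random initial "seeds"
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for row in rows:
            j = min(range(k), key=lambda c: matching_dissimilarity(row, centroids[c]))
            clusters[j].append(row)
        new_centroids = [mode_centroid(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:                   # stop when nothing changes
            break
        centroids = new_centroids
    return clusters, centroids

rows = [("A", "C"), ("A", "D"), ("B", "E"), ("B", "E"), ("A", "C")]
clusters, centroids = k_modes(rows, k=2)   # final clusters and their modes
```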

The final results given by both K-Means and K-Modes often depend on the selection of the initial “seeds”. Since this selection process usually involves some randomization scheme, the results can be unstable, i.e., running an algorithm several times on the same dataset can lead to different final allocations. Some solutions for this problem have been suggested in the literature: see Ahmad and Dey (2007a, b), Cao et al. (2009), Gan et al. (2005), Jiang et al. (2016), Khan and Ahmad (2012, 2013, 2015), Ng and Wong (2002), Sajidha et al. (2018), Santos and Heras (2020). It is also worth mentioning K-Means++ (Arthur and Vassilvitskii 2007), an important variation of K-Means that improves both the running time of Lloyd’s algorithm and the quality of the final solution; it is implemented in most numerical packages, e.g. scikit-learn or MATLAB.

Besides the classification efficiency and the stability of the results, a new problem has received a lot of attention in recent years. Classification algorithms are increasingly applied to many important economic and social problems, such as prediction of criminal behaviour, screening of job applicants, mortgage approvals, marketing research or insurance rating, among many others. Human supervision of many decision-making processes is progressively being replaced by automated data analysis, and there is a growing concern in our societies about the lack of human control over the outcomes.

For instance, an important potential problem is that the output of the algorithms could unreasonably harm or benefit some groups of people who share sensitive attributes related to gender, race, religion, social status, etc. These discrimination problems are often unintended, arising from the complexity of the algorithmic processing of huge amounts of data. As a consequence, the need to prevent classification biases related to sensitive attributes has increased the interest in designing fair clustering algorithms. “Fairness” in this context means ensuring that the outputs of the algorithms are not biased towards or against specific subgroups of the population.

The literature on the issue of fair clustering is extensive: see, among others, Abraham et al. (2020), Chierichetti et al. (2017), Chen et al. (2019), Esmaeili et al. (2020), Kleindessner et al. (2019) and Ziko et al. (2019). However, all these papers have studied the numerical case. To our knowledge, there are no previous papers devoted to the problem of fair clustering of purely categorical datasets.

In this paper, we put forward a modification of the Multicluster methodology proposed by Santos and Heras (2020) for clustering categorical data, in order to reach a compromise between fairness and classification efficiency. As we shall see, the output of the proposed algorithm combines total stability with a high degree of fairness and efficiency.

The outline of the paper is as follows: in the next section (Fair clustering of categorical data) we give a brief description of the main ideas of the paper. In the Methods section we explain how the fair clustering algorithm operates. In the Experimental results section, several well-known real databases are used to illustrate the application of the methodology, showing good results in terms of clustering efficiency and fairness. Concluding remarks are presented in the last two sections (Discussion, and Conclusions and future work).

2 Fair clustering of categorical data

Santos and Heras (2020) have proposed a new methodology for clustering categorical data, based on the so-called “multiclusters”. Each multicluster is associated with a non-empty combination of values of the attributes of the data set, so that the objects belonging to it show a total coincidence in the values of their attributes. However, since the number of multiclusters may be excessive, it often has to be reduced in order to reach the desired (usually small) number of final clusters. For this purpose, the algorithm takes the biggest clusters as “seeds” and merges the smaller clusters with them, taking into account the similarities between their attributes. In this way, multiclusters showing a great number of coincidences between their attributes are eventually tied together, giving rise to larger clusters that share many (but not all) of their attribute values. The process ends when the desired number of final clusters is reached.

In this paper we show that this clustering algorithm for categorical data can easily be adapted to obtain not only efficient but also fair clusters. Following previous works on fair clustering for numerical data (Chierichetti et al. 2017), we assume the existence of a protected attribute in the database, such as gender or ethnicity.

Under the legal doctrine of Disparate Impact, a decision-making process is considered discriminatory or unfair if it has a disproportionately adverse impact on the protected classes (Barocas and Selbst 2016). Unlike the doctrine of Disparate Treatment, Disparate Impact is not concerned with intent or motivations; it focuses only on the outcomes. Under this doctrine, a clustering algorithm is fair if it leads to a set of fair clusters, and a cluster is fair if it has a proper representation of the values of the protected attribute: for instance, 50% males and 50% females, if the protected attribute is Gender. Notice, however, that the desired proportions of the values of that attribute are not necessarily identical in all cases: if the gender proportions in the dataset are highly unbalanced, forcing an equal representation of males and females in the final clusters may lead to unreasonable proportions of other attributes. For this reason, the desired proportions can also be defined as the proportions of the protected attribute in the dataset. If the gender ratio in the dataset is, for instance, 30–70%, then it should be the same or quite similar in the final clusters.

The Multicluster algorithm can be modified in order to increase the fairness of the obtained clusters. Of course, there is a trade-off between fairness and efficiency, so if we want to increase fairness we have to give up some classification efficiency. Yet it is possible to reach a reasonable compromise between these goals. The idea is to add a new step to the algorithm, in which two clusters are linked when the distribution of the protected attribute after linking them is closest to the desired distribution. This procedure is repeated until the desired number of clusters is reached.

3 Methods

In this section we explain how the “Fair Multicluster” algorithm for categorical data works. We assume the existence of a protected attribute in the data set, together with desired proportions for its values. The goal of the algorithm is to split the whole dataset into a set of homogeneous and fair clusters: homogeneous, because each of them must contain only similar observations; and fair, because the proportions of the values of the protected attribute must be close to the desired proportions.

The algorithm works as follows:

Step 1

1. We identify the clusters for each single attribute, given by its different categorical values. For example, if a given attribute only has two values A and B, these are also the clusters associated with that attribute.

2. We merge all the possible single-attribute clusters in order to get the initial set of “Multiclusters”. For example, if there are only two attributes with values A, B and C, D, E, respectively, then there will be six “Multiclusters”: AC, AD, AE, BC, BD and BE. Notice that all the elements belonging to a given Multicluster show a total coincidence of the values of their attributes. This initial set of Multiclusters gives us the maximum number of clusters, which may be large. However, in real examples many of them are usually empty, so the number of non-empty Multiclusters is much smaller (see the sketch after this step).
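
As a minimal sketch of Step 1 (illustrative only, not the authors' code), the non-empty multiclusters are simply the groups of observations that coincide on every attribute:

```python
from collections import defaultdict

def build_multiclusters(rows):
    # Step 1 (sketch): group the observations that coincide on ALL categorical attributes.
    # Only combinations that actually occur in the data are created, so the empty
    # multiclusters mentioned above never have to be enumerated.
    multiclusters = defaultdict(list)
    for idx, row in enumerate(rows):
        multiclusters[tuple(row)].append(idx)
    return dict(multiclusters)

rows = [("A", "C"), ("A", "D"), ("A", "C"), ("B", "E")]
print(build_multiclusters(rows))
# {('A', 'C'): [0, 2], ('A', 'D'): [1], ('B', 'E'): [3]}
```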

Step 2

1. For every pair of clusters, we compute the number of coincidences between their attributes. For example, the number of coincidences between AC and AD is one (A), and the number of coincidences between AC and BD is zero. This information is shown in the so-called Coincidence Matrix (see the sketch after this step).

2. For every row of the Coincidence Matrix obtained before, we select the column with the highest number of coincidences and merge the respective Multiclusters. The elements belonging to these new and bigger clusters share many (but not all) attributes. When two or more columns could be selected, we break the tie by means of Fleiss’ Kappa coefficient (Fleiss et al. 1969, 2003; Fleiss 1971), a widely used measure of the degree of similarity between objects with categorical attributes. Notice that, if we have already compared cluster “A” and cluster “B”, we do not need to compare cluster “B” and cluster “A” again. For this reason, in this procedure we only need to work with the upper triangle of the matrix.
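
A minimal sketch of the Coincidence Matrix of Step 2, reproducing the small example above (the tie-breaking by Fleiss’ Kappa is omitted here; all names are ours):

```python
import numpy as np

def coincidence_matrix(keys):
    # Step 2 (sketch): number of attribute values shared by each pair of multiclusters.
    # Only the upper triangle is filled, since the matrix is symmetric.
    n = len(keys)
    C = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = sum(a == b for a, b in zip(keys[i], keys[j]))
    return C

keys = [("A", "C"), ("A", "D"), ("B", "D")]   # multiclusters of the example above
print(coincidence_matrix(keys))
# [[0 1 0]
#  [0 0 1]
#  [0 0 0]]
```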

Step 3

1. We form a table with the optimal clusters obtained in the previous step, ranked in increasing order according to their size. For every row (cluster) of the table, we link it with another row (cluster) of the same table such that the resulting proportions of the values of the protected attribute are the closest to the desired proportions (see the sketch after this step). This way, we obtain a new set of bigger clusters with a distribution of the protected attribute closer to the desired distribution.

2. We repeat the previous step until the predefined number of desired clusters is reached. The output of the algorithm is a set of clusters with a high degree of homogeneity and fairness.
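
A minimal sketch of the fairness-driven linking criterion of Step 3 (illustrative only; here each cluster is represented simply by the list of protected-attribute values of its members, and all names are ours):

```python
import numpy as np

def best_receiver(transmitter, receivers, desired, values):
    # Step 3 (sketch): pick the receiver whose union with the transmitter has a
    # protected-attribute distribution closest (Euclidean distance) to the desired one
    desired = np.asarray(desired, dtype=float)
    best_j, best_dist = None, np.inf
    for j, receiver in enumerate(receivers):
        merged = transmitter + receiver
        observed = np.array([merged.count(v) for v in values], dtype=float) / len(merged)
        dist = np.linalg.norm(observed - desired)
        if dist < best_dist:
            best_j, best_dist = j, dist
    return best_j, best_dist

# Desired Gender ratio of 30% F / 70% M, as in the German Credit example below
transmitter = ["M", "M"]
receivers = [["F", "M", "M"], ["M", "M", "M"]]
print(best_receiver(transmitter, receivers, desired=[0.3, 0.7], values=["F", "M"]))
# (0, 0.1414...): merging with the first receiver gives ratios (0.2, 0.8)
```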

To illustrate the methodology, the "German Credit" database from the UCI Machine Learning Repository (Dua and Graff 2019) has been used as an unsupervised dataset; we work with a random sample of 20 observations and 9 categorical attributes, which we show in Table 1.

Table 1 A sample of 20 observations from the German Credit dataset

The first step of the algorithm is the calculation of the clusters for every single attribute, which correspond to its different values. Table 2 shows the distribution of clusters for each attribute, obtained in Step 1.1. We choose Gender as the protected attribute, with two values, Male (M) and Female (F). To ensure the reproducibility of the analysis, we rank the values of the attributes in increasing order according to their size.

Table 2 Cluster distribution of the attributes

In Step 1.2, we combine all the possible single-attribute clusters in order to get the initial set of multiclusters. The maximum number of multiclusters obtained this way can be very high, since it is the product of the numbers of clusters of all the attributes of the dataset. In our case, the maximum number of multiclusters is \(4 \times 4 \times 7 \times 5 \times 5 \times 4 \times 2 \times 3 \times 4 \times 4 \times 3 \times 3 \times 3 \times 4 \times 2 \times 2 \times 1 = 464{,}486{,}400\). However, almost all of them are empty: there are only 20 non-empty multiclusters, which are shown in Table 3.

Table 3 20 nonempty multiclusters

In Table 3, each multicluster contains one single observation. To identify the multiclusters, we use the numbers associated with the values of the attributes in Table 2. Notice that the variables in Table 1 carry the original labels given in the German_Credit dataset: for instance, the values of the attributes of the first observation (775) are labeled as A13 (for the attribute “Status Account”), A34 (“Credit History”), A40 (“Purpose”), etc. To simplify the notation, in Table 2 these labels are replaced by numbers: according to Table 2, A13 becomes “1”, A34 becomes “3”, A40 becomes “7”, etc. In Table 3 we label each Multicluster with the numeric values attached to the values of its attributes in Table 2. Following this rule, the Multicluster containing observation 775, for example, is labeled as “1372232343111122”.

According to the information given in Table 3, we build the Coincidence Matrix (Table 4):

Table 4 Coincidence Matrix between multiclusters

In order to reduce the number of clusters, we merge those Multiclusters that share the highest number of attribute values. For each row, when there is only one column showing the highest number of coincidences, we merge the corresponding clusters, that is, the clusters associated with that row and with the column containing the highest value. This is the situation shown in Table 5, built from the second row of the Coincidence Matrix: in this case, the Multiclusters 1445522341323422 and 4445221241323422 should be merged, because they share the values of 12 attributes.

Table 5 An example of multicluster association with only one coincidence

When there are several columns with the highest value, we break the tie by means of Fleiss’ Kappa coefficient (Fleiss et al. 1969, 2003; Fleiss 1971). For example, in Table 6, built from the first row of the Coincidence Matrix, we find five columns with 6 coincidences. In this case, the Multiclusters 1372232343111122 and 3465332323313321 should be merged, because they give the highest value of the Fleiss’ Kappa coefficient (0.957525773195876). If several Multiclusters share the same highest Kappa concordance value, the first of them (in top-down row order) is selected.

Table 6 An example of multicluster association with more than one coincidence

Once the process explained before has been executed for all rows included in the Coincidence Matrix (Table 4), we obtain the Optimal Multiclusters Table (Table 7), with 11 nonempty optimal Multiclusters. Of course, the final number of clusters could be less than 11, if desired. Further details about Step 2 of the algorithm can be found in Santos and Heras (2020). Notice that the clusters with equal frequency in Table 7 are lexicographically sorted.

Table 7 Optimal multiclusters table

In the last step of the algorithm (Step 3), we focus on the fairness objective. We have chosen as the desired ratio of the attribute Gender the initial proportions of its values in the dataset: 30% (Female) and 70% (Male). Then, for every row (multicluster) of Table 7, beginning with the first one, we calculate the (Euclidean) distance between the observed ratios of the protected attribute and the desired ratios after joining it to each of the following rows (multiclusters). That is, for every row (the “Transmitter” multicluster), we take each one of the following rows (the “Receiver” multicluster), join the elements of both the “Transmitter” and the “Receiver” multiclusters to form a bigger cluster, and calculate the (Euclidean) distance between the ratios of the protected attribute in this new bigger cluster and the desired ratios (30%, 70%). The process is repeated with all the following rows, and we finally join the pair of rows (multiclusters) for which the ratios of the new cluster are closest to the desired ratios.

For example, taking the second row in Table 7 (3465242333313320) as Transmitter, the minimum distance to the desired ratios (0.0471) is reached by joining it to the Receiver multicluster located in the eighth row (4435442342131410): see Table 8 for the details. After joining both rows into a new table, the procedure is repeated until the predetermined number of clusters (k) is reached.

Table 8 An example of calculation of the distances between a multicluster transmitter (2nd row of Table 7) and the receivers

Table 9 shows the distribution of the protected attribute with two final clusters (k = 2), with a total fairness ratio of 96%.

Table 9 Observed and desired cluster distributions

4 Experimental results

4.1 Datasets used for evaluation

Table 10 shows the categorical databases that are used for the evaluation of the clustering efficiency of the algorithm. In all cases there is a response variable, defined as the real cluster in which every observation is placed, which is known in advance but not used as an input of the algorithm. This omitted information can be used to evaluate the clustering efficiency, by comparing the real classification of the observations with that given by the algorithm (see, among others, Yu et al. (2018), and Zhu and Ma (2018)).

Table 10 The datasets used in the experimental analysis

As for the evaluation of the fairness of the classification, we measure the distance between the desired distribution of the protected attribute and its final distribution in the clusters given by the algorithm. In all the examples we choose the initial proportions of the values of the protected attribute in the data set as the desired proportions to be approached in the final clusters. In other words, the proportions of the values of the protected attribute in the final clusters (the output of the algorithm) should be close to their initial observed proportions in the whole data set. Of course, any alternative desired distribution could be selected.

4.2 Evaluation metrics

Many measures of the degree of similarity between different partitions of the same data set have been proposed in the literature: see, among others, Dom (2012), Headden et al. (2008), Meilă (2007), Reichart and Rappoport (2009), Rosenberg and Hirschberg (2007), Vinh et al. (2010), Wagner and Wagner (2007), Walker and Ringger (2008). We have selected four well-known measures of the similarity between two partitions P and R of a given data set. In our applications, P will be the output of the clustering algorithm, and R the “real” partition observed in the data set.

(I) “Fowlkes–Mallows index” (Fowlkes and Mallows 1983). High values of the Fowlkes–Mallows index indicate a great similarity between the two partitions. It is defined as:

$$ {\varvec{FMI}} = \sqrt {\frac{TP}{{TP + FP}} \cdot \frac{TP}{{TP + FN}}} $$

where

  • TP is the number of pairs of points that are in the same cluster in both P and R;

  • FP is the number of pairs of points that are in the same cluster in P but not in R;

  • FN is the number of pairs of points that are in the same cluster in R but not in P;

  • TN is the number of pairs of points that are in different clusters in both P and R.
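
In practice the index can be computed directly from the two label vectors; scikit-learn, for instance, provides an implementation based on these pair counts:

```python
from sklearn.metrics import fowlkes_mallows_score

# R: the "real" partition; P: the partition produced by the clustering algorithm
R = [0, 0, 0, 1, 1, 2]
P = [0, 0, 1, 1, 2, 2]
print(fowlkes_mallows_score(R, P))   # value in [0, 1]; 1 means identical partitions
```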

(II) “Maximum-Match Measure” (Meilă and Heckerman 2001) is defined as

$$ {\varvec{MMM}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{k} max_{j} m_{ij} $$

where \(m_{ij}\) is the number of observations belonging to both clusters \(P_{i}\) and \(R_{j}\), \(k\) is the number of clusters of P, and n is the total number of observations in the data set.
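
A minimal sketch of this measure, implementing the formula above from the contingency matrix of the two partitions (the helper name is ours, not a library function):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def maximum_match(labels_R, labels_P):
    # m[j, i] = number of observations in both P_i and R_j; for each cluster P_i we
    # take its largest overlap with any R_j, sum over i and divide by n
    m = contingency_matrix(labels_R, labels_P)
    return m.max(axis=0).sum() / m.sum()

print(maximum_match([0, 0, 1, 1, 1, 2], [0, 0, 1, 1, 2, 2]))   # 5/6 ≈ 0.833
```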

(III) “Normalized Variation of Information Measure” (Reichart and Rappoport 2009) is a normalized version of the VI (Variation of Information) measure (Meilă 2007); it is defined as:

$$ {\varvec{NVI}} = \left\{ {\begin{array}{*{20}l} {\frac{{H\left( {P{|}R} \right) + H(R|P)}}{H\left( P \right)}} \hfill & {H\left( P \right) \ne 0} \hfill \\ {H\left( R \right)} \hfill & {H \left( P \right) = 0} \hfill \\ \end{array} } \right. $$

where H(P) and H(R) are the entropies of the partitions P and R, and H(P|R) and H(R|P) are their conditional entropies.
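
A minimal sketch of this piecewise definition, computed (with natural logarithms) from the contingency matrix of the two partitions; the helper name is ours:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def nvi(labels_R, labels_P):
    # Normalized Variation of Information: (H(P|R) + H(R|P)) / H(P), or H(R) if H(P) = 0
    m = contingency_matrix(labels_R, labels_P).astype(float)
    p_joint = m / m.sum()
    p_R = p_joint.sum(axis=1, keepdims=True)   # marginal distribution of R (rows)
    p_P = p_joint.sum(axis=0, keepdims=True)   # marginal distribution of P (columns)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    nz = p_joint > 0
    H_P_given_R = -np.sum(p_joint[nz] * np.log((p_joint / p_R)[nz]))
    H_R_given_P = -np.sum(p_joint[nz] * np.log((p_joint / p_P)[nz]))
    H_P, H_R = entropy(p_P.ravel()), entropy(p_R.ravel())
    return (H_P_given_R + H_R_given_P) / H_P if H_P > 0 else H_R

print(nvi([0, 0, 1, 1], [0, 0, 1, 1]))   # 0.0 for identical partitions
```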

(IV) “Overlap coefficient” (Vijaymeena and Kavitha 2016), also known as the Szymkiewicz–Simpson coefficient, is a similarity measure based on the concept of overlap between sets. Given two finite sets P and R, the overlap between them is defined as the size of their intersection divided by the size of the smaller of the two sets:

$$ {\varvec{OI}} = \frac{{\left| {P \cap R} \right|}}{{{\text{min}}\left( {\left| P \right|,\left| R \right|} \right)}} $$

For the evaluation of fairness, we use a fairness ratio based on the Euclidean distance between the desired distribution of the protected attribute and its final distribution in each of the clusters given by the algorithm:

$$ Fairness\;ratio = \frac{{\mathop \sum \nolimits_{i = 1}^{k} \left( {1 - euclidean\;distance\left( {Observed_{i} ;Desired} \right)} \right)}}{{number\;of\;clusters\;\left( k \right)}} $$
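
A minimal sketch of this fairness ratio (illustrative; it assumes the observed distribution of the protected attribute in each of the k final clusters has already been computed):

```python
import numpy as np

def fairness_ratio(observed_distributions, desired):
    # Average over the k final clusters of 1 - Euclidean distance between the observed
    # and the desired distributions of the protected attribute
    desired = np.asarray(desired, dtype=float)
    return float(np.mean([1.0 - np.linalg.norm(np.asarray(obs, dtype=float) - desired)
                          for obs in observed_distributions]))

# Two final clusters, desired Gender ratio (0.3, 0.7)
print(fairness_ratio([(0.25, 0.75), (0.35, 0.65)], (0.3, 0.7)))   # ≈ 0.93
```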

4.3 Performance results

Table 11 shows the clustering efficiency of the three algorithms (Multicluster, Fair-Multicluster and K-Modes) for the data sets in Table 10, measured by means of the Fowlkes–Mallows measure, the Maximum-Match measure, the Normalized Variation of Information measure and the Overlap measure. The best performances are shown in bold. We conclude that Multicluster and Fair-Multicluster outperform K-Modes in most cases.

Table 11 Comparison of classification efficiencies

To better understand the proposed fairness measure, we give a detailed calculation of its value for the “Human Resources FC2” data set. The elements of this dataset have 3 different values of the protected (fairness) attribute, Marital Status: Divorced (14%), Married (41%) and Single (45%). Therefore, the “desired” distribution of this attribute is (0.14, 0.41, 0.45).

Table 12 shows the final distributions of this attribute in each of the four clusters given by the Fair-Multicluster algorithm:

Table 12 Final clustering distribution of fair-multicluster algorithm

Then, we can calculate the fairness measure as the average, over the final clusters, of one minus the Euclidean distance between the observed and the desired distributions of the protected attribute:

$$ Fairness\;ratio = \frac{{\mathop \sum \nolimits_{i = 1}^{k} \left( {1 - euclidean\;distance\left( {Observed_{i} ;Desired} \right)} \right)}}{{number\;of\;clusters\;\left( k \right)}} = 0.981 \approx 98\% $$

Table 13 shows the Fairness measures of the final clusters given by the three algorithms. We conclude that, concerning the Fairness measure, Fair-Multicluster largely outperforms Multicluster and K-Modes in all cases.

Table 13 Comparison of fairness performance

On the basis of the results shown in Tables 11 and 13, we conclude that the proposed Fair-Multicluster algorithm shows an excellent performance in terms of the fairness measure (as expected), while at the same time outperforming the well-known K-Modes algorithm in terms of classification efficiency. We also observe that the Multicluster algorithm often obtains better results in terms of classification efficiency. Indeed, the figures in both tables allow comparing the performances of the Multicluster and Fair-Multicluster algorithms, thus giving a numerical evaluation of the trade-off between efficiency and fairness: considering, for instance, the CARS_Insurance dataset, the efficiency ratios are (FMI = 0.859, MMM = 0.966, NVI = 1.000, OI = 0.933) for the Multicluster algorithm and (three of them) decrease to (FMI = 0.732, MMM = 0.801, NVI = 1.000, OI = 0.789) for the Fair-Multicluster algorithm, while at the same time the fairness ratio increases from 0.667 to 0.99.

5 Discussion

The key ideas behind the Fair-Multicluster algorithm are easy to understand in intuitive terms. Perhaps the main contribution is the way it combines the initial multiclusters in order to reach a compromise between the opposing goals of clustering efficiency and fairness: on the one hand, Step 2 merges similar clusters, trying to get highly homogeneous clusters in the final classification; on the other hand, the merging of clusters in Step 3 looks for a fair distribution of the values of the protected attribute. In other words, repeating Step 2 increases the efficiency of the final cluster classification, while repeating Step 3 increases its fairness. Since efficiency and fairness often go in opposite directions (improving one of them usually worsens the other), we have to predefine some compromise between them. In practice, this compromise can be achieved by selecting the number of iterations of Step 2 before starting Step 3. Working with small and medium-sized databases, we have seen that a single repetition of Step 2 is usually enough to reach a reasonable level of efficiency. For example, working with the German Credit file, we obtained a significant improvement of the efficiency only after 5 iterations of Step 2, with the unfortunate consequence of a great loss of fairness. For this reason, in this paper we have worked with only one iteration of Step 2 in the German Credit example, obtaining good values of both efficiency and fairness. Nevertheless, when working with bigger files, it may be necessary to perform several experiments to find the optimal number of Step 2 iterations. Of course, this procedure can be very time-consuming, and the high computing time is perhaps the main drawback of the proposed Fair-Multicluster algorithm.

6 Conclusions and future work

Assuming the existence of a protected attribute such as race, gender or social status, in this paper we propose a clustering algorithm for finding homogeneous and fair clusters. The clusters should be homogeneous, that is, formed by similar elements, and should also be fair, not biased towards or against specific subgroups of the population. Of course, there is a trade-off between fairness and efficiency, so that an increase in the fairness objective usually leads to a loss of classification efficiency. Yet the so-called Fair-Multicluster algorithm reaches a reasonable compromise between these goals. This algorithm can be considered an adaptation of the Multicluster algorithm proposed by Santos and Heras (2020) for clustering categorical databases, which can be easily modified in order to obtain homogeneous and fair clusters.

The high performance of the Fair-Multicluster algorithm has been checked by comparing it with the Multicluster and the well-known K-Modes algorithms. Their classification efficiencies and fairness have been calculated on ten categorical databases, using four well-known measures of efficiency and a measure of fairness based on the distance between the final distribution of the protected attribute and its desired distribution. As for classification efficiency, Table 11 shows that both the Multicluster and the Fair-Multicluster algorithms outperform K-Modes in most cases. With respect to the fairness objective, Table 13 shows that the Fair-Multicluster algorithm achieves the highest performance in all cases, reaching scores close to 100% in many of them. Besides, unlike K-Modes, the output of the Fair-Multicluster algorithm is not affected by randomness. Replicability, classification efficiency and fairness are the major benefits of the proposed Fair-Multicluster algorithm.

Among the future developments of this methodology, we highlight its application to mixed data sets with both quantitative and qualitative attributes, and/or to data sets with more than one protected attribute.