1 Introduction

This work focusses on how to find good initialisations (starting solutions) for \(k\)-modes clustering—an extension to \(k\)-means clustering that permits the clustering of categorical (i.e. ordinal, nominal or otherwise discrete) data as set out in the seminal works by Huang (1997a, 1997b, 1998).

The novelty of the initialisation method proposed in our work is how we extend the method presented by Huang (1998) by using results from game theory to leverage the learning opportunities presented by the data being clustered. After the method presented by Huang, the next most commonly cited initialisation is that of Cao et al. (2009). This initialisation forms the basis of a number of other initialisation methods in which a notion of density is central. Huang’s and Cao’s methods are described fully in Sects. 1.2.1 and 1.2.2.

The broad idea of our contribution is to use the so-called hospital-resident assignment problem (HR) to produce a good initialisation (starting point) from which the \(k\)-modes algorithm begins, since it is well known that the performance of a clustering algorithm is affected by the quality of its initial solution. HR is described fully in Sect. 2 and can be summarised as the problem of matching residents to hospitals subject to constraints and preferences. In our setting, we consider HR as a problem of matching data to clusters where preference is dictated by the distance from a representative point of the cluster. Our hypothesis is that this could lead to a more ‘intelligent’ initialisation than random sampling (the fundamental approach of Huang in the above references) or selecting initial modes according to their density in the data (the fundamental approach of Cao).

Our paper concludes with an analysis of these methods against the proposed one, demonstrating that the proposed method is often able to outperform both in terms of clustering accuracy and computational speed. The aim of our analysis is twofold: first, to highlight the merits of the method in a familiar setting by clustering well-known benchmark datasets and, second, to provide a deeper insight into how the methods perform against one another by generating artificial datasets using the method set out in Wilde et al. (2019).

Our paper is structured as follows:

  • Sect. 1 introduces the \(k\)-modes algorithm and its established initialisation methods, namely those by Huang and Cao.

  • Sect. 2 provides a brief overview of matching games and their variants before describing our novel initialisation method using HR.

  • Sect. 3 presents analyses of the initialisations on benchmark datasets and on new, artificial datasets generated using the authors’ recent method presented in Wilde et al. (2019).

  • Sect. 4 concludes the paper and offers some discussion.

1.1 The \(k\)-modes algorithm

The following notation will be used:

  • Let \({\mathcal {A}}:= A_1 \times \cdots \times A_m\) denote the attribute space. In this work, only categorical attributes are considered, i.e. for each \(j \in \left\{ 1,2, \ldots , m \right\} \) it follows that \(A_j:= \left\{ a_1^{(j)}, \ldots , a_{d_j}^{(j)}\right\} \) where \(d_j = |A_j|\) is the size of the \(j^{th}\) attribute.

  • Let \({\mathcal {X}}:= \left\{ X^{(1)}, \ldots , X^{(N)}\right\} \subset {\mathcal {A}}\) denote a dataset where each \(X^{(i)} \in {\mathcal {X}}\) is defined as an \(m\)-tuple \(X^{(i)}:= \left( x_1^{(i)}, \ldots , x_m^{(i)}\right) \) where \(x_j^{(i)} \in A_j\) for each \(j \in \left\{ 1, 2, \ldots , m \right\} \). Elements of \({\mathcal {X}}\) are referred to as data points or instances.

  • Let \({\mathcal {Z}}:= \left( Z_1, \ldots , Z_k\right) \) be a partition of a dataset \({\mathcal {X}} \subset {\mathcal {A}}\) into \(k \in {\mathbb {Z}}^{+}\) distinct, non-empty parts. Such a partition \({\mathcal {Z}}\) is called a clustering of \({\mathcal {X}}\).

  • Each cluster \(Z_l\) has a mode (see Definition 2) denoted by \(z^{(l)} = \left( z_1^{(l)},~\ldots ,~z_m^{(l)}\right) \in {\mathcal {A}}\). These points are also referred to as representative points or centroids. The set of all current cluster modes is denoted as \({\overline{Z}} = \left\{ z^{(1)}, \ldots , z^{(k)}\right\} \).

Definition 1 describes a dissimilarity measure between categorical data points.

Definition 1

Let \({\mathcal {X}} \subset {\mathcal {A}}\) be a dataset and consider any \(X^{(a)}, X^{(b)} \in {\mathcal {X}}\). The dissimilarity between \(X^{(a)}\) and \(X^{(b)}\), denoted by \(d\left( X^{(a)}, X^{(b)}\right) \), is given by:

$$\begin{aligned} d\left( X^{(a)}, X^{(b)}\right) := \sum _{j=1}^{m} \delta \left( x_j^{(a)}, x_j^{(b)}\right) , \quad \text {where} \quad \delta \left( x, y\right) = {\left\{ \begin{array}{ll} 0, & \text {if } x = y, \\ 1, & \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)
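For concreteness, a minimal Python sketch of the measure in Eq. (1) is given below. The function names and the representation of data points as tuples of category labels are illustrative assumptions rather than part of the original formulation.

```python
def delta(x, y):
    """Indicator of mismatch between two categorical values, as in Eq. (1)."""
    return 0 if x == y else 1


def dissimilarity(a, b):
    """Simple matching dissimilarity between two points with m attributes each."""
    return sum(delta(x, y) for x, y in zip(a, b))


# Example: two points over m = 3 categorical attributes differ in two of them.
assert dissimilarity(("red", "small", "round"), ("red", "large", "oval")) == 2
```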

With this metric, the notion of a representative point of a cluster is addressed. With numeric data and \(k\)-means, such a point is taken to be the mean of the points within the cluster. With categorical data, however, the mode is used as the measure for central tendency. This follows from the concept of dissimilarity in that the point that best represents (i.e. is closest to) those in a cluster is one with the most frequent attribute values of the points in the cluster. The following definitions and theorem formalise this and a method to find such a point.

Definition 2

Let \({\mathcal {X}} \subset {\mathcal {A}}\) be a dataset and consider some point \(z = \left( z_1, \ldots , z_m\right) \in {\mathcal {A}}\). Then \(z\) is called a mode of \({\mathcal {X}}\) if it minimises the following:

$$\begin{aligned} D\left( {\mathcal {X}}, z\right) = \sum _{i=1}^{N} d\left( X^{(i)}, z\right) . \end{aligned}$$
(2)

Definition 3

Let \({\mathcal {X}} \subset {\mathcal {A}}\) be a dataset. Then \(n\left( a_s^{(j)}\right) \) denotes the frequency of the \(s^{th}\) category \(a_s^{(j)}\) of \(A_j\) in \({\mathcal {X}}\), i.e. for each attribute \(A_j\) and each \(s = 1, \ldots , d_j\):

$$\begin{aligned} n\left( a_s^{(j)}\right) := \left| {\left\{ X^{(i)} \in {\mathcal {X}}: x_j^{(i)} = a_s^{(j)}\right\} } \right| . \end{aligned}$$
(3)

Furthermore, \(\frac{n\left( a_s^{(j)}\right) }{N}\) is called the relative frequency of category \(a_s^{(j)}\) in \({\mathcal {X}}\).

Theorem 1

(Huang (1998)) Consider a dataset \({\mathcal {X}} \subset {\mathcal {A}}\) and some \(U = (u_1, \ldots , u_m) \in {\mathcal {A}}\). Then \(D({\mathcal {X}}, U)\) is minimised if and only if \(n\left( u_j\right) \ge n\left( a_s^{(j)}\right) \) for all \(s=1, \ldots , d_j\), for each \(j = 1, \ldots , m\).
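Theorem 1 says that a mode can be assembled attribute by attribute from the most frequent categories within a cluster. A minimal sketch of such an update, assuming points are stored as tuples and using only the Python standard library, is shown below; ties between equally frequent categories are broken arbitrarily, which is one of several valid choices.

```python
from collections import Counter


def mode_of(cluster):
    """Return a mode of a non-empty collection of m-tuples.

    By Theorem 1, choosing a most frequent category in each attribute
    minimises D(cluster, z); ties are broken arbitrarily here.
    """
    m = len(cluster[0])
    return tuple(
        Counter(point[j] for point in cluster).most_common(1)[0][0]
        for j in range(m)
    )
```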

Theorem 1 defines the process by which cluster modes are updated in \(k\)-modes (see Algorithm 3), and so the final component adopted from the \(k\)-means paradigm is the objective (cost) function to be optimised. This function is defined in Definition 4, and following that a practical statement of the \(k\)-modes algorithm, as set out in Huang (1998), is given in Algorithm 1.

Definition 4

Let \({\mathcal {Z}} = \left\{ Z_1, \ldots , Z_k\right\} \) be a clustering of a dataset \({\mathcal {X}}\), and let \({\overline{Z}} = \left\{ z^{(1)}, \ldots , z^{(k)}\right\} \) be the corresponding cluster modes. Then \(W = \left( w_{i, l}\right) \) is an \(N \times k\) partition matrix of \({\mathcal {X}}\) such that:

$$\begin{aligned} w_{i, l} = {\left\{ \begin{array}{ll} 1, & \text {if } X^{(i)} \in Z_l, \\ 0, & \text {otherwise.} \end{array}\right. } \end{aligned}$$

The cost function is defined to be the summed within-cluster dissimilarity:

$$\begin{aligned} C\left( W, {\overline{Z}}\right) := \sum _{l=1}^{k} \sum _{i=1}^{N} \sum _{j=1}^{m} w_{i,l} \ \delta \left( x_j^{(i)}, z_j^{(l)}\right) . \end{aligned}$$
(4)
[Algorithms 1–3: the \(k\)-modes algorithm and its subroutines]
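To make the overall procedure concrete, the following sketch combines the helpers above into a bare-bones \(k\)-modes loop: random distinct initial modes, then alternating point assignment and attribute-wise mode updates until the assignment stabilises. It is a simplified reading of Algorithm 1, not a faithful reproduction; in particular, it reassigns all points in each pass and leaves empty clusters untouched rather than handling them as a practical implementation would.

```python
import random


def assign(data, modes):
    """Index of the nearest mode for each point (ties broken by mode order)."""
    return [min(range(len(modes)), key=lambda l: dissimilarity(x, modes[l]))
            for x in data]


def cost(data, modes, labels):
    """Summed within-cluster dissimilarity, Eq. (4)."""
    return sum(dissimilarity(x, modes[l]) for x, l in zip(data, labels))


def k_modes(data, k, max_iter=100, seed=None):
    """Bare-bones k-modes with the standard random initialisation."""
    rng = random.Random(seed)
    modes = rng.sample(list(data), k)
    labels = assign(data, modes)
    for _ in range(max_iter):
        for l in range(k):
            cluster = [x for x, lab in zip(data, labels) if lab == l]
            if cluster:  # an empty cluster keeps its previous mode in this sketch
                modes[l] = mode_of(cluster)
        new_labels = assign(data, modes)
        if new_labels == labels:
            break
        labels = new_labels
    return modes, labels, cost(data, modes, labels)
```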

1.2 Initialisation processes

The standard selection method to initialise \(k\)-modes is to randomly sample \(k\) distinct points in the dataset. In all cases, the initial modes must be points in the dataset to ensure that there are no empty clusters in the first iteration of the algorithm. The remainder of this section describes two well-established initialisation methods—those of Huang and Cao.

1.2.1 Huang’s method of initialisation

Amongst the original works by Huang, an initialisation method was presented that selects modes by distributing frequently occurring values from the attribute space among \(k\) potential modes (Huang 1998). The process, denoted as Huang’s method, is described in full in Algorithm 4. Huang’s method considers a set of potential modes, \({\widehat{Z}} \subset \mathcal A\), that is then replaced by the actual set of initial modes, \({\overline{Z}} \subset {\mathcal {X}}\). The statement of how the set of potential modes is formed is ambiguous in the original paper, as is alluded to in Jiang et al. (2016). Here, as in most computational implementations of \(k\)-modes, it is interpreted as a weighted random sample (see Algorithm 5).

[Algorithms 4 and 5: Huang’s method of initialisation and the weighted random sampling of potential modes]
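The sketch below follows that interpretation: each potential mode is built by sampling one category per attribute with probability proportional to its relative frequency (sampling uniformly from the observed column values achieves exactly this), and each potential mode is then replaced by its most similar unused instance in the dataset. The final loop makes the greedy, order-dependent replacement step discussed in Sect. 2.1 explicit; the function name and signature are assumptions for illustration, and the dissimilarity helper from Sect. 1.1 is reused.

```python
import random


def huang_init(data, k, seed=None):
    """One reading of Huang's initialisation (Algorithms 4 and 5)."""
    rng = random.Random(seed)
    m = len(data[0])

    # Sample k potential modes attribute-wise, weighted by relative frequency.
    potential = [
        tuple(rng.choice([x[j] for x in data]) for j in range(m))
        for _ in range(k)
    ]

    # Greedily replace each potential mode with its nearest unused data point;
    # the order of iteration over `potential` can change the outcome.
    modes, used = [], set()
    for z in potential:
        best = min((i for i in range(len(data)) if i not in used),
                   key=lambda i: dissimilarity(data[i], z))
        used.add(best)
        modes.append(data[best])
    return modes
```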

1.2.2 Cao’s method of initialisation

The second initialisation process that is widely used with \(k\)-modes is known as Cao’s method (Cao et al. 2009). This method selects the initial modes according to their density in the dataset whilst forcing dissimilarity between them. Definition 5 formalises the concept of density and its relationship to relative frequency. The method, which is described in Algorithm 6, is deterministic—unlike Huang’s method which relies on random sampling.

Definition 5

(Cao et al. (2009)) Consider a dataset \({\mathcal {X}} \subset {\mathcal {A}} = A_1 \times \cdots \times A_m\). The average density of any point \(X^{(i)} \in {\mathcal {X}}\) with respect to \({\mathcal {A}}\) is defined as:

$$\begin{aligned} \text {Dens}\left( X^{(i)}\right) = \frac{ \sum _{j=1}^m \text {Dens}_{j}\left( X^{(i)}\right) }{m} , \quad \text {where} \quad \text {Dens}_{j}\left( X^{(i)}\right) = \frac{ \left| \left\{ X^{(t)} \in {\mathcal {X}}: x_j^{(i)} = x_j^{(t)}\right\} \right| }{N}. \end{aligned}$$
(5)

Since \( \left| \left\{ X^{(t)} \in {\mathcal {X}}: x_j^{(i)} = x_j^{(t)}\right\} \right| = n\left( x_j^{(i)}\right) = \sum _{t=1}^N \left( 1 - \delta \left( x_j^{(i)}, x_j^{(t)}\right) \right) \,, \) then an alternative definition for (5) is:

$$\begin{aligned} \text {Dens}\left( X^{(i)}\right) = \frac{1}{mN} \sum _{j=1}^m \sum _{t=1}^N \left( 1 - \delta \left( x_j^{(i)}, x_j^{(t)}\right) \right) = 1 - \frac{1}{mN} D\left( {\mathcal {X}}, X^{(i)}\right) . \end{aligned}$$
(6)
[Algorithm 6: Cao’s method of initialisation]
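The sketch below gives one common reading of Cao’s selection rule: the densest point is chosen first, and each subsequent mode maximises the minimum, over the modes already chosen, of its density multiplied by its dissimilarity to that mode, which enforces the separation mentioned above. This is an illustrative reconstruction from Definition 5 and Eq. (6) rather than a verbatim transcription of Algorithm 6; it reuses the dissimilarity helper from Sect. 1.1.

```python
def density(data, i):
    """Average density of data[i] with respect to the attributes, Eq. (6)."""
    N, m = len(data), len(data[0])
    return 1 - sum(dissimilarity(x, data[i]) for x in data) / (m * N)


def cao_init(data, k):
    """Deterministic, density-based initialisation in the spirit of Cao et al."""
    dens = [density(data, i) for i in range(len(data))]
    chosen = [max(range(len(data)), key=lambda i: dens[i])]
    while len(chosen) < k:
        best = max(
            (i for i in range(len(data)) if i not in chosen),
            key=lambda i: min(dens[i] * dissimilarity(data[i], data[c])
                              for c in chosen),
        )
        chosen.append(best)
    return [data[i] for i in chosen]
```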

2 Matching games and our novel method

2.1 Summary of Huang’s and Cao’s methods

Both of the initialisation methods described in Sect. 1.2 have a greedy component. Cao’s method essentially chooses the densest point that has not already been chosen whilst forcing separation between the set of initial modes. In the case of Huang’s, however, the greediness only comes at the end of the method, when the set of potential modes is replaced by a set of instances in the dataset. Specifically, this means that in any practical implementation of this method the order in which a set of potential modes is iterated over can affect the set of initial modes. Thus, there is no guarantee of consistency.

The initialisation proposed in this work extends Huang’s method to be order-invariant in the final allocation—thereby eliminating its greedy component—and provides a more intuitive starting point for the \(k\)-modes algorithm. This is done by constructing and solving a matching game between the set of potential modes and some subset of the data.

2.2 Matching games and the hospital-resident assignment problem

In general, matching games are defined by two sets (parties) of players in which each player creates a preference list of at least some of the players in the other party. The objective then is to find a ‘stable’ mapping between the two sets of players such that no pair of players is (rationally) unhappy with their matching. Algorithms to ‘solve’—i.e. find stable matchings to—instances of matching games are often structured to be party-oriented and aim to maximise some form of social or party-based optimality (Erdil and Ergin 2017; Fuku et al. 2006; Gale and Shapley 1962; Iwama and Miyazaki 2016; Kwanashie et al. 2015; Manlove et al. 2002).

The particular constraints of this case—where the \(k\) potential modes must be allocated to a nearby unique data point—mirror those of the so-called hospital-resident assignment problem (HR). This problem gets its name from the real-world problem of fairly allocating medical students to hospital posts. A resident-optimal algorithm for solving HR was presented in Gale and Shapley (1962) and was adapted in Roth (1984) to take advantage of the structure of the game. This adapted algorithm is given in Algorithm 7. It has been implemented in Python as part of the matching library (The Matching library developers 2019), and that implementation is used for the proposed method in Sect. 3.

Table 1 A summary of the relationships between the components of the initialisation for \(k\)-modes and those in a matching game \((R, H)\)

The game used to model HR, its matchings, and its notion of stability are defined in Definitions 6–8. A summary of these definitions in the context of the proposed \(k\)-modes initialisation is given in Table 1 before a formal statement of the proposed method in Algorithm 11. Before this formal description, we offer an intuitive explanation of how HR yields an initial clustering solution. The HR assignment problem is a type of two-sided matching problem that arises in the context of medical residency programs. In this problem, medical students (residents) are matched with hospitals, where they will complete their training. The goal of the assignment is to match residents to hospitals in a way that is fair and efficient.

There are several constraints and considerations that must be taken into account when solving the HR assignment problem. For example, the number of residents that a hospital can accommodate is typically limited, and some hospitals may have preferences for certain types of residents (e.g. those with a particular specialty or background). Additionally, residents may have their own preferences for which hospitals they would like to work at.

In our context, potential cluster modes are considered the residents, and the data points closest to these modes form the set of possible hospitals to which a resident (mode) may be assigned. The ordered hospital preference list of a resident is determined by the similarity between a resident (mode) and a point (hospital). Replacing a potential mode with a data point then amounts to making a match; hence the analogy with a matching game.

Definition 6

Consider two distinct sets \(R\) and \(H\), and refer to them as residents and hospitals. Each \(h \in H\) has a capacity \(c_h \in {\mathbb {N}}\) associated with it. Each player \(r \in R\) and \(h \in H\) has associated with it a strict preference list of the other set’s elements such that:

  • Each \(r \in R\) ranks a non-empty subset of \(H\), denoted by \(f(r)\).

  • Each \(h \in H\) ranks all and only those residents that have ranked it, i.e. the preference list of \(h\), denoted \(g(h)\), is a permutation of the set \(\left\{ r \in R \ | \ h \in f(r)\right\} \). If no such residents exist, \(h\) is removed from \(H\).

This construction of residents, hospitals, capacities and preference lists is called a game and is denoted by \((R, H)\).

Definition 7

Consider a game \((R, H)\). A matching \(M\) is any mapping between \(R\) and \(H\). If a pair \((r, h) \in R \times H\) is matched in \(M\), then this relationship is denoted \(M(r) = h\) and \(r \in M^{-1}(h)\).

A matching is only considered valid if all of the following hold for all \(r \in R, h \in H\):

  • If \(r\) is matched then \(M(r) \in f(r)\).

  • If \(h\) has at least one match then \(M^{-1}(h) \subseteq g(h)\).

  • \(h\) is not over-subscribed, i.e. \(\left| M^{-1}(h)\right| \le c_h\).

A valid matching is considered stable if it does not contain any blocking pairs.

Definition 8

Consider a game \((R, H)\). Then, a pair \((r, h) \in R \times H\) is said to block a matching \(M\) if all of the following hold:

  • There is mutual preference, i.e. \(r \in g(h)\) and \(h \in f(r)\).

  • Either \(r\) is unmatched or they prefer \(h\) to \(M(r)\).

  • Either \(h\) is under-subscribed or \(h\) prefers \(r\) to at least one resident in \(M^{-1}(h)\).

[Algorithms 7–11: the resident-optimal algorithm for HR (Algorithm 7), its subroutines, and the proposed matching initialisation method (Algorithm 11)]
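To illustrate how the pieces of Table 1 fit together in practice, the sketch below builds an HR instance with the matching library and extracts initial modes from its resident-optimal solution. For brevity, every data point is treated as a candidate hospital with capacity one, whereas Algorithm 11 restricts attention to points near the potential modes; the preference construction, the function name and the exact library calls follow the library’s documented pattern at the time of writing but should be treated as assumptions that may differ between versions. The dissimilarity helper from Sect. 1.1 is reused.

```python
from matching.games import HospitalResident


def matching_init(data, potential_modes):
    """Illustrative HR-based replacement of potential modes by data points."""
    residents = [f"mode_{r}" for r in range(len(potential_modes))]
    hospitals = [f"point_{i}" for i in range(len(data))]

    # Residents (potential modes) rank the data points by dissimilarity.
    resident_prefs = {
        residents[r]: sorted(
            hospitals,
            key=lambda h: dissimilarity(data[int(h.split("_")[1])],
                                        potential_modes[r]))
        for r in range(len(potential_modes))
    }
    # Hospitals (data points) rank the residents that ranked them; here every
    # resident ranks every point, so each list is a permutation of residents.
    hospital_prefs = {
        hospitals[i]: sorted(
            residents,
            key=lambda r: dissimilarity(data[i],
                                        potential_modes[int(r.split("_")[1])]))
        for i in range(len(data))
    }
    capacities = {h: 1 for h in hospitals}  # each point hosts at most one mode

    game = HospitalResident.create_from_dictionaries(
        resident_prefs, hospital_prefs, capacities)
    matched = game.solve(optimal="resident")

    # The data points assigned a mode become the initial modes.
    return [data[int(h.name.split("_")[1])] for h, rs in matched.items() if rs]
```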

3 Experimental results

3.1 Benchmark data

To give comparative results on the quality of the initialisation processes considered in this work, four well-known, categorical, labelled datasets—breast cancer, mushroom, nursery, and soybean (large)—will be clustered by the \(k\)-modes algorithm with each of the initialisation processes described in the paper.

Table 2 A summary of the benchmark datasets
Table 3 Summative metric results for the breast cancer dataset with \(k=8\)

Each dataset studied in this section is openly available from the UCI Machine Learning Repository (Dua and Graff 2017), and the characteristics of the datasets are summarised in Table 2. For the purposes of this analysis, incomplete instances (i.e. where data is missing) are excluded, and the remaining dataset characteristics are reported as ‘adjusted’. Throughout, when we refer to cost, we refer to the evaluation of the cost function as described in Definition 4, Eq. (4).

3.1.1 Source code and evaluative metrics

All of the source code used to produce the results and data in this analysis—including the datasets investigated in Sect. 3.2—is archived at DOI https://doi.org/10.5281/zenodo.3639282. In addition, the implementation of the \(k\)-modes algorithm and its initialisations is available under DOI https://doi.org/10.5281/zenodo.3638035.

This analysis does not consider evaluative metrics related to classification such as accuracy, recall or precision as is commonly done (Arthur and Vassilvitskii 2007; Cao et al. 2009, 2012; Huang 1998; Ng et al. 2007; Olaode et al. 2014; Schaeffer 2007; Sharma and Gaud 2015). Instead, only internal measures are considered such as the cost function defined in (4). This metric is label-invariant and its values are comparable across the different initialisation methods. Furthermore, the effect of each initialisation method on the initial and final clusterings can be captured with the cost function (4).

3.1.2 Choosing k

Recall that Huang’s method of initialisation is stochastic: initial modes are sampled according to the relative frequencies of the categorical attributes. Cao’s method is essentially deterministic, and initial modes are selected according to their density in the dataset. Our method extends Huang’s by offering ‘intelligent’ initial modes with selection criteria motivated by the HR assignment problem.

The final piece of information required for clustering is a choice for \(k\) for each dataset. An immediate choice is the number of classes that are present in a dataset, but this is not necessarily an appropriate choice since the classes may not be representative of true clusters (Mémoli 2011). However, this analysis will consider this case as there may be practical reasons to limit the value of \(k\). The other strategy for choosing \(k\) considered in this work uses the knee point detection algorithm introduced in Satopaa et al. (2011). The knee point detection algorithm was employed over values of \(k\) from 2 up to \(\lfloor \sqrt{N}\rfloor \) for each dataset. The number of clusters determined by this strategy is reported in the final column of Table 2.
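An implementation of the knee point detection algorithm of Satopaa et al. (2011) is available in the Python package kneed; the sketch below shows how such a choice of \(k\) might be automated. The final_cost helper, the convexity settings and the fallback when no knee is found are assumptions made for illustration rather than the exact configuration used in this analysis.

```python
import math

from kneed import KneeLocator


def choose_k(data, final_cost):
    """Choose k via knee point detection over k = 2, ..., floor(sqrt(N)).

    `final_cost(data, k)` is assumed to run the clustering for a given k
    and return the resulting final cost; such costs typically decrease
    with k, giving a convex, decreasing curve.
    """
    ks = list(range(2, int(math.sqrt(len(data))) + 1))
    costs = [final_cost(data, k) for k in ks]
    knee = KneeLocator(ks, costs, curve="convex", direction="decreasing")
    return knee.knee or ks[-1]  # fall back to the largest k if no knee is found
```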

3.1.3 Using knee point detection algorithm for \(k\)

Tables 3, 4, 5 and 6 summarise the results of each initialisation method on the benchmark datasets where the number of clusters has been determined by the knee point detection algorithm. Each column shows the mean value of each metric and its standard deviation in parentheses over 250 independent repetitions of the \(k\)-modes algorithm.

Table 4 Summative metric results for the mushroom dataset with \(k=17\)
Table 5 Summative metric results for the nursery dataset with \(k=23\)
Table 6 Summative metric results for the soybean dataset with \(k=8\)

By examining these tables, it would seem that the proposed method and Huang’s method are comparable across the board—although the proposed method is faster despite taking more iterations in general, which may relate to a more intuitive initialisation. More importantly though, it appears that Cao’s method performs the best out of the three initialisation methods: in terms of initial and final costs, Cao’s method improves, on average, by roughly \(10\%\) on the next best method for the three datasets with which it succeeds; the number of iterations is comparable; and the computation time is substantially less than that of the other two, considering it is a deterministic method and need only be run once to achieve this performance.

However, in the \(k\)-means paradigm, a particular clustering is selected based on it having the minimum final cost over a number of runs of the algorithm—not the mean—and whilst Cao’s method is very reliable, in that there is no variation at all, it does not always produce the best clustering possible. There is a trade-off to be made between computational time and performance here. In order to gain more insight into the performance of each method, a more granular analysis is required. Figures 1, 2, 3 and 4 display the cost function results for each dataset in the form of a scatter plot and two empirical cumulative distribution function (CDF) plots, highlighting the breadth and depth of the behaviours exhibited by each initialisation method.

Looking at Fig. 1, it is clear that in terms of final cost Cao’s method is middling when compared to the other methods. This was apparent from Table 3, and indeed, Huang’s and the proposed method are both very comparable when looking at the main body of the results. However, since the criterion for the best clustering (in practical terms) is having the minimum final cost, it is evident that the proposed method is superior; that the method produces clusterings with a larger cost range (indicated by the trailing right-hand side of each CDF plot) is irrelevant for the same reason.

This pattern of largely similar behaviour between Huang’s and the proposed method is apparent in each of the figures here, and in each case, the proposed method outperforms Huang’s. In fact, in all cases except for the nursery dataset, the proposed method achieves the lowest final cost of all the methods and, as such, performs the best in practical terms on these particular datasets.

In the case of the nursery dataset, Cao’s method is unquestionably the best performing initialisation method. It should be noted that none of the methods were able to find an initial clustering that could be improved upon and that this dataset exactly describes the entire attribute space in which it exists. This property could be why the other methods fall behind Cao’s so decisively: Cao’s method is able to definitively choose the \(k\) most dense-whilst-separated points from the attribute space as the initial cluster centres, whereas the other two methods are in essence randomly sampling from this space. Why each initial solution in these repetitions is locally optimal remains an open question.

Fig. 1 Summative plots for the breast cancer dataset with \(k=8\)

Fig. 2 Summative plots for the mushroom dataset with \(k=17\)

Fig. 3 Summative plots for the nursery dataset with \(k=23\)

Fig. 4 Summative plots for the soybean dataset with \(k=8\)

Table 7 Summative metric results for the breast cancer dataset with \(k=2\)
Table 8 Summative metric results for the mushroom dataset with \(k=2\)
Table 9 Summative metric results for the nursery dataset with \(k=5\)
Table 10 Summative metric results for the soybean dataset with \(k=15\)

3.1.4 Using number of classes for \(k\)

As is discussed above, the often automatic choice for \(k\) is the number of classes present in the data; this subsection repeats the analysis from the subsection above but with this traditional choice for \(k\). Tables 7, 8, 9 and 10 contain the analogous summaries of each initialisation method’s performance on the benchmark datasets over the same number of repetitions.

An immediate observation, in comparison with the previous tables, is that for all datasets bar the soybean dataset, the mean costs are significantly higher and the computation times are lower. These effects come directly from the choice of \(k\): higher values of \(k\) require more checks (and thus more computational time) but typically lead to more homogeneous clusters, reducing their within-cluster dissimilarity and therefore the cost.

Looking at these tables on their own, Cao’s method is the superior initialisation method on average: the means are substantially lower in terms of initial and final cost; there is no deviation in these results; again, the total computational time is a fraction of the other two methods. It is also apparent that Huang’s method and the proposed extension are very comparable on average. As before, finer investigation will require finer visualisations. Figures 5, 6, 7 and 8 show the same plots as in the previous subsection except the number of clusters has been taken to be the number of classes present in each dataset.

Figures 5 and 6 indicate that a particular behaviour emerged during the runs of the \(k\)-modes algorithm. Specifically, each solution falls into one of (predominantly) two types: effectively no improvement on the initial clustering, or terminating at some clustering with a cost that is bounded below across all such solutions. Invariably, Cao’s method achieves or approaches this lower bound and unless Cao’s method is used, these particular choices for \(k\) mean that the performance of the \(k\)-modes algorithm is exceptionally sensitive to its initial clustering. Moreover, the other two methods are effectively indistinguishable in these cases and so if a robust solution is required, Cao’s method is the only viable option.

Figure 7 corresponds to the nursery dataset results with \(k=5\). In this set of runs, the same pattern emerges as in Fig. 3 where sampling the initial centres from amongst the most dense points (via Huang’s method and the proposed) is an inferior strategy to one considering the entire attribute space such as with Cao’s method. Again, no method is able to improve on the initial solution except for one repetition with the matching initialisation method.

Fig. 5 Summative plots for the breast cancer dataset with \(k=2\)

Fig. 6 Summative plots for the mushroom dataset with \(k=2\)

Fig. 7 Summative plots for the nursery dataset with \(k=5\)

Fig. 8 Summative plots for the soybean dataset with \(k=15\)

Fig. 9 Histograms of fitness for the top-performing percentile in each case

3.1.5 Conclusion of this analysis

The primary conclusion from this analysis is that whilst Huang’s method is largely comparable to the proposed extension, there is no substantial evidence from these use cases to use Huang’s method over the one proposed in this work. In fact, Fig. 8 is the only instance where Huang’s method was able to outperform the proposed method. Other than this, the proposed method consistently performed better than (or as well as) Huang’s method in terms of minimal final cost and computational time over a number of runs, both when an external framework was imposed on the data (by choosing \(k\) to be the number of classes) and when it was not. Furthermore, though not discussed in this work, the matching initialisation method has the scope to allow for expert or prior knowledge to be included in an initial clustering by using some ad hoc preference list mechanism.

3.2 Artificial datasets

3.2.1 Generating data

This stage of the analysis relies on a method for generating artificial datasets introduced in Wilde et al. (2019). In essence, this method is an evolutionary algorithm which acts on entire datasets to explore the space in which potentially all possible datasets exist. The key component of this method is an objective function that takes a dataset and returns a value that is to be minimised; this function is referred to as the fitness function.

In order to reveal the nuances in the performance of Cao’s method and the proposed initialisation on a particular dataset, two cases are considered: where Cao’s method outperforms the proposed, and vice versa. Both cases use the same fitness function (with the latter using its negative) which is defined as follows:

$$\begin{aligned} f\left( {\mathcal {X}}\right) = C_{\textrm{cao}} - C_{\textrm{match}} \end{aligned}$$
(7)

where \(C_{\textrm{cao}}\) and \(C_{\textrm{match}}\) are the final costs when a dataset \({\mathcal {X}}\) is clustered using Cao’s method and the proposed matching method, respectively, with \(k = 3\). For the sake of computational time, the proposed initialisation was given 25 repetitions as opposed to the 250 repetitions in the remainder of this section. Apart from the sign of \(f\), the dataset generation processes used identical parameters in each case and the datasets considered here are all of comparable shape. This process yielded approximately 35,000 unique datasets for each case, and the ensuing analysis only considers the top-performing percentile of datasets from each. Figure 9 shows the fitness distribution of the top percentile in each case. It should be clear from (7) that large negative values are preferable here. With that, and bearing in mind that the generation of these datasets was parameterised in a consistent manner, it appears that the attempt to outperform Cao’s method proved somewhat easier. This is indicated by the substantial difference in the locations of the fitness distributions.
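A short sketch of how the fitness in Eq. (7) might be evaluated for a candidate dataset is given below. The run_kmodes helper, the use of the minimum final cost over the repetitions of the stochastic method, and the argument names are assumptions made to match the description above rather than the authors’ exact implementation.

```python
def fitness(dataset, run_kmodes, k=3, repetitions=25):
    """Fitness of a dataset as in Eq. (7); large negative values favour
    the proposed matching initialisation over Cao's method."""
    # Cao's method is deterministic, so a single run suffices.
    cost_cao = run_kmodes(dataset, k, init="cao")
    # The proposed method is stochastic; keep its best (minimum) final cost.
    cost_match = min(run_kmodes(dataset, k, init="matching")
                     for _ in range(repetitions))
    return cost_cao - cost_match
```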

Fig. 10 Distribution plots for the (a) variance, (b) skewness and (c) kurtosis of the first principal components in each case

Fig. 11 Distribution plots for the (a) interquartile range, (b) lower decile and (c) upper decile of the first principal components in each case

3.2.2 Analysis of generated data

Given the quantity of data available, the generated datasets must be summarised in order to understand the patterns that have emerged; in this case, univariate statistics are used. Despite the datasets all being of similar shapes, there are some discrepancies. With the number of rows, this is less of an issue, but any comparison of statistics across datasets of different widths is difficult without prior knowledge of the datasets. Moreover, there is no guarantee of contingency amongst the attributes, and the comparison of more than a handful of variables becomes complicated even when the attributes are identifiable. To combat this and bring uniformity to the datasets, each dataset is represented by its first principal component obtained via centred principal component analysis (PCA) (Jolliffe 1986). This representation captures the most important characteristics of each dataset in a single variable, meaning they can be compared directly.

Since the transformation by PCA is centred, all measures for central tendency are moot. In fact, the mean and median are not interpretable here given that the original data are categorical. As such, the univariate statistics used here describe the spread and shape of the principal components and are split into two groups:

  • Central moments: variance, skewness and kurtosis.

  • Empirical quantiles: interquartile range, lower decile and upper decile.
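A sketch of this summary pipeline is given below: each categorical dataset is one-hot encoded, projected onto its first (centred) principal component with scikit-learn, and then summarised by the six statistics listed above using NumPy and SciPy. The one-hot encoding step is an assumption about how the categorical data are made numeric before PCA.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.decomposition import PCA


def summarise(dataset):
    """First principal component of a categorical dataset and its summary statistics."""
    dummies = pd.get_dummies(pd.DataFrame(dataset))  # one-hot encode the categories
    component = PCA(n_components=1).fit_transform(dummies).ravel()

    lower, upper = np.percentile(component, [10, 90])
    return {
        "variance": np.var(component),
        "skewness": stats.skew(component),
        "kurtosis": stats.kurtosis(component),
        "interquartile range": stats.iqr(component),
        "lower decile": lower,
        "upper decile": upper,
    }
```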

3.2.3 Results of analysis

Figures 10 and 11 show the distributions of the six univariate statistics listed above, across all of the principal components in each case. In addition to this, they show a fitted Gaussian kernel density estimate (Bashtannyk and Hyndman 2001) to accentuate the general shape of the histograms. What becomes immediately clear from each of these plots is that for datasets where Cao’s method succeeds, the general spread of their first principal component is much tighter than in the case where the proposed initialisation method succeeds. This is particularly evident in Fig. 10a where relatively low variance in the first case indicates a higher level of density in the original categorical data.

The patterns in the quantiles further support this. Although Fig. 11a suggests that the components of Cao-preferable datasets can have higher interquartile ranges than in the second case, the lower and upper deciles tend to be closer together, as seen in Fig. 11b, c. This suggests that despite the body of the component being spread, its extremities are not.

In Fig. 10b, c, the most notable contrast between the two cases is the range in values for both skewness and kurtosis. This supports the evidence thus far that individual datasets have higher densities and lower variety (i.e. tighter extremities) when Cao’s method succeeds over the proposed initialisation. In particular, larger values of skewness and kurtosis translate to high similarity between the instances in a categorical dataset which is equivalent to having high density.

Overall, this analysis has revealed that if a dataset shows clear evidence of high-density points, then Cao’s method should be used over the proposed method. However, if there is no such evidence, the proposed method is able to find a substantially better clustering than Cao’s method.

4 Conclusion

In this paper, a novel initialisation method for the \(k\)-modes algorithm was introduced that builds on the method set out in the seminal paper by Huang (1998) by finding an initial clustering solution using HR.

Following a thorough description of the \(k\)-modes algorithm and the established initialisation methods, a comparative analysis was conducted amongst the three initialisations using both benchmark and artificial datasets. This analysis revealed that our novel initialisation was able to outperform both of the other methods when the choice of \(k\) was optimised according to a mathematically rigorous elbow method. However, the proposed method was unable to beat Cao’s method (established in Cao et al. (2009)) when an external framework was imposed on each dataset by choosing \(k\) to be the number of classes present.

We believe that this work raises the broader question of how to generate good initial solutions for clustering algorithms, since there must be scope for thinking beyond random sampling or selecting initial modes according to their density in the data. An advantage of our approach is that it offers the potential for computational speed-up and, in large, complex data settings, is likely to make more challenging clusterings computationally tractable. Our proposed method should be employed over Cao’s when there are no hard restrictions on what \(k\) may be, or if there is no immediate evidence that the dataset at hand has some notion of high density. Our future work will be to offer further empirical analysis to demonstrate this.