The KAMILA clustering algorithm is a scalable variant of k-means well suited to mixed-type data sets. It overcomes the challenges inherent in existing methods for clustering mixed continuous and categorical data: they require strong parametric assumptions (e.g., the normal–multinomial mixture model), they are unable to minimize the contribution of individual variables (e.g., Modha–Spangler weighting), or they require an arbitrary choice of weights determining the relative contribution of continuous and categorical variables (e.g., dummy/simplex coding and Gower’s distance).
The KAMILA algorithm combines the best features of two of the most popular clustering algorithms, the k-means algorithm (Forgy 1965; Lloyd 1982) and Gaussian-multinomial mixture models (Hunt and Jorgensen 2011), both of which have been adapted successfully to very large data sets (Chu et al. 2006; Wolfe et al. 2008). Like k-means, KAMILA does not make strong parametric assumptions about the continuous variables, and yet KAMILA avoids the limitations of k-means described in Sect. 2.2. Like Gaussian-multinomial mixture models, KAMILA can successfully balance the contribution of continuous and categorical variables without specifying weights, but KAMILA is based on an appropriate density estimator computed from the data, effectively relaxing the Gaussian assumption.
Notation and definitions
Here, we denote random variables with capital letters and their realizations with lower-case letters. We denote vectors in boldface and scalars in plain text.
Let \(\mathbf {V}_1\), ..., \(\mathbf {V}_i\), ..., \(\mathbf {V}_N\) denote an independent and identically distributed (i.i.d.) sample of \(P \times 1\) continuous random vectors following a mixture distribution with arbitrary spherical clusters with density h such that \( \mathbf {V}_i = (V_{i1}, ..., V_{ip}, ..., V_{iP})^T\), with \( \mathbf {V}_i \sim f_{\mathbf {V}}(\mathbf {v}) = \sum _{g=1}^G \pi _g h(\mathbf {v}; \varvec{\mu }_g)\), where G is the number of clusters in the mixture, \(\varvec{\mu }_g\) is the \(P \times 1\) centroid of the gth cluster of \(\mathbf {V}_i\), and \(\pi _g\) is the prior probability of drawing an observation from the gth population. Let \(\mathbf {W}_1\), ..., \(\mathbf {W}_i\), ..., \(\mathbf {W}_N\) denote an i.i.d. sample of \(Q \times 1\) discrete random vectors, where each element is a mixture of multinomial random variables such that \( \mathbf {W}_i = (W_{i1}, ..., W_{iq}, ..., W_{iQ})^T\), with \(W_{iq} \in \{1, ..., \ell , ..., L_q \}\), and \(\mathbf {W}_i \sim f_{\mathbf {W}}(\mathbf {w}) = \sum _{g=1}^G \pi _g \prod _{q=1}^Q m(w_q ; \varvec{\theta }_{gq})\), where \(m(w; \varvec{\theta }) = \prod _{\ell =1}^{L_q} \theta _{\ell }^{I\{w=\ell \}}\) denotes the multinomial probability mass function, \(I\{\cdot \}\) denotes the indicator function, and \(\varvec{\theta }_{gq}\) denotes the \(L_q \times 1\) parameter vector of the multinomial mass function corresponding to the qth random variable drawn from the gth cluster. We assume \(W_{iq}\) and \(W_{iq'}\) are conditionally independent given population membership \(\forall \; q \ne q'\) (a common assumption in finite mixture models; Hunt and Jorgensen 2011). Let \(\mathbf {X}_1\), ..., \(\mathbf {X}_i\), ..., \(\mathbf {X}_N\) denote an i.i.d. sample from \(\underset{(P+Q) \times 1}{\mathbf {X}_i} = (\mathbf {V}_i^T, \mathbf {W}_i^T)^T\) with \(\mathbf {V}_i\) conditionally independent of \(\mathbf {W}_i\), given population membership.
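To make the notation concrete, the following sketch generates a sample from the mixed-type mixture described above, using a spherical (identity-covariance) normal as one example of the spherical density h; all parameter values (G, P, Q, \(L_q\), \(\pi_g\), \(\varvec{\mu }_g\), \(\varvec{\theta }_{gq}\)) are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

G, P, Q, N = 2, 3, 2, 500           # clusters, continuous dims, categorical dims, sample size
L = [3, 4]                          # numbers of levels L_q for the categorical variables
pi = np.array([0.4, 0.6])           # prior cluster probabilities pi_g
mu = np.array([[0.0, 0.0, 0.0],     # cluster centroids mu_g
               [3.0, 3.0, 3.0]])
# theta[g][q] is the L_q x 1 multinomial parameter vector for variable q in cluster g
theta = [[np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.3, 0.4])],
         [np.array([0.1, 0.3, 0.6]), np.array([0.4, 0.3, 0.2, 0.1])]]

z = rng.choice(G, size=N, p=pi)      # latent population memberships
V = mu[z] + rng.normal(size=(N, P))  # spherical clusters: V_i | z_i = g ~ h(. ; mu_g)
W = np.column_stack([                # conditionally independent multinomial variables W_iq
    np.array([rng.choice(L[q], p=theta[g][q]) for g in z])
    for q in range(Q)
])
```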
In the general case, where categorical variables are not conditionally independent, we model them by replacing them with a single new categorical variable that has one level for every combination of levels of the dependent variables. For example, if \(W_{i1}\) and \(W_{i2}\) are not conditionally independent and have \(L_1\) and \(L_2\) categorical levels respectively, then they would be replaced by the variable \(W_i^*\) with \(L_1 \times L_2\) levels, one for each combination of levels of the original variables (see the sketch below). If categorical and continuous variables are not conditionally independent, then the location model (Krzanowski 1993; Olkin and Tate 1961) can be used, although see the discussion of the location model in Sect. 2.1. KAMILA can be modified to accommodate elliptical clusters; we discuss methods for extending KAMILA in this way at the end of Sect. 3.3 below, and illustrate one such implementation in simulation C. Whether KAMILA is used to identify spherical or elliptical clusters must be specified before the algorithm is run. As in other mixture modeling problems, this decision must be made based on a priori knowledge of the data and clustering goals, or by comparing the performance of the different models using, for example, measures of internal cluster validity. We avoid endorsing any particular measure of cluster validity, as its appropriateness depends entirely on the problem at hand.
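For illustration, the replacement of two dependent categorical variables by a single cross-product variable can be sketched as follows; the function name and the 0-based level codes (the text uses 1-based codes) are hypothetical.

```python
import numpy as np

def cross_categorical(w1, w2, L2):
    """Map the pair (w1, w2), with w2 taking L2 levels coded 0,...,L2-1,
    to a single variable with L1 * L2 levels, one per combination."""
    return w1 * L2 + w2

w1 = np.array([0, 1, 2, 1])              # a variable with L1 = 3 levels
w2 = np.array([1, 0, 1, 1])              # a variable with L2 = 2 levels
w_star = cross_categorical(w1, w2, L2=2) # array([1, 2, 5, 3]); 6 possible levels
```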
At iteration t of the algorithm, let \(\hat{\varvec{\mu }}_g^{(t)}\) denote the estimator for the centroid of population g, and let \(\hat{\varvec{\theta }}_{gq}^{(t)}\) denote the estimator for the parameters of the multinomial distribution corresponding to the qth discrete random variable drawn from population g.
Kernel density estimation
We seek a computationally efficient way to evaluate joint densities of multivariate spherical distributions. We proceed using kernel density (KD) estimates. For multivariate data, however, KD estimates suffer from two problems: unreasonable computation times in high dimensions, and overfitting of the observed sample, which yields density estimates that are too high at the observed points and too low at points not used in the KD fitting (Scott 1992, Chapter 7).
The proposed solution is first derived for spherically distributed clusters, and later extended to elliptical clusters. Using special properties of these distributions, we can obtain KD estimates that are more accurate and faster to calculate than the standard multivariate approaches.
Note that we are not referring to data scattered across the surface of a sphere (e.g., Hall et al. 1987): we are interested in data with densities that are radially symmetric about a mean vector (Kelker 1970); that is, densities that depend only on the distance from a point to the center of the distribution.
KAMILA depends upon a univariate KD estimation step for the continuous clusters. The densities of the continuous clusters are estimated using the transformation method, a general framework for estimating densities of variables that have been transformed by some known function (Bowman and Azzalini 1997, pp. 14–15). Briefly, for KD estimation of a random variable X, this method involves constructing the desired KD estimate as \(\hat{f}(x) = \hat{g}(t(x)) t'(x)\), where t is some differentiable function (e.g. log(x) or \(\sqrt{x}\); in this case we use a continuous distance measure), g denotes the PDF of t(X) with KD estimate \(\hat{g}\), and \(t'(x)\) denotes the derivative of t with respect to x.
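A minimal sketch of the transformation method, assuming \(t(x) = \log(x)\) for a positive-valued variable (the algorithm itself uses a distance transform, as described below); the sample and helper names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
x_sample = rng.lognormal(mean=0.0, sigma=0.5, size=1000)   # positive-valued data

g_hat = gaussian_kde(np.log(x_sample))    # univariate KD estimate of t(X) = log(X)

def f_hat(x):
    """Transformation-method estimate: f_hat(x) = g_hat(t(x)) * t'(x) with t = log."""
    x = np.asarray(x, dtype=float)
    return g_hat(np.log(x)) / x            # t'(x) = 1/x

print(f_hat([0.5, 1.0, 2.0]))
```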
We now make use of the following proposition.
Proposition 2
Let \(\mathbf {V} = (V_1, V_2, \ldots , V_p)^T\) be a random vector that follows a spherically symmetric distribution with center at the origin. Then
$$\begin{aligned} f_{\mathbf {V}}(\mathbf {v}) = \frac{f_R(r) \, {\varGamma }(\frac{p}{2} + 1)}{pr^{p-1} \pi ^{p/2}}, \end{aligned}$$
where \(r = \sqrt{\mathbf {v}^T\mathbf {v}}\), \(r \in [0, \infty ).\)
Proof
See Appendix.
Under spherical cluster densities, we set the function t to be the Euclidean distance, and we obtain a density estimation technique for \(\hat{f}_{\mathbf {V}}\) by replacing \(f_R\) with the univariate KD estimate \(\hat{f}_R\), thus avoiding a potentially difficult multidimensional KD estimation problem.
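The resulting estimator can be sketched as follows: estimate \(f_R\) with a univariate KDE of the observed radii and recover \(\hat{f}_{\mathbf {V}}\) through the constant in Proposition 2. The quick numerical check against a standard normal density uses assumed parameter values for illustration only.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gaussian_kde

def spherical_density_estimate(V, center):
    """Return v -> f_hat_V(v) for data V (n x p) assumed spherical about `center`."""
    V = np.asarray(V, dtype=float)
    p = V.shape[1]
    radii = np.linalg.norm(V - center, axis=1)
    f_R = gaussian_kde(radii)                          # univariate KDE of the radius R

    def f_V(v):
        r = np.linalg.norm(np.atleast_2d(v) - center, axis=1)
        # log of Gamma(p/2 + 1) / (p * r^(p-1) * pi^(p/2)), the constant in Proposition 2
        log_const = gammaln(p / 2 + 1) - np.log(p) - (p / 2) * np.log(np.pi)
        return f_R(r) * np.exp(log_const - (p - 1) * np.log(r))

    return f_V

# Quick check against the true standard normal density in p = 3 dimensions
rng = np.random.default_rng(2)
V = rng.normal(size=(5000, 3))
f_V = spherical_density_estimate(V, center=np.zeros(3))
v0 = np.array([0.5, 0.5, 0.5])
true_density = np.exp(-v0 @ v0 / 2) / (2 * np.pi) ** 1.5
print(f_V(v0), true_density)    # the two values should be close
```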
Algorithm description
Pseudocode for the KAMILA procedure is provided in Algorithm 1. First, each \(\hat{\mu }_{gp}^{(0)}\) is initialized with a random draw from a uniform distribution with bounds equal to the minimum and maximum of the pth continuous variable. Each \(\hat{\varvec{\theta }}_{gq}^{(0)}\) is initialized with a draw from a Dirichlet distribution (Kotz et al. 2004) with shape parameters all equal to one, i.e., a uniform draw from the simplex in \(\mathbb {R}^{L_q}\).
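A sketch of this initialization step, assuming V is the \(N \times P\) continuous data matrix and L lists the level counts \(L_q\); names are illustrative.

```python
import numpy as np

def initialize(V, L, G, rng):
    """Draw initial centroids and multinomial parameters for G clusters."""
    P = V.shape[1]
    lo, hi = V.min(axis=0), V.max(axis=0)
    mu0 = rng.uniform(lo, hi, size=(G, P))               # mu_gp^(0): uniform within observed range
    theta0 = [[rng.dirichlet(np.ones(Lq)) for Lq in L]    # theta_gq^(0): flat Dirichlet draws
              for _ in range(G)]
    return mu0, theta0

rng = np.random.default_rng(3)
V = rng.normal(size=(100, 3))            # placeholder continuous data
mu0, theta0 = initialize(V, L=[3, 4], G=2, rng=rng)
```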
The algorithm is initialized multiple times. For each initialization, the algorithm runs iteratively until a pre-specified maximum number of iterations is reached or until population membership is unchanged from the previous iteration, whichever occurs first. See the online resource, Section 3, for a discussion on selecting the number of initializations and the maximum number of iterations. Each iteration consists of two broad steps: a partition step and an estimation step.
Given a complete set of \(\hat{\mu }_{gp}^{(t)}\)’s and \(\hat{\varvec{\theta }}_{gq}^{(t)}\)’s at the tth iteration, the partition step assigns each of the N observations to one of G groups. First, the Euclidean distance from observation i to each of the \(\hat{\varvec{\mu }}_g^{(t)}\)’s is calculated as
$$\begin{aligned} d_{ig}^{(t)} = \sqrt{\sum _{p=1}^P [\xi _p(v_{ip} - \hat{\mu }_{gp}^{(t)})]^2}, \end{aligned}$$
where \(\xi _p\) is an optional weight corresponding to variable p. Next, the minimum distance is calculated for the ith observation as \(r_i^{(t)} = \underset{g}{\text {min}}(d_{ig}^{(t)})\). The KD of the minimum distances is constructed as
$$\begin{aligned} \hat{f}_R^{(t)}(r) = \frac{1}{Nh^{(t)}} \sum _{\ell =1}^N k \left( \frac{r - r_{\ell }^{(t)}}{h^{(t)}} \right) , \end{aligned}$$
(2)
where \(k(\cdot )\) is a kernel function and \(h^{(t)}\) the corresponding bandwidth parameter at iteration t. The Gaussian kernel is currently used, with bandwidth \(h = 0.9An^{-1/5}\), where \(A=\text {min}(\hat{\sigma }, \hat{q}/1.34)\), \(\hat{\sigma }\) is the sample standard deviation, and \(\hat{q}\) is the sample interquartile range (Silverman 1986, p. 48, equation 3.31). The function \(\hat{f}_R^{(t)}\) is used to construct \(\hat{f}_{\mathbf {V}}^{(t)}\) as shown in Sect. 3.2.
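These computations can be sketched as follows; unit variable weights \(\xi_p = 1\) are assumed, and the KDE of Eq. (2) is written out directly rather than calling a library routine.

```python
import numpy as np

def radial_kde(V, mu):
    """V: N x P continuous data; mu: G x P centroids.
    Returns the distance matrix d_ig, the minimum distances r_i, and the KDE f_R."""
    d = np.linalg.norm(V[:, None, :] - mu[None, :, :], axis=2)   # N x G matrix of d_ig
    r = d.min(axis=1)                                            # r_i = min_g d_ig
    n = len(r)
    iqr = np.subtract(*np.percentile(r, [75, 25]))               # sample interquartile range
    A = min(r.std(ddof=1), iqr / 1.34)
    h = 0.9 * A * n ** (-1 / 5)                                  # bandwidth h = 0.9 A n^(-1/5)

    def f_R(x):
        """Gaussian-kernel density estimate of the minimum distances, Eq. (2)."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        u = (x[:, None] - r[None, :]) / h
        return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

    return d, r, f_R
```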
Assuming independence between the Q categorical variables within a given cluster g, we calculate the log probability of observing the ith categorical vector given population membership as \(\log (c_{ig}^{(t)}) = \sum _{q=1}^Q \xi _q \cdot \log ( \text {m}(w_{iq}; \; \hat{\varvec{\theta }}_{gq}^{(t)}))\), where m\((\cdot ; \cdot )\) is the multinomial probability mass function as given above, and \(\xi _q\) is an optional weight corresponding to variable q.
Although it is possible to run KAMILA with weights for each variable as described above, these weights are not needed to balance the contribution of continuous and categorical variables; KAMILA achieves this balance with all weights set equal to 1. Rather, the weights are intended to allow compatibility with other weighting strategies.
Object assignment is made based on the quantity
$$\begin{aligned} H_i^{(t)}(g) = \log \left[ \hat{f}_{\mathbf {V}}^{(t)}(d_{ig}^{(t)}) \right] + \log \left[ c_{ig}^{(t)} \right] , \end{aligned}$$
(3)
with observation i being assigned to the population g that maximizes \(H_i^{(t)}(g)\). Note that \(\hat{f}_{\mathbf {V}}^{(t)}\) is constructed using just the minimum distances as described above, but it is then evaluated at \(d_{ig}^{(t)}\) for all g, not just the minimum distances.
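Putting the pieces together, a sketch of the assignment step is given below; it assumes the quantities d, f_R, W, and theta from the sketches above, and unit weights \(\xi_q = 1\).

```python
import numpy as np
from scipy.special import gammaln

def assign_clusters(d, f_R, W, theta, P):
    """d: N x G distances d_ig; f_R: radial KDE; W: N x Q categorical codes (0-based);
    theta[g][q]: multinomial parameter vectors; P: number of continuous variables."""
    N, G = d.shape
    Q = W.shape[1]
    # log f_V^(t) evaluated at every d_ig (not only the minimum distances), via Proposition 2
    log_const = gammaln(P / 2 + 1) - np.log(P) - (P / 2) * np.log(np.pi)
    log_fV = np.log(f_R(d.ravel())).reshape(N, G) + log_const - (P - 1) * np.log(d)
    # log c_ig: sum of log multinomial probabilities over the Q categorical variables
    log_c = np.array([[sum(np.log(theta[g][q][W[i, q]]) for q in range(Q))
                       for g in range(G)]
                      for i in range(N)])
    H = log_fV + log_c                    # H_i(g), Eq. (3)
    return H, H.argmax(axis=1)            # criterion values and cluster assignments
```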
Given a partition of the N observations at iteration t, the estimation step calculates \(\hat{\mu }_{gp}^{(t+1)}\) and \(\hat{\varvec{\theta }}_{gq}^{(t+1)}\) for all g, p, and q. Let \({\varOmega }_g^{(t)}\) denote the set of indices of observations assigned to population g at iteration t. We calculate the parameter estimates
$$\begin{aligned} \hat{\varvec{\mu }}_g^{(t+1)}&= \frac{1}{ \left| {\varOmega }_g^{(t)} \right| } \sum _{i \in {\varOmega }_g^{(t)}} \mathbf {v}_i\\ \hat{\theta }_{gq\ell }^{(t+1)}&= \frac{1}{ \left| {\varOmega }_g^{(t)} \right| } \sum _{i \in {\varOmega }_g^{(t)}} I\{w_{iq} = \ell \} \end{aligned}$$
where \(I\{ \cdot \}\) denotes the indicator function and \(|A|=\text {card}(A)\).
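A sketch of the estimation step under these formulas; categorical levels are coded 0, ..., \(L_q - 1\) here, and empty clusters are not handled.

```python
import numpy as np

def estimate_parameters(V, W, labels, G, L):
    """V: N x P continuous data; W: N x Q categorical codes; labels: cluster assignments."""
    # Centroids: within-cluster means of the continuous variables
    mu = np.array([V[labels == g].mean(axis=0) for g in range(G)])
    # Multinomial parameters: within-cluster category proportions
    theta = [[np.bincount(W[labels == g, q], minlength=L[q]) / np.sum(labels == g)
              for q in range(W.shape[1])]
             for g in range(G)]
    return mu, theta
```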
The partition and estimation steps are repeated until a stable solution is reached (or the maximum number of iterations is reached) as described above. For each initialization, we calculate the objective function
$$\begin{aligned} \sum _{i=1}^N \underset{g}{\text {max}} \{ H_i^{(final)}(g) \}. \end{aligned}$$
(4)
The partition that maximizes Eq. (4) over all initializations is returned as the final output.
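The selection over initializations can be sketched as follows; `run_kamila_once` is a hypothetical wrapper around one pass of the partition/estimation loop sketched above, assumed to return the final \(H_i(g)\) values and assignments.

```python
import numpy as np

def best_partition(V, W, L, G, n_init, rng):
    """Run several initializations and keep the partition maximizing Eq. (4)."""
    best_obj, best_labels = -np.inf, None
    for _ in range(n_init):
        H, labels = run_kamila_once(V, W, L, G, rng)   # hypothetical single-run wrapper
        obj = H.max(axis=1).sum()                      # objective of Eq. (4)
        if obj > best_obj:
            best_obj, best_labels = obj, labels
    return best_labels
```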
The number of clusters may be obtained using the prediction strength algorithm (Tibshirani and Walther 2005), as illustrated in Sect. 6. We choose the prediction strength algorithm due to the flexibility with which it can be adapted to many clustering algorithms, as well as the logical and interpretable rationale for the solutions obtained. Given an existing clustering of a data set, the prediction strength algorithm requires a rule that allocates new points into clusters, where the new points might not have been used to construct the original clusters. Tibshirani and Walther (2005) further discuss a strategy for adapting the prediction strength algorithm to hierarchical clustering techniques. The gap statistic (Tibshirani et al. 2001) might be used, although it would need to be adapted to mixed-type data. Information-based methods such as the BIC (Schwarz 1978) might be applicable, although whether the KAMILA objective function behaves as a true log-likelihood should be carefully investigated, particularly with regard to asymptotics. Many internal measures of cluster validity, such as the silhouette width (Kaufman and Rousseeuw 1990), require a distance function defined between points. In this case, the distance function, as given in Eq. (3), is defined only between a point and a cluster; standard internal measures of cluster validity are thus not immediately applicable to KAMILA without further study. Popular internal measures for k-means such as the pseudo-F (Calinski and Harabasz 1974) and pseudo-T (Duda and Hart 1973) statistics are also not readily applicable, since within- and between-cluster sums of squares have no obvious analogue in the current approach.
In certain special cases, if the distribution of \(\mathbf {V}\) is specified, the distribution of R is known [e.g., normal, t, Kotz, and other distributions (Fang et al. 1989)]. However, in our case we allow for arbitrary spherical distributions with distinct centers, which we estimate using the KDE in Eq. (2) and Proposition 2. An investigation of the convergence of \(\hat{f}_R(r)\) to the true density \(f_R(r)\) requires an examination of the mean squared error and mean integrated squared error (Scott 1992), which is beyond the scope of the current paper.
KAMILA can be generalized to elliptical clusters as follows. For a given data set with \(N \times P\) continuous data matrix V, we partition the total sum of squares and cross-products matrix of the continuous variables as described in Art et al. (1982) and Gnanadesikan et al. (1993), yielding an estimate of the covariance matrix of the individual clusters, \(\hat{{\varSigma }}\). The continuous variables can then be rescaled so that individual clusters have approximately identity covariance matrix by taking \(V^* = V \hat{{\varSigma }}^{-1/2}\). The KAMILA clustering algorithm can then be applied to the transformed data set. In simulation C we show that this strategy gives reasonable results on mixtures with elliptical clusters in the continuous variables.
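A minimal sketch of the rescaling step is given below; it substitutes a pooled within-cluster covariance from a preliminary partition for the Art et al. (1982) estimate of \(\hat{{\varSigma }}\), so it is a simplified stand-in for the procedure described above.

```python
import numpy as np

def sphere_continuous(V, labels):
    """Return V* = V @ Sigma_hat^(-1/2), using a pooled within-cluster covariance estimate."""
    G = labels.max() + 1
    centered = np.vstack([V[labels == g] - V[labels == g].mean(axis=0) for g in range(G)])
    Sigma_hat = np.cov(centered, rowvar=False)
    # Inverse square root via the eigendecomposition of the symmetric matrix Sigma_hat
    evals, evecs = np.linalg.eigh(Sigma_hat)
    Sigma_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return V @ Sigma_inv_sqrt
```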
Identifiability considerations
A referee posed the question of identifiability of radially symmetric distributions. Identifiability in finite mixture models is important for inference purposes, and there is an extensive literature discussing this issue in parametric, semi-parametric, and non-parametric contexts. See, for example, Titterington et al. (1985), Lindsay (1995), and McLachlan and Peel (2000). Recent developments in the semi-parametric context include Hunter et al. (2007), who obtain identifiability for univariate samples by imposing a symmetry restriction on the individual components of the mixture. Further work in the univariate case includes Cruz-Medina and Hettmansperger (2004), who assume that the component distributions are unimodal and continuous; see also Ellis (2002) and Bordes et al. (2006).
Of relevance to our work is Holzmann et al. (2006). These authors prove identifiability of finite mixtures of elliptical distributions of the form
$$\begin{aligned} f_{\alpha ,p}(\mathbf {x}) = |{\varSigma }|^{-1/2} \, f_p \left[ (\mathbf {x} - \varvec{\mu })^T {\varSigma }^{-1} (\mathbf {x} - \varvec{\mu }); \varvec{\theta } \right] \end{aligned}$$
(5)
where \(\mathbf {x} \in \mathbb {R}^p\), \(\varvec{\alpha } = (\varvec{\theta }, \varvec{\mu }, {\varSigma }) \in \mathcal {A}^p \subset \mathbb {R}^{k \times p \times p(p+1)/2}\), and \(f_p: \; [0,\infty ) \rightarrow [0,\infty )\) is a density generator, that is, a nonnegative function such that \(\int f_p(\mathbf {x}^T \mathbf {x}; \varvec{\theta }) \, \mathrm {d} \mathbf {x} = 1\). To prove identifiability of mixtures of spherically and elliptically distributed clusters we use Theorem 2 of Holzmann et al. (2006).
Our proposition 2 expresses the density of a spherically symmetric random vector \(\mathbf {V}\) as a function of a general density \(f_R(r)\), \(r=\sqrt{\mathbf {v}^T\mathbf {v}}\). A univariate kernel density estimator of \(f_R\) is given in Eq. (2). We first need the following conditions on the kernel \(k(\cdot )\):
- A1. The kernel \(k(\cdot )\) is a positive function such that \(\int k(u)du = 1\), \(\int u\,k(u)du = 0\), and \(\int u^2 \, k(u)du > 0\).
- A2. The kernel function \(k(\cdot )\) is a continuous, monotone decreasing function such that \(\lim _{u \rightarrow \infty } k(u) = 0\).
- A3. The kernel \(k(\cdot )\) is such that
$$\begin{aligned} \lim _{z \rightarrow \infty } \frac{k(z \, \gamma _{2})}{k(z \, \gamma _{1})} = 0, \end{aligned}$$
where \(\gamma _{1}\), \(\gamma _{2}\) are constants such that \(\gamma _{2} > \gamma _{1}\).
- A4. The number of clusters G is fixed and known, with different centers \(\varvec{\mu }_j, \; j = 1, 2, ..., G\).
- A5. The density functions of the different clusters come from the same family of distributions and differ in terms of their location parameters, i.e., they are \(f(\mathbf {v} - \varvec{\mu }_j), \; j = 1, 2, ..., G\).
Assumptions A1 and A2 are standard in density estimation (Scott 1992). Assumption A3 is satisfied by the normal kernel, which is widely used in density estimation. Furthermore, this condition is satisfied by kernels of the form \(k(z) = A_2 \, \text {exp}[-A_1 \, |z|^s]\), where \(-\infty< z < \infty \), and \(A_1\), \(A_2\), s are positive constants linked by the requirement that k(z) integrates to unity. This class of kernels is an extension of the normal kernel; examples include the standard normal and double exponential kernels. Generally, for identifiability to hold, it is sufficient that the kernel has tails that decay exponentially. Assumption A4 is similar to declaring the number of clusters in advance, as, for example, in the k-means algorithm. Although the problem of identifying the number of clusters is interesting in its own right, the identifiability proof is based on having the number of clusters fixed in advance.
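As a brief check of Assumption A3 for the normal kernel \(k(u) \propto \text {exp}(-u^2/2)\), assuming the constants \(\gamma _{1}\), \(\gamma _{2}\) are positive,
$$\begin{aligned} \lim _{z \rightarrow \infty } \frac{k(z \, \gamma _{2})}{k(z \, \gamma _{1})} = \lim _{z \rightarrow \infty } \text {exp}\left[ -\frac{z^2}{2}\left( \gamma _{2}^2 - \gamma _{1}^2\right) \right] = 0, \end{aligned}$$
since \(\gamma _{2} > \gamma _{1} > 0\) implies \(\gamma _{2}^2 - \gamma _{1}^2 > 0\).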
Proposition 3
Under assumptions A1–A5 the density generator resulting from Proposition 2 with density estimator given in (2) satisfies condition (5) of Theorem 2 of Holzmann et al. (2006).
Proof
See Appendix.
Remark
Theorem 2 of Holzmann et al. (2006), and hence Proposition 3 above, establishes identifiability in the case of elliptical densities. The identifiability of spherical clusters is a special case, which follows by setting \({\varSigma }\) to the identity matrix.