
A Probabilistic Distance Clustering Algorithm Using Gaussian and Student-t Multivariate Density Distributions

Abstract

A new dissimilarity measure for cluster analysis is presented and used in the context of probabilistic distance (PD) clustering. The basic assumption of PD-clustering is that, for each unit, the product of the probability of the unit belonging to a cluster and the distance between the unit and that cluster is constant. This constant is a measure of the classifiability of the point, and the sum of these constants over the units is called the joint distance function (JDF). The parameters that minimize the JDF maximize the classifiability of the units. The new dissimilarity measure is based on symmetric density functions and allows the method to find clusters characterized by different variances and correlation among variables. The multivariate Gaussian and multivariate Student-t distributions are used; the resulting methods outperform classical PD-clustering, and its variant PD-clustering adjusted for cluster size, on simulated and real datasets.

Introduction

Cluster analysis refers to a wide range of numerical methods aiming to find distinct groups of homogeneous units. Clustering in two or three dimensions is a natural task that humans can often do visually; however, machine approaches are needed for all but such low dimensions. We focus on partitioning clustering methods; given a number of clusters K, partitioning methods assign units to the K clusters by optimizing a given criterion. These methods are generally divided into non-model-based and model-based, according to the distributional assumptions. Model-based clustering, or finite mixture model clustering, assumes that the population probability density function is a convex linear combination of a finite number of density functions; accordingly, these models are very well suited to clustering problems. A variety of methods and algorithms have been proposed for finite mixture model parameter estimation. The most widely used strategy is to find the parameters that maximize the complete-data likelihood function using the expectation-maximization (EM) algorithm, which was proposed in 1977 [10] building on prior work (e.g., [5, 7, 23, 24]). The non-model-based methods generally optimize a criterion based on distance or dissimilarity measures. Different dissimilarity measures can be used depending on the type of data; in this paper, we focus on continuous data.

Formally, let us consider an \(n \times J\) data matrix \(\mathbf {X}\), with generic row vector \(\varvec{x}_i = (x_{i1}, \ldots ,x_{iJ})\). Partitioning algorithms aim to find a set of K clusters, \({{\mathscr {C}}}_k\), with \(k=1, \ldots , K\), such that the elements inside a cluster are homogeneous and \({{\mathscr {C}}}_1 \cup {{\mathscr {C}}}_2 \cup \cdots \cup \mathscr{C}_K=\mathbf {X}\). If, for any pair \(\{k,k^{\prime }\} \in 1,\ldots , K\), \(\mathscr{C}_k \cap {{\mathscr {C}}}_{k'}=\emptyset\), then the clustering technique is called hard or crisp; otherwise, it is called fuzzy or soft. In the latter case, each unit can belong to more than one cluster with a certain membership degree.

The most frequently used non-model-based methods for continuous data are k-means [20] and its fuzzy analogue c-means [4], which minimize the sum of the within-group sums of squares over all variables. In spite of their simplicity, the optimal solution can only be found by applying an iterative, intuitively reasonable procedure. More recently, [3] proposed probabilistic distance (PD) clustering, a distribution-free (i.e., non-model-based) fuzzy clustering technique, where the membership degree is defined as a heuristic probability. The PD-clustering optimization problem represents a special case of the Weber–Fermat problem when the number of 'attraction points' is greater than or equal to three; see [16], among others. In this framework, PD-clustering assumes that the product of the probability of a point belonging to a cluster and the distance of the point from the center of the cluster is constant, and this constant is a measure of the classifiability of the point. The method obtains the centers that maximize the classifiability of all the points. A newer version of the algorithm that considers clusters of different sizes, PDQ-clustering, was proposed by [14], and an extension for high-dimensional data was proposed by [35, 36].

Generally, non-model-based clustering techniques are based only on the distances between the points and the centers; therefore, they do not take into account the shape and the size of the clusters. Accordingly, these techniques may fail when clusters are either non-spherical or spherical with different radii. To overcome this issue, we propose a new dissimilarity measure based on symmetric density functions that has the advantage of accounting for the variability of, and the correlation among, the variables. We use two different density functions, the multivariate Gaussian and the multivariate Student-t, but the approach could be extended to other symmetric densities. We then integrate this measure with PD-clustering and obtain new, more flexible clustering techniques. Preliminary results can be found in [29].

After a background section on PD-clustering and PDQ-clustering (Sect. "Background"), we introduce the new dissimilarity measure and the new techniques (Sect. "Flexible Extensions of PD-Clustering"). We then compare them with some model-based and distance-based algorithms on simulated and real datasets (Sect. "Empirical Evidence from Simulated and Real Data").

Background

In this section we briefly introduce PD-clustering [3], a distance-based soft clustering algorithm, and its extension, PD-clustering adjusted for cluster size [14].

Probabilistic Distance Clustering

Ben-Israel and Iyigun [3] proposed a non-hierarchical distance-based clustering method, called probabilistic distance (PD) clustering. They then extended the method to account for clusters of different size, i.e., PDQ [14]. Tortora et al. [35] proposed a factor version of the method to deal with high-dimensional data. Recently, [29] further extended the method to include more flexibility.

In PD-clustering, the number of clusters K is assumed to be known a priori; an extensive review of how to choose K can be found in [8]. Given some random centers, the probability of any point belonging to a cluster is assumed to be inversely proportional to the distance from the center of that cluster [13]. Suppose we have a data matrix \(\mathbf {X}\) with N units and J variables, and consider K (non-empty) clusters. PD-clustering is based on two quantities: the distance of each data point \(\varvec{x}_i\) from each cluster centre \(\mathbf {c}_k\), denoted \(d(\varvec{x}_i,\mathbf {c}_k)\), and the probability of each point belonging to a cluster, i.e., \(p(\varvec{x}_i,\mathbf {c}_k)\), for \(k=1,\ldots , K\) and \(i=1,\ldots , N\).

For convenience, define \(p_{ik}{:}{=}p(\varvec{x}_i,\mathbf {c}_k)\) and \(d_{ik}{:}{=}d(\varvec{x}_i,\mathbf {c}_k)\). PD-clustering is based on the principle that the product of the distances and the probabilities is a constant depending only on \(\varvec{x}_i\) [3]. Denoting this constant as \(F(\varvec{x}_i)\), we can write this principle as

$$\begin{aligned} { p_{ik}d_{ik}=F(\varvec{x}_i), } \end{aligned}$$
(1)

where \(F(\varvec{x}_i)\) depends only on \(\varvec{x}_i\), i.e., \(F(\varvec{x}_i)\) does not depend on the cluster k. As the distance from the cluster centre decreases, the probability of the point belonging to the cluster increases. The quantity \(F(\varvec{x}_i)\) is a measure of the closeness of \(\varvec{x}_i\) to the cluster centres, and it determines the classifiability of the point \(\varvec{x}_i\) with respect to the centres \(\mathbf {c}_k\), for \(k=1, \ldots , K\). The smaller the value of \(F(\varvec{x}_i)\), the higher the probability of the point belonging to one cluster. If all of the distances between the point \(\varvec{x}_i\) and the centers of the clusters are equal to \(d_i\), then \(F(\varvec{x}_i)=d_i/K\) and all of the probabilities of belonging to each cluster are equal, i.e., \(p_{ik}=1/K\). The sum of \(F(\varvec{x}_i)\) over i is called the joint distance function (JDF). Starting from (1), it is possible to compute \(p_{ik}\), i.e.,

$$\begin{aligned} p_{ik}=\frac{\prod _{m\ne k}d_{im}}{\sum _{m=1}^K\prod _{r\ne m}d_{ir}}, \end{aligned}$$
(2)

for \(k=1,\ldots, K\) and \(i=1, \ldots , N\). The whole clustering problem then consists of identifying the centers that minimize the JDF:

$$\begin{aligned} \text {JDF} = \sum _{i=1}^n \sum _{k=1}^K d_{ik} p_{ik}. \end{aligned}$$
(3)

Extensive details on PD-clustering are given in [3], who suggest using \(p^2\) in (3) because it yields a smoothed version of the problem. It follows that the optimized function becomes

$$\begin{aligned} \text {JDF} = \sum _{i=1}^n \sum _{k=1}^K d_{ik} p_{ik}^2, \end{aligned}$$
(4)

and the centers can be computed as

$$\begin{aligned} \mathbf {c}_{k}=\frac{\sum _{i=1}^N u_{ik} \varvec{x}_i}{\sum _{j=1}^N u_{jk}}, \end{aligned}$$
(5)

with \(u_{ik}={p_{ik}^2}/{d_{ik}}\).
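To make the updates concrete, the following R code performs a single PD-clustering iteration; it is a minimal sketch assuming Euclidean distances, with illustrative function and variable names, and is not the FPDclustering implementation.

```r
# Minimal sketch of one PD-clustering iteration (Euclidean distances assumed).
# X: n x J data matrix; centers: K x J matrix of current cluster centres.
pd_step <- function(X, centers) {
  K <- nrow(centers)
  # d[i, k]: Euclidean distance between unit i and centre k
  d <- sapply(1:K, function(k) sqrt(rowSums(sweep(X, 2, centers[k, ])^2)))
  # Belonging probabilities, Eq. (2): p[i, k] proportional to prod_{m != k} d[i, m]
  p <- t(apply(d, 1, function(di) {
    num <- sapply(1:K, function(k) prod(di[-k]))
    num / sum(num)
  }))
  u <- p^2 / d                                             # u[i, k] = p[i, k]^2 / d[i, k]
  centers_new <- t(sapply(1:K, function(k) colSums(u[, k] * X) / sum(u[, k])))  # Eq. (5)
  list(centers = centers_new, p = p, JDF = sum(d * p^2))   # smoothed JDF, Eq. (4)
}
```

In practice the step is iterated, starting from random centres, until the centres stabilize; points lying exactly on a centre (zero distance) would need separate handling.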

It is worth noting that \(p_{ik}\) satisfies all of the necessary conditions to be a probability, and yet no assumptions are made on its distribution; further, \(p_{ik}\) can only be computed given \(\varvec{x}_i\) and for every \(\mathbf {c}_k\) [28]. Following [13], we refer to the \(p_{ik}\) as subjective probabilities, which are based on degree of belief (see [2]).

PD Clustering Adjusted for Cluster Size

The probabilities obtained using (1) do not account for the cluster size, and the algorithm tends to fail when clusters are unbalanced. Moreover, the resulting clusters have similar variance and covariance matrices. To overcome these issues, [14] proposed PD-clustering adjusted for cluster size (PDQ). They assume that

$$\begin{aligned} { \frac{p_{ik}^2d_{ik}}{q_k}=F(\varvec{x}_i), } \end{aligned}$$
(6)

where \(q_k\) is the cluster size, with the constraint that \(\sum _{k=1}^Kq_k=N\). The \(p_{ik}\) can then be computed via

$$\begin{aligned} p_{ik}=\frac{\prod _{m\ne k}d_{im}/q_{m}}{\sum _{m=1}^K\prod _{r\ne m}d_{ir}/q_{r}}. \end{aligned}$$
(7)

The cluster size is treated as a variable; the value of \(q_k\) that minimizes (6) is

$$\begin{aligned} q_{k}=N\frac{\left( \sum _{i=1}^Nd_{ik}p_{ik}^2\right) ^{1/2}}{\sum _{k=1}^K\left( \sum _{i=1}^Nd_{ik}p_{ik}^2\right) ^{1/2}}, \end{aligned}$$
(8)

for \(k= 1,\ldots , K-1\), and

$$\begin{aligned} q_K=N-\sum _{k=1}^{K-1}q_k. \end{aligned}$$
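A minimal R sketch of the PDQ updates follows, assuming a precomputed \(n \times K\) distance matrix; the names are illustrative and this is not the package implementation.

```r
# Sketch of the PDQ updates: probabilities adjusted for cluster size, Eq. (7),
# followed by the re-estimation of the cluster sizes, Eq. (8).
# d: n x K matrix of distances; q: current vector of K cluster sizes.
pdq_update <- function(d, q) {
  N <- nrow(d); K <- ncol(d)
  dq <- sweep(d, 2, q, "/")                      # d[i, k] / q[k]
  p <- t(apply(dq, 1, function(di) {
    num <- sapply(1:K, function(k) prod(di[-k])) # Eq. (7)
    num / sum(num)
  }))
  w <- sqrt(colSums(d * p^2))                    # (sum_i d[i, k] p[i, k]^2)^(1/2)
  q_new <- N * w / sum(w)                        # Eq. (8); q_new sums to N
  list(p = p, q = q_new)
}
```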

Flexible Extensions of PD-Clustering

Gaussian PD-Clustering

The PDQ algorithm can detect clusters of different sizes and with different within-cluster variability; however, it can still fail to detect the clustering partition when variables are correlated or when there are outliers in the data. To overcome these issues, we propose a new dissimilarity measure based on a density function. Let \(M_k= \max _{\varvec{x}} f(\varvec{x};{\varvec{\mu }}_k,{\varvec{\theta }}_k)\), i.e., the maximum of the density, which for a symmetric unimodal density is attained at \(\varvec{x}={\varvec{\mu }}_k\), and define the quantity

$$\begin{aligned} \delta _{ik}=\log \left( M_kf(\varvec{x}_i;{\varvec{\mu }}_k,{\varvec{\theta }}_k)^{-1}\right) , \end{aligned}$$
(9)

which is a dissimilarity measure, where \(f(\varvec{x}_i;{\varvec{\mu }}_k,{\varvec{\theta }}_k)\) is a symmetric unimodal density function with location parameter \({\varvec{\mu }}_k\) and parameter vector \({\varvec{\theta }}_k\). See Appendix "Dissimilarity Measure" for the proof.

Recall that the density of a multivariate Gaussian distribution is

$$\begin{aligned} \phi (\varvec{x}; \varvec{\mu }, \varvec{\varSigma })=\frac{1}{(2\pi ) ^{\frac{J}{2}}} |\varvec{\varSigma }|^{-\frac{1}{2}}\exp \left\{ -\frac{1}{2}(\varvec{x}-\varvec{\mu })' \varvec{\varSigma }^{-1}(\varvec{x}-\varvec{\mu })\right\} , \end{aligned}$$
(10)

and define \(\phi _{ik}{:}{=}\phi (\varvec{x}_i; \varvec{\mu }_k, \varvec{\varSigma }_k)\), where \(k=1,\ldots ,K\). Using (10) in (9) and the result in (6), the JDF becomes

$$\begin{aligned} \begin{aligned} \text {JDF}&=\sum _{i=1}^n \sum _{k=1}^K\frac{p_{ik}^2}{q_k}\log (M_k)+\sum _{i=1}^n \sum _{k=1}^K \frac{1}{2}\frac{p_{ik}^2}{q_k} \log \left( (2\pi ) ^{J} |\varvec{\varSigma }_k| \right) \\&\quad + \sum _{i=1}^n \sum _{k=1}^K\frac{1}{2}\frac{p_{ik}^2}{q_k}(\varvec{x}_i-\varvec{\mu }_k)'\varvec{\varSigma }_k^{-1}(\varvec{x}_i-\varvec{\mu }_k). \end{aligned} \end{aligned}$$
(11)

This technique is called Gaussian PD-clustering (GPDC), and it offers many advantages when compared to PD-clustering. In particular, the new dissimilarity measure takes into account the impact of different within-cluster variances and of the correlation among variables.
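As an illustration, the Gaussian-based dissimilarity in (9) can be computed with the mvtnorm package as in the following sketch; the function name is ours and the numerical example is arbitrary.

```r
# Sketch of the density-based dissimilarity of Eq. (9) with a Gaussian kernel:
# delta[i, k] = log(M_k) - log(phi(x_i; mu_k, Sigma_k)), where M_k is the
# density at the mode, i.e., phi(mu_k; mu_k, Sigma_k).
library(mvtnorm)

gauss_dissimilarity <- function(X, mu, Sigma) {
  Mk <- dmvnorm(mu, mean = mu, sigma = Sigma)          # maximum of the density
  log(Mk) - dmvnorm(X, mean = mu, sigma = Sigma, log = TRUE)
}

# Example: dissimilarities of three bivariate points from a cluster centred at (0, 0)
X <- rbind(c(0, 0), c(1, 1), c(3, -2))
gauss_dissimilarity(X, mu = c(0, 0), Sigma = diag(2))  # first value is 0
```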

The clustering problem now consists of estimating the \(\varvec{\mu }_k\) and \(\varvec{\varSigma }_k\), \(k=1,\ldots , K\), that minimize (11). A differentiation procedure leads to these estimates. An iterative algorithm is then used to compute the belonging probabilities and update the parameter estimates. More specifically, differentiating (11) with respect to \(\varvec{\mu }_k\) gives

$$\begin{aligned} \frac{\partial \text {JDF}}{\partial \varvec{\mu }_k}=-\sum _{i=1}^n \frac{p_{ik}^2}{q_k}\ \varvec{\varSigma }_k^{-1}(\varvec{x}_i- \varvec{\mu }_k). \end{aligned}$$
(12)

Setting (12) equal to zero and solving for \(\varvec{\mu }_k\) gives

$$\begin{aligned} \varvec{\mu }_k= \frac{\sum _{i=1}^n p_{ik}^2 \varvec{x}_i}{\sum _{i=1}^n p_{ik}^2} \end{aligned}$$
(13)

Now, differentiating (11) with respect to \(\varvec{\varSigma }_k\) gives

$$\begin{aligned} \begin{aligned} \frac{\partial \text {JDF}}{\partial \varvec{\varSigma }_k}&=\sum _{i=1}^n \frac{1}{2}\frac{p_{ik}^2}{q_k}\varvec{\varSigma }_k^{-1}-\varvec{\varSigma }_k^{-1}\sum _{i=1}^n \frac{1}{2}(\varvec{x}_i- \varvec{\mu }_k)(\varvec{x}_i- \varvec{\mu }_k)'\frac{p_{ik}^2}{q_k}\varvec{\varSigma }_k^{-1}\\&=\frac{1}{2}\varvec{\varSigma }_k^{-1} \left[ \sum _{i=1}^n \frac{p_{ik}^2}{q_k}- \sum _{i=1}^n (\varvec{x}_i- \varvec{\mu }_k)(\varvec{x}_i- \varvec{\mu }_k)'\frac{p_{ik}^2}{q_k}\varvec{\varSigma }_k^{-1}\right] . \end{aligned} \end{aligned}$$
(14)

Setting (14) equal to zero and solving for \(\varvec{\varSigma }_k\) (the factor \(1/q_k\), constant in i, cancels) gives

$$\begin{aligned} \varvec{\varSigma }_k= \frac{\sum _{i=1}^n (\varvec{x}_i- \varvec{\mu }_k)(\varvec{x}_i- \varvec{\mu }_k)'p_{ik}^2}{\sum _{i=1}^n p_{ik}^2}. \end{aligned}$$
(15)

It follows that, at a generic iteration \((t+1)\), the parameters that minimize (11) are:

$$\begin{aligned} \varvec{\mu }_k^{(t+1)}= \frac{\sum _{i=1}^n p_{ik}^2\varvec{x}_i}{\sum _{i=1}^n p_{ik}^2}, \end{aligned}$$
(16)
$$\begin{aligned} { \varvec{\varSigma }_k^{(t+1)}= \frac{\sum _{i=1}^n (\varvec{x}_i- \varvec{\mu }_k^{(t+1)})(\varvec{x}_i- \varvec{\mu }_k^{(t+1)})'p_{ik}^2 }{ \sum _{i=1}^n p_{ik}^2 }. } \end{aligned}$$
(17)

Our iterative procedure for GPDC parameter estimation can be summarized as follows:

[Algorithm 1: GPDC iterative procedure]
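The R code below is a minimal sketch of these iterations under simplifying assumptions (single random start, convergence assessed on the centres, no safeguards against singular covariance matrices); it is not the FPDclustering implementation, and all names are illustrative.

```r
# Sketch of GPDC: iterate dissimilarities (Eq. 9), probabilities (Eq. 7),
# cluster sizes (Eq. 8), and parameter updates (Eqs. 16-17).
library(mvtnorm)

gpdc <- function(X, K, max_iter = 100, tol = 1e-6) {
  n <- nrow(X); J <- ncol(X)
  mu <- X[sample(n, K), , drop = FALSE]               # random initial centres
  Sigma <- replicate(K, diag(J), simplify = FALSE)
  q <- rep(n / K, K)
  jdf <- Inf
  for (iter in 1:max_iter) {
    # Gaussian dissimilarities delta[i, k], Eq. (9)
    delta <- sapply(1:K, function(k) {
      log(dmvnorm(mu[k, ], mu[k, ], Sigma[[k]])) -
        dmvnorm(X, mu[k, ], Sigma[[k]], log = TRUE)
    })
    dq <- sweep(delta, 2, q, "/")
    p <- t(apply(dq, 1, function(di) {                # Eq. (7)
      num <- sapply(1:K, function(k) prod(di[-k])); num / sum(num)
    }))
    w <- sqrt(colSums(delta * p^2)); q <- n * w / sum(w)   # Eq. (8)
    jdf <- sum(sweep(delta, 2, q, "/") * p^2)              # JDF, Eq. (11)
    mu_new <- t(sapply(1:K, function(k) colSums(p[, k]^2 * X) / sum(p[, k]^2)))  # Eq. (16)
    Sigma <- lapply(1:K, function(k) {                     # Eq. (17)
      Xc <- sweep(X, 2, mu_new[k, ])
      crossprod(Xc * p[, k]^2, Xc) / sum(p[, k]^2)
    })
    converged <- max(abs(mu_new - mu)) < tol
    mu <- mu_new
    if (converged) break
  }
  list(mu = mu, Sigma = Sigma, q = q, p = p, JDF = jdf, cluster = max.col(p))
}
```

A call such as gpdc(as.matrix(iris[, 1:4]), K = 3) illustrates the intended usage; in practice multiple starts should be used, as discussed in Sect. "Algorithm Details".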

Generalization to a Multivariate Student-t Distribution

The same procedure can be generalized to any symmetric distribution. In this subsection, we use the multivariate Student-t distribution, yielding an algorithm referred to as Student-t PD-clustering (TPDC). TPDC can detect clusters characterized by heavy tails; furthermore, the Student-t distribution has often been used on datasets characterized by outliers [17]. Now, replace (10) with a multivariate Student-t distribution, i.e.,

$$\begin{aligned} f(\varvec{x},\varvec{\mu },\varvec{\varSigma },v)= \frac{\varGamma \left( \frac{v+J}{2} \right) |\varvec{\varSigma }|^{-\frac{1}{2}}}{(\pi v)^{\frac{1}{2}J}\varGamma \left( \frac{v}{2} \right) \left\{ 1+\frac{\delta \left( \varvec{x},\varvec{\mu }, \varvec{\varSigma }\right) }{v}\right\} ^{\frac{1}{2}(v+J)}}, \end{aligned}$$
(18)

where \(\delta \left( \varvec{x},\varvec{\mu }, \varvec{\varSigma }\right) =(\varvec{x}-\varvec{\mu })'\varvec{\varSigma }^{-1}(\varvec{x}-\varvec{\mu })\), and proceed as in Sect. "Gaussian PD-Clustering". Then, the JDF becomes:

$$\begin{aligned} \begin{aligned} \text {JDF}&= \sum _{i=1}^n \sum _{k=1}^K\frac{p_{ik}^2}{q_k}\log (M_k)\\&\quad + \sum _{i=1}^n \sum _{k=1}^K\frac{p_{ik}^2}{q_k}\left[ -\log \left\{ \varGamma \left( \frac{v_k+J}{2}\right) |\varvec{\varSigma }_k|^{-\frac{1}{2}} \right\} \right] \\&\quad + \sum _{i=1}^n \sum _{k=1}^K\frac{p_{ik}^2}{q_k}\log \left\{ \!\!(\pi v_k)^{\frac{J}{2}}\varGamma \left( \frac{v_k}{2} \right) \left( \!1\!+\frac{\delta \left( \varvec{x}_i,\varvec{\mu }_k, \varvec{\varSigma }_k\right) }{v_k}\right) ^{\frac{v_k+J}{2}} \right\} . \end{aligned} \end{aligned}$$
(19)

The parameters that optimize (19) can be found by differentiating with respect to \(\varvec{\mu }_k\), \(\varvec{\varSigma }_k\), and \(v_k\), respectively. Specifically, at a generic iteration \((t+1)\), the parameters that minimize (19) are:

$$\begin{aligned} \varvec{\mu }_k^{(t+1)}= & {} \frac{\sum _{i=1}^n w_{ik}\varvec{x}_i}{\sum _{i=1}^nw_{ik}}, \end{aligned}$$
(20)

with \(w_{ik}={p_{ik}^2}/[{v_k^{(t)}+ \delta (\varvec{x}_i,\varvec{\mu }_k^{(t)}, \varvec{\varSigma }_k^{(t)})}]\),

$$\begin{aligned} \varvec{\varSigma }_k^{(t+1)}= & {} \frac{\sum _{i=1}^n p_{ik}^2(\varvec{x}_i-\varvec{\mu }_k^{(t+1)})(\varvec{x}_i-\varvec{\mu }_k^{(t+1)})'s_{ik}}{\sum _{i=1}^n p_{ik}^2}, \end{aligned}$$
(21)

with \(s_{ik}={(v_k^{(t)}+J)}/[{v_k^{(t)}+\delta (\varvec{x}_i,\varvec{\mu }_k^{(t+1)}, \varvec{\varSigma }_k^{(t)})}]\), and the degrees of freedom update \(v_k^{(t+1)}\) is the solution to the following equation:

$$\begin{aligned}&\sum _{i=1}^n p_{ik}^2\left[ \psi \left( \frac{v_k}{2}\right) -\psi \left( \frac{v_k+J}{2}\right) + \frac{J}{2v_k}\right] \nonumber \\&\quad +\sum _{i=1}^n p_{ik}^2\left[ \frac{1}{2} \log \left( 1+ \frac{\delta \left( \varvec{x}_i,\varvec{\mu }_k^{(t+1)}, \varvec{\varSigma }_k^{(t+1)}\right) }{v_k^{(t)}}\right) \right] \nonumber \\&\quad -\frac{1}{2} \frac{v_k+J}{v_k} \sum _{i=1}^n p_{ik}^2 \frac{\delta \left( \varvec{x}_i,\varvec{\mu }_k^{(t+1)}, \varvec{\varSigma }_k^{(t+1)}\right) }{v_k^{(t)}+\delta \left( \varvec{x}_i,\varvec{\mu }_k^{(t+1)}, \varvec{\varSigma }_k^{(t+1)}\right) }=0, \end{aligned}$$
(22)

where

$$\begin{aligned} \psi \left( v\right) =\frac{1}{\varGamma \left( v\right) }\, \frac{\mathrm {d}\varGamma \left( v\right) }{\mathrm {d} v}. \end{aligned}$$

Our iterative algorithm can be summarized as follows:

[Algorithm 2: TPDC iterative procedure]
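As with GPDC, the probability and cluster-size updates follow (7) and (8) with the Student-t-based dissimilarity; the sketch below covers only the parameter updates (20)-(22), with the degrees-of-freedom equation solved numerically. The names, the search interval for the root, and the error handling are our illustrative choices.

```r
# Sketch of one TPDC parameter update, given the current probabilities p (n x K),
# the data X, and the current mu (K x J), Sigma (list of K matrices), v (length K).
tpdc_update <- function(X, p, mu, Sigma, v) {
  J <- ncol(X); K <- ncol(p)
  for (k in 1:K) {
    md <- mahalanobis(X, mu[k, ], Sigma[[k]])      # delta(x_i, mu_k, Sigma_k)
    w <- p[, k]^2 / (v[k] + md)                    # weights of Eq. (20)
    mu[k, ] <- colSums(w * X) / sum(w)
    md <- mahalanobis(X, mu[k, ], Sigma[[k]])      # now uses the updated mu_k
    s <- (v[k] + J) / (v[k] + md)                  # weights of Eq. (21)
    Xc <- sweep(X, 2, mu[k, ])
    Sigma[[k]] <- crossprod(Xc * (p[, k]^2 * s), Xc) / sum(p[, k]^2)
    md <- mahalanobis(X, mu[k, ], Sigma[[k]])
    # Degrees of freedom: zero of the estimating equation (22)
    f <- function(vk) {
      sum(p[, k]^2 * (digamma(vk / 2) - digamma((vk + J) / 2) + J / (2 * vk) +
                        0.5 * log1p(md / v[k]) -
                        0.5 * (vk + J) / vk * md / (v[k] + md)))
    }
    v[k] <- tryCatch(uniroot(f, c(2.001, 200))$root, error = function(e) v[k])
  }
  list(mu = mu, Sigma = Sigma, v = v)
}
```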

Algorithm Details

All of the proposed techniques require a random initialization. Random starts can lead to unstable solutions; to avoid this problem, the algorithms use multiple starts. Moreover, the functions include the option to use PD-clustering or partitioning around medoids (PAM; [15]) to obtain the starting values. As for many other clustering techniques, the optimized function, the JDF in (4), is not convex (not even quasi-convex) and may have other stationary points. For a fixed value of \(\varvec{\varSigma }_k\), the JDF is a monotonically decreasing function of the iterations; this guarantees that the algorithm converges to a minimum, although not necessarily a global minimum. The proposed techniques, GPDC and TPDC, introduce the estimation of \(\varvec{\varSigma }_k\), giving much more flexibility, although the JDF is no longer monotonically decreasing. Using (9) in (4), we obtain

$$\begin{aligned} \text {JDF}=\sum _{i=1}^n \sum _{k=1}^Kp_{ik}^2(\log M_k-\log \phi (\varvec{x}_i;\varvec{\mu }_k,\varvec{\varSigma }_k)) \end{aligned}$$

with \(M_k \ge \phi (\varvec{x}_i;\varvec{\mu }_k,\varvec{\varSigma }_k)\). Therefore, for every \(k=1,\ldots ,K\) and for non-degenerate density functions, each term is non-negative and the JDF is bounded below by zero. Convergence of the algorithm is therefore not assessed on the JDF but on the stability of the centers \(\varvec{\mu }_k\). The time complexity of the algorithm is comparable to that of the EM algorithm: both algorithms require the inversion and the determinant of a \(J \times J\) matrix and, therefore, the time complexity is \(O(n^3JK)\), where n is the number of observations, J the number of variables, and K the number of clusters.
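Under the assumptions of the gpdc() sketch above (which returns the final JDF), a multiple-start strategy can be sketched as follows; the wrapper and its defaults are illustrative.

```r
# Run several random starts and keep the solution with the smallest final JDF.
best_of_starts <- function(X, K, n_starts = 10) {
  runs <- lapply(seq_len(n_starts), function(s) gpdc(X, K))
  runs[[which.min(sapply(runs, function(r) r$JDF))]]
}
```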

Empirical Evidence from Simulated and Real Data

The proposed algorithms have been evaluated on real and simulated datasets. The simulated datasets have been used to illustrate the ability of the algorithms to recover the parameters of the distributions and to compare the new techniques with some existing methods. In the following sections, we used the software R [26]; the functions for both GPDC and TPDC are included in the R package FPDclustering [37].

Simulation Study

The same design was used twice: the first time, each cluster was generated from a multivariate Gaussian distribution with three variables and \(K=3\) clusters; the second time, from a multivariate Student-t distribution with five degrees of freedom and the same number of variables and clusters. We set the parameters using a four-factor full factorial design, with two levels per factor; the factors are:

  • Overlapping and non-overlapping clusters

  • Different numbers of elements per cluster

  • Unit variance and variance greater than 1

  • Uncorrelated and correlated variables

Table 1 shows the parameters used in the simulation study.

Table 1 Model parameters used to generate the simulated datasets

The datasets have been generated using the R package mvtnorm [11]. Tables 5, 6, 7, 8, 9, 10, 11, 12 in Appendix B.2 show the true and the average estimated values of the parameters obtained from 50 runs of the GPDC and TPDC algorithms. For the sake of space, comments are limited to groups of scenarios. The factors that affect the estimates the most are the change in variances and the amount of overlap. Specifically, when data are simulated using multivariate Gaussian distributions, in cases 5–8 and 13–16, the variances are not homogeneous and GPDC tends to underestimate the larger variances and overestimate the smaller ones. TPDC is less affected by this issue, i.e., it underestimates some of the variances but the degrees of freedom compensate; however, in the two extreme scenarios, 8 and 16, it cannot recover the cluster structures. Similar outcomes occur when data are simulated using a multivariate Student-t distribution; moreover, as expected on those datasets, GPDC tends to overestimate the variances and TPDC tends to underestimate the variances and compensate with the degrees of freedom.
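For reference, data for one such scenario can be generated with mvtnorm along the following lines; the parameter values here are arbitrary placeholders, not those of Table 1.

```r
# Illustrative generation of a three-cluster dataset in J = 3 dimensions:
# Gaussian clusters via rmvnorm(), or Student-t clusters with 5 df via rmvt().
library(mvtnorm)
set.seed(1)
n_k  <- c(100, 100, 100)                                   # cluster sizes
mus  <- list(c(0, 0, 0), c(4, 4, 4), c(8, 0, 4))           # cluster centres
Sig  <- diag(3)                                            # common covariance
X_gauss <- do.call(rbind, lapply(1:3, function(k) rmvnorm(n_k[k], mus[[k]], Sig)))
X_t     <- do.call(rbind, lapply(1:3, function(k)
  rmvt(n_k[k], sigma = Sig, df = 5, delta = mus[[k]])))    # shifted Student-t
labels  <- rep(1:3, n_k)
```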

On the same datasets we used the functions gpcm, option VVV, of the R package mixture [6] for the Gaussian mixture models (GMM) and the function teigen, option UUUU, of the homonymous R package [1] for the mixtures of multivariate Student-t distributions (TMM). The k-means algorithm is part of the stats package [27], and the PDQclust function for PDQ clustering is part of the FPDclustering package [37].

Table 2 Average ARI and standard deviation on 50 datasets per scenario

To compare the clustering performance of the methods, we used the adjusted Rand index (ARI) [12], which compares predicted classifications with true classes. The ARI corrects the Rand index [30] for chance: its expected value under random classification is 0, and it takes a value of 1 when there is perfect class agreement. Steinley [31] gives guidelines for interpreting ARI values. Table 2 shows the average ARI and the standard deviation over 50 runs for each algorithm.
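For completeness, the ARI can be computed directly from the contingency table of the two partitions; the following self-contained R function is a sketch of that computation (equivalent values are available from packaged implementations such as mclust::adjustedRandIndex).

```r
# Adjusted Rand index (Hubert and Arabie, 1985) computed from the contingency
# table of the true and predicted partitions.
adjusted_rand_index <- function(true, pred) {
  tab <- table(true, pred)
  comb2 <- function(x) x * (x - 1) / 2          # "choose 2" for each count
  sum_ij <- sum(comb2(tab))
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  expected <- sum_a * sum_b / comb2(sum(tab))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}
```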

As pointed out in the previous sections, GPDC and TPDC are framed in a non-parametric view; however, to evaluate their performance, we compare them with GMM and TMM. Their performance is not expected to be better than that of those techniques, although in most scenarios GPDC and TPDC perform as well as the finite mixture models. As expected, the k-means results are impacted by correlation and non-homogeneous variances. PDQ cannot recover the correct clustering partition in the case of overlapping clusters and non-homogeneous variances; it is not affected by changes in group size or correlation. The proposed techniques, GPDC and TPDC, outperform k-means and PDQ in most scenarios; they show weakness in the two most extreme situations, i.e., scenarios 8 and 16. Specifically, when clusters have different variances and the biggest variance is associated with the smallest cluster, they fail to detect the clustering partitions. Figures 5, 6, 7, 8, 9, 10, 11, 12 in Appendix B.2 show examples of simulated datasets for each scenario.

Real Data Analysis

We performed a real data analysis on three datasets that differ in size and number of clusters (details in Table 3). We performed variable selection prior to cluster analysis (details in Appendix B.1). The seed dataset (Footnote 1) contains information about kernels belonging to three different varieties of wheat, Kama, Rosa, and Canadian, with 70 observations per variety (see Fig. 1). We used the variables: compactness, length of kernel, width of kernel, and asymmetry coefficient. The hematopoietic stem cell transplant (HSCT) data were collected in the Terry Fox Lab at the British Columbia Cancer Agency. They contain information about 9780 cells, each stained with four fluorescent dyes. Experts identified four clusters; moreover, 78 cells were deemed “dead”, leaving a total of 9702 observations. We selected the three most informative variables. Figure 2 shows the partitions defined by the experts. The Australian Institute of Sport (AIS) dataset (Footnote 2) contains data on 102 male and 100 female athletes of the Australian Institute of Sport. We selected the variables: height in cm, hematocrit, plasma ferritin concentration, and percent body fat (see Fig. 3).

Table 3 Number of units, variables, and clusters for the three real datasets
Table 4 Adjusted Rand index for the real datasets

Table 4 shows the ARI on the three datasets. On the seed dataset, GPDC and TPDC perform better than PDQ and k-means. The improvement over PDQ is noticeable: PDQ gives an ARI of 0.17, while GPDC gives an ARI of 0.41. On this dataset, TMM gives the best performance. On the HSCT dataset, GPDC, TPDC, PDQ, and TMM all have a very high ARI. On the AIS dataset, GPDC and TPDC give the best performance.

Fig. 1: Seed dataset, each color and symbol representing a different variety of wheat

Fig. 2: HSCT dataset, each color and symbol representing a partition defined by the experts

Fig. 3: AIS dataset, each color and symbol representing male and female athletes

Conclusion

A new distance measure based on density functions is introduced and used in the context of probabilistic distance clustering adjusted for cluster size (PDQ). PDQ assumes that, for a generic unit, the product of the probability of belonging to a cluster and the distance from that cluster is constant. The minimization of the sum of these constants over the units leads to clusters that maximize the classifiability of the data. We introduce two algorithms based on PDQ that use distance measures based on the multivariate Gaussian distribution and on the multivariate Student-t distribution. Using simulated and real datasets, we show how the new algorithms outperform PDQ and the well-known k-means algorithm.

The algorithms could be extended using different distributions. Further to this point, we mentioned outliers as a possible motivation for the PDQ approach with the multivariate Student-t distribution (Sect. "Flexible Extensions of PD-Clustering"). However, if the objective is dealing with outliers, it would be better to consider the PDQ approach with the multivariate contaminated normal distribution [25]; this will be a topic of future work. Other approaches for handling cluster concentration will also be considered (e.g., [9]), as will methods that accommodate asymmetric, or skewed, clusters (e.g., [18, 19, 21, 22, 32, 34]).

Notes

  1. http://archive.ics.uci.edu/ml/.

  2. GLMsData R package.

References

  1. Andrews JL, Wickins JR, Boers NM, McNicholas PDT. An R package for model-based clustering and classification via the multivariate t distribution. J Stat Softw. 2018;83:7.
  2. Barnett V. Comparative statistical inference. 3rd ed. Hoboken: Wiley; 1999.
  3. Ben-Israel A, Iyigun C. Probabilistic d-clustering. J Classif. 2008;25(1):5–26.
  4. Bezdek JC, Ehrlich R, Full W. FCM: the fuzzy c-means clustering algorithm. Comput Geosci. 1984;10(2–3):191–203.
  5. Blight B. Estimation from a censored sample for an exponential family. Biometrika. 1970;57:389–95.
  6. Browne RP, ElSherbiny A, McNicholas PD. mixture: mixture models for clustering and classification. R package version 1.4. 2015. https://cran.r-project.org/web/packages/mixture/index.html
  7. Buck S. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J R Stat Soc B. 1960;22:302–6.
  8. Chiang M, Mirkin B. Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads. J Classif. 2010;27(1):3–40.
  9. Dang UJ, Browne RP, McNicholas PD. Mixtures of multivariate power exponential distributions. Biometrics. 2015;71(4):1081–9. https://doi.org/10.1111/biom.12351.
  10. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977;39(1):1–38.
  11. Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, Hothorn T. mvtnorm: multivariate normal and t distributions. R package version 1.0-8. 2018.
  12. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218.
  13. Iyigun C. Probabilistic distance clustering. Ph.D. thesis, Rutgers, The State University of New Jersey, New Brunswick. 2007.
  14. Iyigun C, Ben-Israel A. Probabilistic distance clustering adjusted for cluster size. Prob Eng Inf Sci. 2008;22(4):603–21.
  15. Kaufman L, Rousseeuw P. Finding groups in data: an introduction to cluster analysis. New York: Wiley; 1990.
  16. Kulin HW, Kuenne RE. An efficient algorithm for the numerical solution of the generalized Weber problem in spatial economics. J Reg Sci. 1962;4(2):21–33. https://doi.org/10.1111/j.1467-9787.1962.tb00902.x.
  17. Lange KL, Little RJ, Taylor JM. Robust statistical modeling using the t distribution. J Am Stat Assoc. 1989;84(408):881–96.
  18. Lee SX, McLachlan GJ. Finite mixtures of multivariate skew t-distributions: some recent and new results. Stat Comput. 2014;24(2):181–202.
  19. Lin TI. Robust mixture modeling using multivariate skew t distributions. Stat Comput. 2010;20(3):343–56.
  20. MacQueen J. Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp. 1967;1:281–97.
  21. McNicholas SM, McNicholas PD, Browne RP. A mixture of variance-gamma factor analyzers. In: Ahmed SE, editor. Big and complex data analysis: methodologies and applications. Cham: Springer International Publishing; 2017. p. 369–85.
  22. Murray PM, McNicholas PD, Browne RB. A mixture of common skew-t factor analyzers. Statistics. 2014;3(1):68–82.
  23. Newcomb S. A generalized theory of the combination of observations so as to obtain the best result. Am J Math. 1886;8:343–66.
  24. Orchard T, Woodbury M. A missing information principle: theory and applications. In: Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability; 1972, vol 1, pp. 697–715.
  25. Punzo A, McNicholas PD. Parsimonious mixtures of multivariate contaminated normal distributions. Biometr J. 2016;58(6):1506–37.
  26. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2018.
  27. R Core Team and contributors worldwide. stats: the R stats package. R package version 3.1.2. 2014.
  28. Rachev ST, Klebanov LB, Stoyanov SV, Fabozzi FJ. The methods of distances in the theory of probability and statistics. Berlin: Springer; 2013.
  29. Rainey C, Tortora C, Palumbo F. A parametric version of probabilistic distance clustering. In: Greselin F, Deldossi L, Vichi M, Bagnato L, editors. Advances in statistical models for data analysis, studies in classification, data analysis, and knowledge organization. Cham: Springer; 2019. p. 33–43.
  30. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–50.
  31. Steinley D. Properties of the Hubert–Arabie adjusted Rand index. Psychol Methods. 2004;9(3):386.
  32. Tang Y, Browne RP, McNicholas PD. Flexible clustering of high-dimensional data via mixtures of joint generalized hyperbolic distributions. Statistics. 2018;7(1):e177.
  33. Theodoridis S, Koutroumbas K. Pattern recognition. 2nd ed. New York: Academic Press; 2003.
  34. Tortora C, Franczak BC, Browne RP, McNicholas PD. A mixture of coalesced generalized hyperbolic distributions. J Classif. 2019;36(1):26–57.
  35. Tortora C, Gettler Summa M, Marino M, Palumbo F. Factor probabilistic distance clustering (FPDC): a new clustering method for high dimensional data sets. Adv Data Anal Classif. 2016;10(4):441–64.
  36. Tortora C, Gettler Summa M, Palumbo F. Factor PD-clustering. In: Berthold UL, Dirk V, editors. Algorithms from and for nature and life; 2013. p. 115–23.
  37. Tortora C, McNicholas PD. FPDclustering: PD-clustering and factor PD-clustering. R package version 1.4. 2019.


Funding

This study was funded by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada and the Canada Research Chairs program (McNicholas). During the development of the present work, Prof. Francesco Palumbo had a short-term visit at San Jose State University (CA), financially supported by the international short-term mobility program with foreign universities and research centers of the Università degli Studi di Napoli Federico II (DR 2243).

Author information

Correspondence to Cristina Tortora.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.


Appendices

Dissimilarity Measure

A general measure \(d(\varvec{x},\mathbf{y})\) is a dissimilarity measure if the following conditions are satisfied [33, p. 404]:

  1. \(\mathrm{d}(\varvec{x},\mathbf{y})\ge 0\)

  2. \(\mathrm{d}(\varvec{x},\mathbf{y})=0 \Leftrightarrow \varvec{x}= \mathbf{y}\)

  3. \(\mathrm{d}(\varvec{x},\mathbf{y})=\mathrm{d}(\mathbf{y},\varvec{x}).\)

Let \(f(\varvec{x}_i;{\varvec{\mu }}_k,{\varvec{\theta }}_k)\) be a generic symmetric unimodal multivariate density function of the random variable \(\mathbf {X}\) with parameter vector \({{\varvec{\theta }}}_k\) and location parameter \({\varvec{\mu }}_k\); then

$$\begin{aligned} \mathrm{d}(\varvec{x}_i, {\varvec{\mu }}_k)=\log { \left( \frac{M_k}{f(\varvec{x}_i;{\varvec{\mu }}_k, {\varvec{\theta }}_k)}\right) }, \end{aligned}$$
(23)

satisfies all three properties and is therefore a dissimilarity measure, for \(k=1,\ldots , K\).

1. \(\mathrm{d}(\varvec{x}_i, {\varvec{\mu }}_k) \ge 0,\; \forall \varvec{x}_i\).

Proof

$$\begin{aligned} \quad \quad 0<\frac{f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)}{M_k}\le 1\Rightarrow & {} \frac{M_k}{f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)} \ge 1\Rightarrow \\\Rightarrow & {} \log \left( \frac{M_k}{f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)}\right) \ge 0. \end{aligned}$$

2. \(\mathrm{d}(\varvec{x}_i, {\varvec{\mu }}_k)=0 \Leftrightarrow \varvec{x}_i={\varvec{\mu }}_k\).

2a. \(\varvec{x}_i ={\varvec{\mu }}_k \Rightarrow d(\varvec{x}_i, {\varvec{\mu }}_k)=0 \; \forall \varvec{x}_i\). Proof

$$\begin{aligned} \varvec{x}_i=\varvec{\mu }_k \Rightarrow f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)= f(\varvec{\mu }_k;\varvec{\mu }_k,{\varvec{\theta }}_k)= M_k \Rightarrow \frac{M_k}{M_k} =1 \Rightarrow \log \left( 1\right) =0. \end{aligned}$$

2b. \(\mathrm{d}(\varvec{x}_i, {\varvec{\mu }}_k)=0 \Rightarrow \varvec{x}_i ={\varvec{\mu }}_k, \; \forall \varvec{x}_i\).

Proof

$$\begin{aligned} \log \left( \frac{M_k}{f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)}\right) =0&\Rightarrow \frac{M_k}{f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)} =1 \\&\Rightarrow f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)=M_k =f(\varvec{\mu }_k;\varvec{\mu }_k,{\varvec{\theta }}_k)\\&\Rightarrow \varvec{x}_i=\varvec{\mu }_k. \end{aligned}$$

3. \(\mathrm{d}(\varvec{x}_i, {\varvec{\mu }}_k)=\mathrm{d}({\varvec{\mu }}_k,\varvec{x}_i), \; \forall \varvec{x}_i\). Proof. Given \({\varvec{\theta }}_k\),

$$\begin{aligned} \qquad f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k) =f(\varvec{\mu }_k;\varvec{x}_i,{\varvec{\theta }}_k), \Rightarrow\log \left( \frac{M_k}{f(\varvec{x}_i;\varvec{\mu }_k,{\varvec{\theta }}_k)}\right) =\log \left( \frac{M_k}{f(\varvec{\mu }_k;\varvec{x}_i,{\varvec{\theta }}_k)}\right) \end{aligned}$$

\(\square\)
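A quick numerical sanity check of the three properties, for a bivariate Gaussian kernel, can be run as follows; it is an illustration, not part of the formal proof.

```r
# Check the dissimilarity properties numerically for a Gaussian density.
library(mvtnorm)
dd <- function(x, mu, Sigma = diag(2)) {
  log(dmvnorm(mu, mu, Sigma)) - dmvnorm(x, mu, Sigma, log = TRUE)
}
x <- c(1, 2); mu <- c(0, 0)
dd(x, mu) >= 0                          # property 1: non-negativity
abs(dd(mu, mu)) < 1e-12                 # property 2: zero when x equals mu
all.equal(dd(x, mu), dd(mu, x))         # property 3: symmetry in x and mu
```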

Additional Details for Data Analyses

Variable Selection

On each dataset we selected one variable per group using hierarchical clustering (Fig. 4).

Fig. 4: Variable selection

Simulated Datasets

Table 5 Simulated datasets using multivariate Gaussian distributions, Scenarios 1–4
Table 6 Simulated datasets using multivariate Gaussian distributions, Scenarios 5–8
Table 7 Simulated datasets using multivariate Gaussian distributions, Scenarios 9–12
Table 8 Simulated datasets using multivariate Gaussian distributions, Scenarios 13–16
Table 9 Simulated datasets using multivariate Student-t distributions, Scenarios 1–4
Table 10 Simulated datasets using multivariate Student-t distributions, Scenarios 5–8
Table 11 Simulated datasets using multivariate Student-t distributions, Scenarios 9–12
Table 12 Simulated datasets using multivariate Student-t distributions, Scenarios 13–16
Fig. 5: Simulated datasets using multivariate Gaussian distributions, Scenarios 1–4, each color and symbol representing a different cluster
Fig. 6: Simulated datasets using multivariate Gaussian distributions, Scenarios 5–8, each color and symbol representing a different cluster
Fig. 7: Simulated datasets using multivariate Gaussian distributions, Scenarios 9–12, each color and symbol representing a different cluster
Fig. 8: Simulated datasets using multivariate Gaussian distributions, Scenarios 13–16, each color and symbol representing a different cluster
Fig. 9: Simulated datasets using multivariate Student-t distributions, Scenarios 1–4, each color and symbol representing a different cluster
Fig. 10: Simulated datasets using multivariate Student-t distributions, Scenarios 5–8, each color and symbol representing a different cluster
Fig. 11: Simulated datasets using multivariate Student-t distributions, Scenarios 9–12, each color and symbol representing a different cluster
Fig. 12: Simulated datasets using multivariate Student-t distributions, Scenarios 13–16, each color and symbol representing a different cluster


Cite this article

Tortora, C., McNicholas, P.D. & Palumbo, F. A Probabilistic Distance Clustering Algorithm Using Gaussian and Student-t Multivariate Density Distributions. SN COMPUT. SCI. 1, 65 (2020). https://doi.org/10.1007/s42979-020-0067-z


Keywords

  • Cluster analysis
  • PD-clustering
  • Multivariate distributions
  • Dissimilarity measures