1 Introduction

In an extended and inclusive definition, given a set of multivariate units, the final aim of any clustering approach is to recognize similar units and produce a meaningful grouping of the objects in the data (Aggarwal 2014, Chap. 1). Data clustering is considered unsupervised learning and is formulated as a numerical optimization problem that exploits computing power to maximize a given within-group homogeneity (or between-group heterogeneity) criterion. In many domains, the complexity of the available data sets has risen. This has created a need for methods that do not rely on assumptions about the distribution of the data and can work on variables recorded on diverse measurement scales (nominal, ordinal, and metric), namely mixed-type data (Megahed and Jones-Farmer 2015). This has generated a renewed interest in geometrical clustering approaches and new methods for the treatment of mixed data (Ahmad and Khan 2019; Van de Velden et al. 2019; Mbuga and Tortora 2021), in particular towards numerical optimization algorithms that can effectively exploit the latest generation of multicore CPUs (Oyewole and Thopil 2023; Tjur 2011).

Generally, the number K of groups is unknown a priori, and the choice of the best K remains an open issue. Because of the huge number of possible data partitions for any fixed K, even in the geometrical context and with a moderately large data set, finding the optimal solution that groups the data into homogeneous clusters remains a numerically intractable problem, suggesting a focus on methods that are computationally feasible but may lead to sub-optimal solutions. In this framework, partitioning methods are among the most efficient. They generally start from a given number of (randomly) selected centroids and, until convergence, iterate two steps: (i) (re)assign the units to the closest centroids; (ii) update the centroids. See Gordon (1999) for an exhaustive treatment.

Therefore, when dealing with complex data, i.e., mixed data types, many variables, and a huge number of units, partitioning algorithms remain the most appealing and flexible and are likely the most broadly used. In this family of methods, beyond any doubt, the best-known and most widely used approach is the K-means algorithm. Interested readers may refer to Jain (2009) for a historical perspective on K-means and to the more recent paper by Ikotun et al. (2022) that reconsiders the K-means algorithm in the era of Big Data. Several extensions of K-means clustering for mixed-type data have been proposed, the most popular being k-prototypes (Huang 1998) and KAy-means for MIxed LArge data (KAMILA) (Foss and Markatou 2018). A fuzzy version of k-means, where the cluster memberships are belonging probabilities, is called fuzzy k-means (Bezdek 2013). Given the popularity of K-means clustering, the algorithm is available in most statistical software. In R (R Core Team 2016), the function kmeans is available in the core package stats, while some extensions that jointly combine dimensionality reduction and K-means, namely factorial K-means (Vichi and Kiers 2001) and reduced K-means (De Soete and Carroll 1994), are implemented in the package clustrd (Markos et al. 2019). The extensions for mixed-type data, k-prototypes and KAMILA, are available in the packages clustMixType (Szepannek 2018) and kamila (Foss and Markatou 2018), respectively. Fuzzy k-means is available in the package fclust (Ferraro et al. 2019).

Within the large family of geometric clustering algorithms and their applications to diverse data types, this work considers probabilistic distance (PD) clustering (Iyigun and Ben-Israel 2008; Ben-Israel and Iyigun 2008). PD clustering is a distribution-free, probabilistic clustering method that assigns the units to the given K clusters by solving a numerical optimization problem. The algorithm assigns the units to the clusters according to a K-dimensional membership array, whose values sum to one for each unit and may be interpreted as the probabilities of belonging to a cluster, where K indicates the number of clusters (defined a priori). It is grounded in the following geometric assumption: given a point and K cluster centers lying in a J-dimensional space, the product between the belonging probability and the distance of the point from each cluster center is a constant. The value of the constant is inversely proportional to the classifiability of the point; therefore, PD clustering aims to find the centers that maximize the overall classifiability. This is equivalent to minimizing, over all points in the data, the sum of the products between the distance from each cluster center and the probability that the point belongs to that cluster. Because these two quantities are unknown and mutually dependent, the algorithm alternates two steps: it computes the distances of each point from all the centers and then updates the belonging probabilities and, consequently, the centers. PD clustering is computationally efficient and extremely flexible; moreover, its factor reformulation, proposed by Tortora et al. (2016), can handle data sets with many continuous variables.

This paper reviews the whole PD clustering family and offers a perspective on the usability of PD clustering in various data contexts. To this end, it gives a detailed and illustrated introduction to the method by presenting the recently updated functionalities of the FPDclustering package (Tortora et al. 2024). The package is enriched with effective interpretation wizards that help interpret the results and that have never been systematically presented before. Three applications with real data sets allow the readers to appreciate the PD clustering capabilities under several conditions.

The remainder of the paper is organized as follows: Sect. 2 contains an overview of the discussed methods, with appropriate references for interested readers; Sect. 3 details the algorithm used for each method, while Sect. 4 introduces the package FPDclustering; Sect. 5 shows the package’s functionalities on three real data sets, and Sect. 6 concludes the paper.

2 Statistical methods

2.1 Probabilistic distance clustering

Let \(\textbf{X}\) be a data matrix with n units, J variables, and generic row vector \(\textbf{x}_i\) (\(i=1,\ldots ,n\)). Given K non-empty clusters with centers given by the row vectors \(\textbf{c}_k\) (\(k=1,\ldots ,K\)), the PD-clustering (Ben-Israel and Iyigun 2008) criterion uses two quantities: the distance, \(d(\textbf{x}_i,\textbf{c}_k)\), of each data point \(\textbf{x}_i\) from the K cluster centers \(\textbf{c}_k\), and the probabilities, \(p(\textbf{x}_i,\textbf{c}_k)\), of each point belonging to a cluster. The probability \(p(\textbf{x}_i,\textbf{c}_k)\) of any generic point belonging to each cluster is assumed to be inversely proportional to the distance \(d(\textbf{x}_i,\textbf{c}_k)\) from the cluster centers. The relation between these two quantities is the basic principle of the method. Formally, writing \(p_{ik}\) for \(p(\textbf{x}_i,\textbf{c}_k)\) and \(d_{ik}\) for \(d(\textbf{x}_i,\textbf{c}_k)\) for the sake of compactness, the following equation

$$\begin{aligned} d_{ik}p_{ik}=F(\textbf{x}_{i}), \end{aligned}$$
(1)

holds for any \(i=1,\ldots , n\) and any set of \(k = 1, \ldots , K\) centers. Equation 1 states that the product between the distances and the probabilities is a constant depending only on \(\textbf{x}_i\). Therefore, when \(d_{ik}\) decreases, the probability \(p_{ik}\) of the point \(\textbf{x}_i\) belonging to cluster k increases. The distance \(d_{ik}\) can be computed according to different criteria; the Euclidean norm is one of the most commonly used

$$\begin{aligned} d_{ik} = || \textbf{x}_i-\textbf{c}_k||, \end{aligned}$$

and \(F(\textbf{x}_i)\) measures the classifiability of the point \(\textbf{x}_i\) with respect to the cluster centers \(\textbf{c}_k\), \(k=1, \ldots , K\).

Fig. 1 Toy data set to illustrate PDC. The dots are the data colored based on cluster membership, the X symbols are the centers of the clusters, and the blue square is the point considered for clustering

Table 1 Values of \(d_{ik}\), \(p_{ik}\), and \(F(\textbf{x}_i)\) corresponding to the blue square point in Fig. 1A, B, and C

Summing \(F(\textbf{x}_i)\) over the n units defines the classification criterion, namely the Joint Distance Function (JDF); \(F(\textbf{x}_i)\) measures the closeness of \(\textbf{x}_i\) to all the cluster centers. If a point coincides with one cluster center, then \(F(\textbf{x}_i)=0\); if all the distances between the point \(\textbf{x}_i\) and the centers of the clusters are equal, then \(F(\textbf{x}_i)=d_{ik}/K\) and all the probabilities of belonging to each cluster are equal. The smaller the value of \(F(\textbf{x}_i)\), the higher the classifiability of the point to one cluster; correspondingly, the smaller the JDF, the better the overall partition. The toy example in Fig. 1 shows three simple scenarios. The dots are the data colored based on cluster membership, and the X symbols in the plot are the centers of the clusters, i.e., (0,0), (1,6), and (6,6). The blue square is the point considered for clustering, with coordinates (0,0) in Figure A, (3,3) in Figure B, and (5,6) in Figure C. The values of \(d_{ik}\), \(p_{ik}\), and \(F(\textbf{x}_i)\) are shown in Table 1. In Figure A, the point coincides with the center of cluster 1; therefore, the distance from that cluster is 0, and the probability of belonging to cluster 1 is 1. In scenario B, the point is approximately at the same distance from all the centers, while in scenario C, the point is closer to cluster 3. By construction, for each point, the value of \(F(\textbf{x}_i)\) is the same for each cluster; it is 0 in scenario A, the case of best classifiability, and highest in scenario B, the case of worst classifiability.

Using the fact that \(p_{ik}d_{ik}\) is a constant, the equation for the probabilities can be obtained as

$$\begin{aligned} p_{ik}=\frac{\prod _{m\ne k}d_{im}}{\sum _{m=1}^K\prod _{r\ne m}d_{ir}}. \end{aligned}$$

The clustering problem is solved by minimizing the sum of \(F(\textbf{x}_i)\) over i; this summation does not depend on k.

To ensure convexity, Ben-Israel and Iyigun (2008) suggest using \(p_{ik}^2\) instead of \(p_{ik}\); the clustering problem, solved by minimizing the JDF, then becomes:

$$\begin{aligned} \textrm{JDF} = \sum _{i=1}^n\sum _{k=1}^K||\textbf{x}_i - \textbf{c}_{k}|| p_{ik}^2. \end{aligned}$$
(2)

The summation over k is introduced to avoid the effect of rounding errors. The row vectors \(\textbf{c}_k\) that minimize the JDF are

$$\begin{aligned} \textbf{c}_k=\sum _{i=1}^{n}\left( \frac{u_k(\textbf{x}_i)}{\sum _{j=1}^{n}u_k(\textbf{x}_j)}\right) \textbf{x}_i, \end{aligned}$$
(3)

where

$$\begin{aligned} u_k(\textbf{x}_i)=\frac{p_{ik}^2}{d_{ik}}. \end{aligned}$$

Each unit is assigned to the cluster with the highest belonging probability.
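To make the alternating scheme concrete, the following is a minimal, self-contained R sketch of a single PDC update on a toy data set. It transcribes the probability and center formulas above directly; it is not the package implementation (in practice, the FPDclustering function PDC should be used).

```r
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # two toy groups in the plane
           matrix(rnorm(40, mean = 5), ncol = 2))
C <- X[sample(nrow(X), 2), ]                         # two initial centers
K <- nrow(C)

# distances d_ik = ||x_i - c_k||
D <- sapply(1:K, function(k) sqrt(rowSums(sweep(X, 2, C[k, ])^2)))

# probabilities p_ik = prod_{m != k} d_im / sum_m prod_{r != m} d_ir
P <- t(apply(D, 1, function(d) {
  num <- sapply(1:K, function(k) prod(d[-k]))
  num / sum(num)
}))

# center update c_k = sum_i u_k(x_i) x_i / sum_j u_k(x_j), with u_k = p_ik^2 / d_ik
U <- P^2 / D
C_new <- t(sapply(1:K, function(k) colSums(U[, k] * X) / sum(U[, k])))
```

In the full algorithm these three steps are iterated, recomputing D and P from the updated centers, until the JDF no longer decreases.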

The most used fuzzy clustering method is fuzzy k-means (FKM) (Bezdek 2013), which minimizes \(\sum _{i=1}^{n}\sum _{k=1}^K u_{ik}^v ||\textbf{x}_i - \textbf{c}_{k}|| \), where \(u_{ik}\) is a positive membership grade with \(\sum _{k=1}^K u_{ik}=1\), and v is a fuzzifier parameter. Although setting \(v=2\) makes the objective function formally equal to Eq. (2), the lack of the constraint on the product between distance and probability changes the formulas for computing the centers and the probabilities. Figure 2 shows an example of the application of PDC and FKM on a toy data set generated by merging two bivariate samples pseudo-randomly drawn from normal distributions with different means. The size of the points is proportional to the probability of belonging to the cluster to which they are assigned.

Fig. 2 Toy data set to illustrate the difference between FKM and PDC. The data are colored based on cluster membership; the size of the points is proportional to the belonging probability to the cluster they are assigned to

Focusing on FKM, Fig. 2 shows that the sizes of the points, i.e., the cluster belonging probabilities, are smaller in between clusters. Points lying between two clusters have smaller probabilities of belonging to either cluster and, consequently, a smaller weight when computing the centers. In PDC, the probabilities decrease as the points get farther from the center in any direction, so that distant points are down-weighted in the center computation, which makes the method more robust.
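A comparison in the spirit of Fig. 2 can be sketched with the PDC function from FPDclustering and FKM from fclust; the output component names used below (label and probability for PDC, U and clus for FKM) are assumptions about the respective package outputs, and the simulated data are illustrative.

```r
library(FPDclustering)
library(fclust)

set.seed(2)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

res_pdc <- PDC(toy, k = 2)
res_fkm <- FKM(toy, k = 2, m = 2)   # fuzzifier m = 2

op <- par(mfrow = c(1, 2))
plot(toy, col = res_pdc$label, cex = 2 * apply(res_pdc$probability, 1, max),
     main = "PDC", xlab = "", ylab = "")
plot(toy, col = res_fkm$clus[, 1], cex = 2 * apply(res_fkm$U, 1, max),
     main = "FKM", xlab = "", ylab = "")
par(op)
```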

2.2 PDQ: PDC adjusted for cluster size

When the true data structure consists of clusters of different sizes, PDC, which does not take the cluster size into account, may fail to identify the most appropriate cluster memberships. To overcome this issue, Iyigun and Ben-Israel (2008) proposed PDC adjusted for cluster size (PDQ). PDQ relies on the following adjusted definition of \(F(\textbf{x}_i)\)

$$\begin{aligned} \frac{p_{ik}^2d_{ik}}{q_k}=F(\textbf{x}_i), \end{aligned}$$
(4)

where \(q_k\) is the cluster size, with the constraint that \(\sum _{k=1}^Kq_k=n\). Then, \(p_{ik}\) can be computed according to the following expression:

$$\begin{aligned} p_{ik}=\frac{\prod _{m\ne k}d_{im}/q_{m}}{\sum _{m=1}^K\prod _{r\ne m}d_{ir}/q_{r}}. \end{aligned}$$

The cluster size is considered a variable, and the value of \(q_k\) that minimizes (4) is

$$\begin{aligned} q_{k}=n\frac{\left( \sum _{i=1}^nd_{ik}p_{ik}^2\right) ^{1/2}}{\sum _{k=1}^K\left( \sum _{i=1}^nd_{ik}p_{ik}^2\right) ^{1/2}}, \end{aligned}$$

for \(k= 1,\ldots , K-1\), and \(q_K=n-\sum _{k=1}^{K-1}q_k\).
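As a direct transcription of this cluster-size update, given an \(n\times K\) distance matrix D and probability matrix P already computed, a sketch (not the package code) is:

```r
# q_k = n * (sum_i d_ik p_ik^2)^(1/2) / sum_k (sum_i d_ik p_ik^2)^(1/2)
update_q <- function(D, P) {
  s <- sqrt(colSums(D * P^2))   # one value per cluster
  nrow(D) * s / sum(s)          # normalization already enforces sum(q) = n
}
```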

2.3 Flexible PDQ

In its original formulation, PD clustering does not account for the correlation between variables and is sensitive to outliers. Tortora et al. (2020) proposed a kernel version of PD clustering to increase the method’s flexibility on data sets characterized by correlated variables and outliers, based on a family of symmetric unimodal density functions \(f(\varvec{x};{\varvec{\mu }},{\varvec{\theta }})\), with location parameter vector \(\varvec{\mu }\) and parameter vector \({\varvec{\theta }}\) specifying the function. As PD clustering cannot be separated from the definition of a dissimilarity or distance measure, Tortora et al. (2020) proposed the following dissimilarity measure

$$\begin{aligned} d_{ik}=\log (M_kf(\varvec{x}_i;{\varvec{\mu }}_k,{\varvec{\theta }}_k)^{-1}), \end{aligned}$$
(5)

where \(M_k\) is defined as \(M_k = \max \{f(\varvec{x};{\varvec{\mu }}_k,{\varvec{\theta }}_k)\}\), for any \(k=1,\ldots ,K\). Based on the dissimilarity measure in (5), two PD clustering variants were developed, Gaussian PDC (GPDC) and Student-t PDC (TPDC), based on the multivariate Gaussian and Student-t distributions, respectively.
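As an illustration of Eq. (5) in the Gaussian case, the sketch below computes the dissimilarity of a point from a cluster with mean mu and covariance Sigma using dmvnorm from the mvtnorm package; the helper name gauss_dissim is introduced here only for illustration.

```r
library(mvtnorm)

# d_ik = log( M_k / f(x_i; mu_k, Sigma_k) ), with M_k the density at the mode
gauss_dissim <- function(x, mu, Sigma) {
  Mk <- dmvnorm(mu, mean = mu, sigma = Sigma)   # maximum of the Gaussian density
  log(Mk / dmvnorm(x, mean = mu, sigma = Sigma))
}

gauss_dissim(c(1, 1), mu = c(0, 0), Sigma = diag(2))  # dissimilarity of (1,1) from N((0,0), I)
```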

Fig. 3 Toy data sets to illustrate with which kind of cluster shape PDC, PDQ, and GPDC perform best. The data are colored based on cluster membership

Figure 3 shows three fictitious situations in which PDC, PDQ, and GPDC perform best, respectively.

2.4 PDC for frequency data

The function PDQ includes an option to accommodate frequency data. Let \(\textbf{F}\) be a contingency table, i.e., the data represent frequencies, with n rows and J columns, having general term \(f_{ij}\). The \(\chi ^2\) distance is defined as follows:

$$\begin{aligned} d_{ii^\prime }=\sqrt{\sum _{j=1}^{J} \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot }}- \frac{f_{i^\prime j}}{f_{i^\prime \cdot }} \right) ^2}, \end{aligned}$$

where \(f_{i\cdot }\) and \(f_{\cdot j}\) are the generic row marginal and column marginal frequencies, respectively.
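A direct transcription of the \(\chi ^2\) distance above, for two rows i and ip of a frequency table Fmat, could be sketched as follows (the function name chi2_dist is illustrative only).

```r
# chi-square distance between rows i and ip of a frequency table Fmat
chi2_dist <- function(Fmat, i, ip) {
  f_col   <- colSums(Fmat)                # column marginal frequencies f_.j
  prof_i  <- Fmat[i, ]  / sum(Fmat[i, ])  # row profile f_ij / f_i.
  prof_ip <- Fmat[ip, ] / sum(Fmat[ip, ])
  sqrt(sum((prof_i - prof_ip)^2 / f_col))
}
```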

2.5 PDC for mixed-type data

When \(\textbf{X}\) contains mixed-type data, the function PDQ with Gower’s distance should be used. A different kind of distance is used for each type of variable, and the centers are updated using equation (3) for continuous variables, while the mode is used for categorical and binary variables. More details about PDQ for mixed-type data can be found in Tortora and Palumbo (2022).

2.6 High dimensional data sets

The package also includes Factor PDC (FPDC) for data sets with many variables. The iterative algorithm alternates a linear transformation of the original variables into a reduced number of orthogonal factors with PD clustering on the reduced space, decreasing the JDF at each step (Tortora et al. 2013; Tortora 2011).

Tortora (2011) and Tortora et al. (2013) formally demonstrated the accordance between the Tucker3 method (Kroonenberg 2008) and the PD clustering JDF criterion, and empirical evidence shows that the algorithm converges at least to a local minimum. Specifically, let us define by \(\textbf{G}\) a 3-way \(n \times J \times K\) distance array with general element \(g_{ijk}=|x_{ij}-c_{kj}|\), where n is the number of units, J the number of variables, and K the number of clusters. The Tucker3 method decomposes the three-way array \(\textbf{G}\) into three matrices, one for each way, a full core array, and an error term. The generic term \(g_{ijk}\) is then approximated in a lower-dimensional space according to the following expression

$$\begin{aligned} g^{*}_{ijk} = |x^{*}_{ij}-c^{*}_{kj}| = \sum _{r=1}^{R}\sum _{q=1}^{Q}\sum _{s=1}^{S}\lambda _{rqs}(u_{ir}b_{jq}v_{ks}), \end{aligned}$$

where \(\lambda _{rqs}\) is a weighting parameter, and R, Q, and S are the numbers of retained components (factors) for each of the three ways of \(\textbf{G}\).

The algorithm solves the optimization problem by alternately computing the coordinates of the observations in the space of the variables obtained through the Tucker3 decomposition and then running PD clustering, until the JDF

$$\begin{aligned} JDF = { \sum _{i=1}^n\sum _{q=1}^Q\sum _{k=1}^K p_{ik}^2|x_{iq}^* - c_{kq}^*|}, \end{aligned}$$

no longer decreases, where \(x_{iq}^*\) and \(c_{kq}^*\) are the data and the centers’ projections in the reduced subspace.

As in all factor methods, the factor axes in the Tucker3 model are sorted according to the explained variability, so that the first factor axes explain the greatest part of it. Therefore, the choice of the parameters R, Q, and S is a delicate issue: the aim is to minimize the number of factors while retaining a sufficient amount of explained variability for the specific context. The package FPDclustering provides a tool to help select the best number of factors.

3 PDC algorithm

All the described PD clustering-based algorithms are iterative methods that reach the solution through successive approximations. The PD clustering algorithm requires a minimum set of input parameters consisting of the \(n\times J\) data matrix \(\textbf{X}\) and the number of clusters K. The common core procedure of PDC, PDQ, GPDC, and TPDC can be summarized in the following steps:

  0. center initialization;

  1. computation of the distance matrix;

  2. computation of the belonging probability array and other parameters, where appropriate;

  3. update of the centers;

  4. iteration of steps 1–3 until convergence is reached.

For PD clustering, the function PDC sets the first two centers at the vectors of the minimum and of the maximum values of the input variables in \(\textbf{X}\), respectively. If \(K>2\), then PDC randomly chooses the other \(K-2\) centers.

Convergence is reached when the variation of the JDF or of the centers’ matrix between two successive iterations is of the order of 0.001 or less. The algorithm also stops and returns a warning if convergence is not reached after 1000 iterations. Given the vectors of the centers \(\textbf{c}_1, \textbf{c}_2, \ldots , \textbf{c}_K\), points are assigned to the clusters according to the highest probability. For PDQ, the function PDQ initializes the centers using the different options available for the parameter ini. Available options are random, kmedoids (Van der Laan et al. 2003), and center (the user provides the matrix of starting centers among the function input parameters). When the option kmedoids is chosen and the data are mixed-type, the function automatically uses the k-modes algorithm (Chaturvedi et al. 2001) for the categorical variables. PDQ also has another optional input, dist, to select the distance measure (available options are Euclidean (‘euc’), Gower (‘gower’), and chi-square (‘chi’)). If the data are of mixed type, the following inputs need to be used to specify the column indices corresponding to each variable type: ord for ordinal variables, cat for categorical variables, bin for binary variables, and cont for continuous variables. The optional input w, a numerical vector with as many elements as the data has columns, can be used to give different weights to each variable.
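To fix ideas, a hypothetical PDQ call on a mixed-type data frame dat, with argument names and option labels taken from the description above and all values purely illustrative, could look like:

```r
res <- PDQ(dat, k = 3, ini = "kmedoids", dist = "gower",
           cont = 6:10,                        # columns holding continuous variables
           cat  = c(2, 4, 5),                  # categorical variables
           bin  = c(1, 3),                     # binary variables
           w    = c(rep(0.5, 5), rep(1, 5)))   # optional variable weights, one per column
```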

The functions GPDC and TPDC also offer the option to change the initialization criterion through the parameter method; available options are random, kmedoids, and PDC. If random is chosen, the user can use nr to specify the number of random starts. The maximum number of iterations for GPDC and TPDC is set to 100 and can be changed using the parameter iter.
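Analogously, hypothetical GPDC and TPDC calls using the options just described (all values illustrative) could be:

```r
res_g <- GPDC(X, k = 3, method = "kmedoids")          # k-medoids initialization
res_t <- TPDC(X, k = 3, method = "random", nr = 10,   # 10 random starts
              iter = 200)                             # raise the iteration cap
```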

The FPDC algorithm has a higher complexity; see Tortora et al. (2016) for details. Moreover, it requires two more input parameters than PDC: the number of factors for the variables and the number of factors for the units. The function TuckerFactors, available in FPDclustering, calls a graphical procedure that helps the user choose them heuristically. In particular, the plot represents, in a Cartesian plane, the explained variance for several combinations of retained factors for the two ways (units and variables, with the number of occasions set to \(K-1\)). More details and an example are available in the following sections.

4 FPDclustering package overview

The FPDclustering package contains the functions listed in Table 2.

Table 2 Summary of the functions contained in the FPDclustering package, version 2.3.1, available on CRAN (the Comprehensive R Archive Network), http://cran.r-project.org/

The following sections are devoted to presenting one example per data type. The results are displayed by using the commands plot and summary. They recognize the functions’ output format and produce a different output depending on the function used, the data type, and the dimension of the data set. In all the following sections, the phrase “belonging probability” refers to the belonging degree of a certain unit to each cluster, namely the membership function; the unit is assigned to a cluster in accordance with the maximum degree.

4.1 Visualization

The package also provides a graphical approach to assess the algorithm’s overall ability to find well-separated and homogeneous groups. Inspired by the density-based silhouette, first proposed by Menardi (2011) as a parametric adaptation of the well-known Rousseeuw silhouette (Rousseeuw 1987), the density-based silhouette plot is a helpful tool to evaluate the final result ex post. Equation (6) defines \(dbs_i\) for the generic observation \(\textbf{x}_i\)

$$\begin{aligned} dbs_i = \frac{ \log \left( \frac{p_{ik}}{p_{il}}\right) }{\max \limits _{i=1, \ldots , n} \left| \log \left( \frac{p_{ik} }{p_{il}}\right) \right| }, \end{aligned}$$
(6)

where k is such that \(\textbf{x}_i\) belongs to cluster k, i.e., \(p_{ik}\) is the largest probability for unit i, and l is such that \(p_{il}\) is maximum for \(l\ne k\) (Tortora et al. 2013). Clusters are sorted in decreasing order according to the dbs index and then plotted on a bar chart. The function Silh takes the \(n\times K\) probability matrix P as input, and its output is the average dbs.
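A direct transcription of Eq. (6), given only a probability matrix P, could be sketched as follows; the helper name dbs_from_P is illustrative, and in practice the package computes this quantity through Silh.

```r
# density-based silhouette of each unit from an n x K probability matrix P
dbs_from_P <- function(P) {
  ord <- t(apply(P, 1, sort, decreasing = TRUE))  # p_ik (largest) and p_il (second largest)
  lr  <- log(ord[, 1] / ord[, 2])
  lr / max(abs(lr))                               # normalized as in Eq. (6)
}
```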

The FPDclustering package also includes a plot function that produces a silhouette plot and a scatter plot with different colors per cluster, as well as a parallel coordinate plot.

5 Applications

In this section, we apply the described functions to three different data sets.

5.1 Continuous data

The data set Unsupervised Learning on Country Data contains ten variables recorded on 167 countries (Kokkula 2022). The goal is to categorize the countries using socio-economic and health indicators that determine each country’s overall development. The data set was donated by HELP International, an international humanitarian NGO that needs to identify the countries most in need of aid and asked the analysts to categorize them. The data set does not require data cleaning. Since the variables are measured in different units, the data need to be scaled, and the name of Micronesia needs to be converted to the standard name fsm. We fixed the number of clusters to 4 and used the default kmedoids initialization to start the algorithms.

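A minimal sketch of these steps, assuming the data have been downloaded as a file named Country-data.csv (the file name, object names, and country spelling below are illustrative), is:

```r
library(FPDclustering)

country <- read.csv("Country-data.csv", row.names = 1)   # hypothetical file name
# illustrative renaming; the exact original spelling may differ
rownames(country)[grep("Micronesia", rownames(country))] <- "fsm"
X <- scale(country)                                       # variables on different scales

res_pdq  <- PDQ(X, k = 4)    # default k-medoids initialization
res_gpdc <- GPDC(X, k = 4)
res_tpdc <- TPDC(X, k = 4)
```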

The results of the three methods implemented in the package for continuous data, PDQ, GPDC, and TPDC, are quite different.

To compare the clustering results among the methods, we use the Adjusted Rand Index (ARI) (Sundqvist et al. 2023); the expected value of the ARI is zero under random classification, and it is equal to one when there is perfect agreement between two partitions. The ARI can be obtained using the ARI() function of the MixGHD package (Tortora et al. 2019).

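Continuing the sketch above, and assuming the cluster labels are stored in a label component of each result (an assumption about the output structure), the pairwise ARIs could be computed as:

```r
library(MixGHD)

ARI(res_pdq$label,  res_gpdc$label)   # PDQ vs GPDC
ARI(res_pdq$label,  res_tpdc$label)   # PDQ vs TPDC
ARI(res_gpdc$label, res_tpdc$label)   # GPDC vs TPDC
```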

Using the function Silh, we obtain the average silhouette value, which is a measure of the quality of the partition.

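For instance, again assuming a probability component in each output:

```r
Silh(res_pdq$probability)
Silh(res_gpdc$probability)
Silh(res_tpdc$probability)
```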
Fig. 4 Output of the function plot on PDQ output: silhouette plot and parallel coordinate plot, where each color represents a cluster

Fig. 5 Output of the function plot on PDQ output: scatter plot where each color represents a different cluster

As expected, the ARI shows that PDQ gives results different from those of GPDC and TPDC, which are more similar to each other. The function Silh also shows the silhouette plot; an example using PDQ can be seen in Fig. 4a. In this case, clusters 1 and 4 have some countries with very high belonging probability, and cluster 3 has a lower probability for all the points. The plot also emphasizes some borderline countries with lower belonging probability. This can also be observed in the scatter plot matrix in Fig. 5, where the clusters clearly overlap.

In this example, GPDC gives the best result using the silhouette value as a quality measure. Two functions, summary and plot, can help in interpreting the results; plot produces the graphs in Figs. 4a, b, and 5.

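For the GPDC result, for example, this amounts to:

```r
summary(res_gpdc)   # number of elements per cluster
plot(res_gpdc)      # silhouette, parallel coordinate, and scatter plot matrix
```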

The summary shows the number of elements per cluster, and the plot shows the silhouette plot, a parallel coordinate plot, and a scatter plot matrix; in the latter two plots, each cluster corresponds to a different color.

To visualize the differences among the algorithms, Figs. 6, 7, and 8 show the world map colored according to the clustering partition and the corresponding parallel coordinate plot for PDQ, GPDC, and TPDC, respectively. The map is obtained using the package rworldmap (South 2011). The following code shows how to create the map using the results from PDQ; the other two maps can be created analogously.

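A sketch of such a map, assuming the country names are stored as the row names of the data and the PDQ labels in res_pdq$label (an assumed component name), is:

```r
library(rworldmap)

cluster_df <- data.frame(country = rownames(country),
                         cluster = factor(res_pdq$label))
map_data <- joinCountryData2Map(cluster_df, joinCode = "NAME",
                                nameJoinColumn = "country")
mapCountryData(map_data, nameColumnToPlot = "cluster",
               catMethod = "categorical", mapTitle = "PDQ clusters")
```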
Fig. 6 Map and parallel coordinate plot where each color represents a cluster obtained using PDQ

Fig. 7 Map and parallel coordinate plot where each color represents a cluster obtained using GPDC

Fig. 8 Map and parallel coordinate plot where each color represents a cluster obtained using TPDC

PDQ does not account for differences across clusters in the correlations between variables, nor for the within-cluster variability, and therefore partitions the countries into four relatively balanced clusters. Cluster 1 mainly contains African countries with higher child mortality and fertility but lower life expectancy. Cluster 2 mainly contains Eastern European countries and is primarily characterized by higher exports and imports but lower fertility. Units in Cluster 3 are mostly from Asia and South America; the cluster exhibits average values in most categories. Cluster 4 contains countries from North America, Western Europe, Australia, and Japan, and it is the only cluster identified by all three algorithms. On average, the countries in this cluster have higher per capita income, imports, and life expectancy. The partitions obtained when the cluster variability and the correlation among variables are taken into account, i.e., using GPDC and TPDC, are different. In both cases, there is a cluster, cluster 3 for GPDC and cluster 4 for TPDC, which contains the countries with, on average, higher income, imports, and life expectancy.

A few countries, like Brazil, can be seen as outliers; only TPDC puts Brazil in the same cluster as North America and Western Europe. The other major difference between these two methods concerns the smaller clusters. The GPDC algorithm finds that Croatia, Poland, Slovenia, and Uruguay form a cluster with average values. Bahrain, Kuwait, and Saudi Arabia form another cluster with average values but lower health and higher income. TPDC, instead, finds that Bulgaria, Croatia, Czech Republic, Latvia, Lithuania, Poland, and South Korea form a cluster with overall average values but lower fertility. Barbados and El Salvador form their own cluster.

This example emphasizes how PDQ finds clusters based on a different criterion compared to GPDC and TPDC. The correlations among variables have a big impact on GPDC and TPDC; therefore, the resulting clusters may contain observations with different values per variable in terms of magnitude but a similar correlation structure. PDQ, instead, does not use this information; it is mainly based on each variable’s average value. Depending on the goal, a different method may be a better fit.

More details on the clusters can be obtained by looking at the estimated parameters; as an example, we show below only the first two columns of the centers of the clusters obtained using TPDC, i.e., child_mort and exports.

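Assuming the estimated centers are returned in a centers component (an assumed name), this corresponds to:

```r
res_tpdc$centers[, 1:2]   # child_mort and exports columns of the TPDC centers
```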

When available, the centers, the sigma values, and the degrees of freedom (df) can be used for interpretation. This complements what is already shown in the plots; for example, when using TPDC, cluster 1 has, on average, higher child mortality and lower exports.

5.2 Mixed-type data

To show the usage of PDQ with mixed-type data, we use a data set collected in 2022 that contains ten variables recorded on a convenience sample of 253 students enrolled in the first year at the University of Naples Federico II and attending an introductory Statistics course.Footnote 1 Gender (male, female) and prior knowledge of Statistics (yes, no) are binary variables; high school type (lyceum, technical, vocational), course modality of attendanceFootnote 2 (in presence, at home, mixed), and parents’ education degree are categorical variables; the others are continuous and refer to five scales assessing: statistical anxiety (Chiesi et al. 2011, SAS), the mathematical prerequisites for psychometrics (Galli et al. 2008, PMP), the relative autonomy index (Alivernini and Lucidi 2008, RAI), self-efficacy (Bonanomi et al. 2018, S_EFF), and cognitive competence (Chiesi and Primi 2009, COG). These variables were centered and rescaled to unit variance. The package requires specifying the type of data in each column using the position, as in the following example.

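A sketch of this specification, assuming the data frame is called students (an illustrative name), is:

```r
# column indices by variable type in the `students` data frame
bin_cols  <- c(1, 3)     # binary variables
cat_cols  <- c(2, 4, 5)  # categorical variables
cont_cols <- 6:10        # continuous variables (the five scales)
```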

Binary variables are in columns 1 and 3, categorical variables are in columns 2, 4, and 5, and continuous variables are in columns 6 to 10. Binary and categorical variables are defined as factor types, while continuous variables (numerical) must be standardized.

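Continuing the sketch, these two preprocessing steps might look as follows:

```r
# declare binary and categorical columns as factors, standardize the continuous ones
students[, c(bin_cols, cat_cols)] <- lapply(students[, c(bin_cols, cat_cols)], factor)
students[, cont_cols] <- scale(students[, cont_cols])
```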

We can now apply the PDQ function. The continuous variables are standardized, so their variance is 1, while the variability of the binary and categorical variables is higher. To avoid over-weighting the non-continuous variables, specific weights can be assigned to each variable using the input parameter w. Choosing the most appropriate weights is a matter that has interested several authors; however, a definitive answer does not exist (Foss et al. 2019). In this case, the weight of each non-continuous variable is set to 0.033 based on the variability measured using the normalized Gini heterogeneity index. As the number of clusters is unknown, we explore the results with 2 and 3 clusters and then choose the most appropriate by observing the silhouette plots in Fig. 9. The distance parameter is set to gower since the data are of mixed type.

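The calls might be sketched as follows; the ordering of the weights and the output component names are assumptions based on the description above.

```r
w <- c(rep(0.033, 5), rep(1, 5))   # 0.033 for the five non-continuous variables

res2 <- PDQ(students, k = 2, dist = "gower", w = w,
            bin = bin_cols, cat = cat_cols, cont = cont_cols)
res3 <- PDQ(students, k = 3, dist = "gower", w = w,
            bin = bin_cols, cat = cat_cols, cont = cont_cols)

Silh(res2$probability)
Silh(res3$probability)
```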

The average silhouette values are very similar, slightly lower for more clusters, as expected. Looking at the shape of the plots, the three-cluster plot has less steep slopes, which means more points with higher silhouette values; we therefore choose 3 clusters.

Fig. 9 Silhouette plots for 2 and 3 clusters on the Student data

Since no initialization criterion is specified, the default one is used, i.e., k-medoids for the continuous variables and k-modes for the categorical variables. The summary function displays the number of elements per cluster.

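For the chosen three-cluster solution this amounts to:

```r
summary(res3)   # number of elements per cluster
```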

The function plot produces Figs. 10 and 11. Categorical variables can be described using bar plots; see Fig. 12. The following code shows how to obtain the plot for the variable Sex using the package ggplot2 (Wickham 2016); the code can be easily extended to plot the other variables.

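A sketch of a normalized bar plot of one categorical variable against the clusters, assuming the variable is stored in a column named Sex and the labels in res3$label (both assumed names), is:

```r
library(ggplot2)

plot_df <- data.frame(Sex = students$Sex,            # hypothetical column name
                      cluster = factor(res3$label))  # assumed label component
ggplot(plot_df, aes(x = Sex, fill = cluster)) +
  geom_bar(position = "fill") +                      # normalized bars
  labs(y = "proportion")
```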
Fig. 10 Parallel coordinate plot of the numerical variables of the Student data set; each color represents a cluster

Fig. 11 Scatter plot matrix of the numerical variables of the Student data set; each color represents a cluster

Fig. 12 Bar plots of the categorical variables of the Student data set

The parallel coordinate plot in Fig. 10 is a powerful tool for interpreting the output of a cluster analysis with continuous variables. Each line corresponds to a statistical unit, and a bundle of lines of the same color represents a group. Looking at the plot, we can see that Cluster 1, the smallest one, contains “average” students for most variables. Cluster 2 includes the students who expressed high values of anxiety for learning Statistics (SAS) and low values of self-efficacy. The normalized bar plots in Fig. 12 show the distributions of the clusters within the binary and nominal variable categories. It is worth noting that most of the students in Cluster 2 hold a diploma from a psycho-pedagogical, linguistic, or technical high school track. The biggest cluster, Cluster 3, includes less anxious students with high self-efficacy; most students in this cluster hold a lyceum diploma. This example shows how the categorical, binary, and continuous variables have all contributed to the cluster partition.

5.3 High-dimensional data

In this section, we use the Wisconsin diagnostic breast cancer (WDBC) data (Mangasarian et al. 1995) available in the package mclust (Scrucca et al. 2016). The data set contains records for 569 patients and 30 variables corresponding to features of the cell nuclei obtained from a digitized image of a fine needle aspirate (FNA) of a breast mass. For each patient, the cancer was diagnosed as malignant or benign; therefore, we assume there are two clusters. The rather large number of variables suggests reducing the dimensionality. Factor PD clustering exploits the Tucker3 properties to reduce the dimensionality of the whole problem. Tortora et al. (2016) demonstrated the formal consistency between Tucker3 and PD clustering. Therefore, the function input list must also include the number of factors to retain. The function TuckerFactors can guide the user through this choice: it produces a table with the explained variability per combination of factors and the plot in Fig. 13.

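A sketch of this step, assuming the wdbc data frame in mclust stores an identifier and the diagnosis in its first two columns followed by the 30 numerical features (an assumption about the data layout), is:

```r
library(mclust)
data(wdbc)

X <- scale(wdbc[, -(1:2)])   # assumed layout: ID and Diagnosis in the first two columns
TuckerFactors(X, K = 2)      # explained variability per combination of factors
```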
Fig. 13 TuckerFactors output: explained variability per combination of Tucker3 factors of the wdbc data set

The plot suggests four factors for the variables and four factors for the units since, after that, the increase in explained variability is smaller. We can now run the function FPDC.

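A hypothetical call using four factors for both the variables and the units, where the argument names nf and nu for the two numbers of factors are assumptions, is:

```r
fpdc_class <- FPDC(X, k = 2, nf = 4, nu = 4)   # nf: factors for variables, nu: factors for units
table(wdbc$Diagnosis, fpdc_class$label)        # compare the clusters with the diagnosis
```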

Using the function table, we can see that the method identifies malignant and benign tumors with an error rate of 0.0896.

The function returns the belonging probability matrix as part of its output; the result can also be visualized using plot(fpdc_class), which generates the silhouette plot (Fig. 14) and parallel coordinate plots, not shown for the sake of space.

The silhouette plot in Fig. 14 shows that, while most points are well clustered, some points have a belonging probability close to 0.5. These points are between the clusters and may require further attention. Therefore, instead of using the crisp partition for the entire data set, an alternative option is to assign a crisp partition to the points with belonging probability higher than a certain threshold and let an expert in the field analyze the points between clusters.

Fig. 14 Silhouette plot obtained clustering the wdbc data set using FPDC

To investigate which points have a lower belonging probability, we can look at the matrix of probabilities:

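One way to carry out this inspection, assuming the probability matrix is stored in a probability component (an assumed name), is sketched below.

```r
probs <- fpdc_class$probability            # assumed component name
uncertain <- apply(probs, 1, max) < 0.6    # units with no belonging probability above 0.6

table(wdbc$Diagnosis[!uncertain], fpdc_class$label[!uncertain])  # error rate on the confident units
sum(uncertain)                                                   # units left for expert review
```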

In this example, we used a threshold of 0.6; the threshold is subjective and depends on the number of clusters. If we do not assign observations with a belonging probability lower than 0.6 to a cluster, the error rate decreases to 0.0119, with only one malignant tumor classified as benign.

6 Concluding remarks

Data clustering, i.e., grouping units into homogeneous groups, is probably one of the most widely used numerical methods for extracting knowledge from data. No single optimal clustering algorithm exists: the best choice depends on the domain and on the specific cluster definition. However, PDC and PDQ may represent a good alternative to model-based clustering when the number of variables is large and the distributional assumptions cannot be verified. Gaussian PDC and Student-t PDC ensure more flexibility than the better-known geometric approaches. FPDC is computationally feasible even with many variables and provides good stability and efficient estimation of the groups’ centers. Mixed-data clustering remains a challenging task, and PDQ for mixed-type data offers a good answer and an interesting perspective for future work. This article illustrates the FPDclustering package, which implements the probabilistic distance clustering algorithm and its variants; together, they allow the user to face a vast range of clustering problems, including those with a large number of numerical variables or with mixed-type variables. The article briefly summarizes the methodological aspects of the algorithms and refers interested readers to the literature; it also offers several examples that illustrate the package’s functionalities and capabilities.