1 Introduction

Cluster analysis, a form of unsupervised learning, is a branch of Machine Learning with applications in various domains, including financial fraud detection, medical diagnosis, image processing, information retrieval and bioinformatics [1]. Clustering is used to uncover initially hidden patterns by identifying groupings within the dataset under investigation in an unsupervised manner: no labels (or classes) are provided as input. Identifying clusters without prior knowledge of the outcomes makes this task difficult, challenging and prone to errors. Clustering methods can be classified into five types: hierarchical [2, 3], partitional [4,5,6,7,8,9,10,11,12,13,14,15,16], density-based [17, 18], grid-based [19] and model-based methods [20]. The aim of cluster analysis is to partition a dataset composed of N observations embedded in a d-dimensional space into k distinct clusters. In this process, data points within the same cluster have more similar characteristics to each other than to observations in other clusters, according to a specific measure called a distance metric [21].

Clustering qualitative data is problematic because such data lack geometric properties. For example, categorical attributes are unordered, so it is inappropriate to use traditional numerical distance functions to capture the resemblance between categorical values. To overcome this limitation, many methods have been proposed to deal with categorical data, such as the k-modes and its variants [4,5,6,7,8,9,10,11,12,13,14,15,16]. These methods start with K initial centroids and use the alternating minimization method to solve a non-convex optimization problem. In the k-modes, the simple matching dissimilarity measure is used to compare two categorical values: the comparison yields a difference of zero for two identical values and one otherwise. However, for all these methods, one main issue remains the identification of the initial number of clusters [22].

The mode, representing the most frequent patterns (the most frequent modality in each attribute) in the cluster, is randomly selected in the last step of the clustering process in all these methods. However, more than one mode can often be identified, depending on the modalities' frequencies in the attributes. In the iterative process of k-modes type clustering algorithms, the mode is also recomputed when moving from the ith iteration to the (i + 1)th iteration. The selection of this mode is essential and directly influences the formation of the final clusters. Random mode selection, widely used in the literature, frequently causes the clustering process to terminate in a locally optimal solution and therefore does not ensure high performance. Moreover, the random selection of modes during the clustering process was not addressed in previous methods and is thus an interesting issue to consider.

As is widely known, real data are hard to partition at their borders, as there is often no sharp boundary between clusters. Most of the proposed formal modeling tools are deterministic and precise, which does not fit real-world situations that are very often non-deterministic and cannot be described precisely. This fact requires integrating uncertainty-based models into the clustering process. The first attempts to handle uncertainty relied on fuzzy theory, such as the fuzzy k-means [23], the fuzzy k-modes [24] and their variants. In fuzzy clustering, each object can have membership degrees to more than one cluster instead of the hard assignment used in k-modes based methods. However, fuzzy algorithms have the same limitations as hard algorithms, i.e. they require multiple runs with different centroid initializations to ensure stability. Fuzzy methods also need to adjust a control parameter for the membership fuzziness to obtain better solutions, which is a complex task even through extensive experiments. Another attempt to handle uncertainty while avoiding the limitations of fuzzy sets uses Rough Set Theory (RST), introduced by Pawlak [25]. One main reason for the success of RST is that no additional information, such as thresholds or expert knowledge in a particular domain, is required to start the clustering process. In recent years, RST has attracted much attention in a variety of fields [26] such as computer vision [27, 28], biomedical engineering [29] and economy and finance [30].

In this paper, it is proposed to tackle mode identification in categorical clustering using an uncertainty-based model. The proposed method, called Density Rough k-Modes (DRk-M), aims to select the most appropriate modes in each iteration of the clustering process. This method can be implemented either for the k-modes or any of its variants. The rough mode selection permits identifying the most central centroid for each cluster and thus ensures better clustering performance. To better illustrate this notion and the main differences between the DRk-M and state-of-the-art methods, Fig. 1 is proposed.

Fig. 1 The k-modes algorithm and its variants compared to the DRk-M

In Fig. 1, the input is a categorical dataset composed of N observations described by d attributes. Although many previous studies considered the issue of the selection of the initial number of clusters [22], these methods do not fall under the scope of this study. Figure 1 describes the clustering process of the k-modes and its variants. In the first phase, K initial centroids are randomly selected. These centroids correspond to the starting point of the partitioning process. However, this random selection may lead to incorrect results since elements that are supposed to be part of the same cluster can be selected as centroids and thus end up in different clusters. Many initialization methods were proposed in the literature to choose the most representative initial centroids [8, 35,36,37, 42]. In the second step of the k-modes, the (N-K) remaining observations are assigned to the clusters according to the dissimilarity, computed with the simple matching dissimilarity metric, between these points and the centroids. For this step, some variants of the k-modes use other distance metrics to enhance the algorithm [7, 10, 24, 38]. Once all the observations are assigned to their corresponding clusters, the centroids are updated in the last step of the algorithm and the new mode is computed for each group. The DRk-M tackles this last centroid updating step. As shown in Fig. 1, no previous method was presented in the literature to investigate this issue.

The DRk-M has some contributions and characteristics that can be summarized as follows:

  • The DRk-M is a categorical clustering approach. Categorical clustering is a challenging task since categorical data have no inherent order or geometric structure.

  • The random selection of the mode is a critical issue in categorical clustering. Avoiding this limitation permits obtaining more accurate and stable results.

  • Since more than one possible mode can be generated in a given cluster in each iteration, the RST can be efficiently used in this context to bring more certainty and accuracy to the final results.

  • The convergence, performance and scalability of the DRk-M under the new mode selection method were investigated.

This paper is organized as follows: In Sect. 2, categorical clustering is detailed. Some state-of-the-art categorical clustering methods are presented and classified according to their contribution compared to the k-modes. In Sect. 3, the DRk-M is detailed and the underlying theories and concepts are given. The algorithm and a complexity analysis are also given in this section. In Sect. 4, an experimental analysis is provided using several datasets and the DRk-M is compared to many state-of-the-art algorithms. Finally, in the last section, conclusions and perspectives are provided.

2 State of the art of categorical clustering

2.1 Clustering based on prototypes

A cluster is commonly characterized by a centroid that measures its centrality. Clustering methods based on prototypes are usually called partitional methods. The number of clusters K is required as an input parameter and the centroids are iteratively updated until a stop criterion is reached. One of the prototype-based algorithms proposed in the literature is the k-means with its multiple variants [5, 6]. However, because large categorical datasets exist in many applications, such as environmental data analysis [31], market basket data analysis [32], DNA or protein sequence analysis [33] and text mining [34], the k-means cannot be used in such applications. Thus, developing more appropriate algorithms to handle categorical clustering has been an active research subject over the last twenty years.

Partitional methods consecutively divide the dataset until K clusters are found in a static configuration where no further change occurs. The partitional process can be described as follows:

Let CEN = {Cen1,Cen2,…,Cenk} be the set of centroids for the clusters \({\mathcal{C}}_{j}\) where j = 1,…,K and X = {obs1, obs2,…,obsN} the initial dataset with size N. For each observation obsi \(\in\) X.

Step 1: assign obsi to the cluster \({\mathcal{C}}_{j}\) whose centroid Cenj verifies: Cenj = argminj=1,…,K{d(obsi,Cenj)}, where d is a dissimilarity metric.

Step 2: update the centroid Cenj according to a given procedure depending on the data type involved (mean, median or mode).

Step 3: if the stopping criteria are met, the process ends; otherwise, repeat from Step 1.
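As an illustration, a minimal Python sketch of this generic loop is given below; the dissimilarity function d and the centroid-update routine are placeholders supplied by the caller, not part of the original paper.

```python
import random

def partitional_clustering(X, K, d, update_centroid, max_iter=100):
    """Generic prototype-based clustering loop (Steps 1-3 above).

    X               : list of observations
    d               : dissimilarity function d(obs, centroid)
    update_centroid : recomputes a centroid from a cluster's members
                      (mean, median or mode depending on the data type)
    """
    centroids = random.sample(X, K)          # initial prototypes
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest centroid
        new_assignment = [min(range(K), key=lambda j: d(obs, centroids[j]))
                          for obs in X]
        # Step 3: stop when no observation changes cluster
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: update each centroid from the observations assigned to it
        for j in range(K):
            members = [X[i] for i in range(len(X)) if assignment[i] == j]
            if members:
                centroids[j] = update_centroid(members)
    return assignment, centroids
```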

2.2 Categorical clustering: the k-modes and its variants

A categorical information system can be described as a quadruple IS = (U, A, V, f), where:

  1. U = {x1,x2,...,xN} is the nonempty set of N data points, called a universe;

  2. A = {a1,a2,...,ad} is the nonempty set of d categorical attributes;

  3. V is the union of attribute domains, i.e., \(V={\bigcup }_{j=1}^{d}{V}_{{a}_{j}}\), where \({V}_{{a}_{j}}=\left\{{a}_{j}^{1},{a}_{j}^{2},\dots ,{a}_{j}^{\left({n}_{j}\right)}\right\}\) is the value domain of categorical attribute aj and is finite and unordered, i.e., for any 1 ≤ p ≤ q ≤ nj, either \({a}_{j}^{p}={a}_{j}^{q}\) or \({a}_{j}^{p}\ne {a}_{j}^{q}\). Here, nj is the number of categories of attribute aj for 1 ≤ j ≤ d;

  4. f: U × A → V is the information function, such that f(xi, aj) \(\in\) \({V}_{{a}_{j}}\) for every xi \(\in\) U and aj \(\in\) A.

The k-modes [4] is based on optimizing a cost function given in Eq. (1):

$$P\left(W,Q\right)=\sum_{l=1}^{K}\sum_{i=1}^{N}\sum_{j=1}^{d}{\omega }_{li}\,\mathcal{D}\left({obs}_{ij},{cen}_{lj}\right)$$
(1)

Q = {cen1, cen2, …, cenK} represents the set of cluster modes and W = [ωli] is a K × N binary {0,1} matrix that encodes the membership of each observation obsi in each cluster. This matrix verifies the constraints given in Eqs. (2) and (3):

$$\sum_{l=1}^{K}{\omega }_{li}=1,\quad 1\le i\le N$$
(2)
$$0<\sum_{i=1}^{N}{\omega }_{li}<N,\quad 1\le l\le K$$
(3)

The k-modes adapts the k-means to categorical datasets through three modifications:

  (i)

    The simple matching dissimilarity measure \(\mathcal{D}\) is used to evaluate the dissimilarity between an observation obsi and a mode cenl:

    $$\mathcal{D}\left({cen}_{l},{obs}_{i}\right)=\sum_{j=1}^{d}\delta \left({cen}_{lj},{obs}_{ij}\right)$$
    (4)

    where

    $$\delta \left({cen}_{lj},{obs}_{ij}\right)=\left\{\begin{array}{ll}0, & \text{if } {cen}_{lj}={obs}_{ij}\\ 1, & \text{if } {cen}_{lj}\ne {obs}_{ij}\end{array}\right.$$
    (5)
  (ii)

    The centroids are called modes, where a mode of a categorical dataset U described by d attributes is a vector Q = [q1, q2,…, qd] that minimizes the quantity defined in Eq. (6):

    $$L\left(U,Q\right)=\sum_{i=1}^{N}\mathcal{D}\left({obs}_{i},Q\right)$$
    (6)

    The mode Q is the vector of the most frequent modalities, one per attribute.

  (iii)

    A frequency-based method is used to update the modes during the clustering process so as to minimize the cost function.
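For illustration, a short Python sketch of the simple matching dissimilarity (Eqs. 4-5) and the frequency-based mode computation (Eq. 6) is given below; the function names are ours and the snippet is only a sketch of these definitions, not the authors' implementation.

```python
from collections import Counter

def simple_matching(obs, cen):
    """Simple matching dissimilarity (Eqs. 4-5): number of mismatched attributes."""
    return sum(1 for o, c in zip(obs, cen) if o != c)

def mode_of(cluster):
    """Frequency-based mode (Eq. 6): most frequent modality in each attribute."""
    d = len(cluster[0])
    return tuple(Counter(obs[j] for obs in cluster).most_common(1)[0][0]
                 for j in range(d))

# usage: distance between two categorical records and the mode of a small cluster
print(simple_matching(('a', 'e', 'l'), ('a', 'f', 'l')))                  # -> 1
print(mode_of([('a', 'e', 'l'), ('a', 'f', 'l'), ('d', 'e', 'l')]))       # -> ('a', 'e', 'l')
```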

Many variants of the k-modes have been proposed to improve its efficiency and scalability from different perspectives. In the rest of this section, these methods are reviewed and grouped according to their main research axis.

2.2.1 Initialization methods

Most partitional categorical clustering methods such as the k-modes and its variants require pre-defining the number of clusters K as well as selecting the initial centroids, which is a major limitation. The initialization of the centroids can have a high impact on the final clustering results, and different initializations can lead to different output clusters. The initial centroids are usually selected at random, which is also problematic since one may select initial centroids with very similar characteristics. Due to its simplicity, random initialization has been widely used. In order to mitigate the negative effects of random initialization, these algorithms generally need to be rerun many times with different initializations [8, 35,36,37].

In [8], the authors proposed an initialization method based on density and distance measures. In this method, the initial dataset is split into several subsets based on its attributes. Thus, it becomes possible to discard some data points from the potential set of initial centroids. Then, the most frequent attribute value is spotted in each attribute domain to compose its representative point and generate the centroid. The proposed method’s computational cost is O(2Nm|V| + |V| + mK²|V|), which is linear with respect to the number of data points, where \(\left|V\right|=\sum_{j=1}^{m}{n}_{j}\), m is the number of categorical attributes and nj is the number of modalities of the jth attribute. However, the method is not appropriate when the number of clusters is large.

In [36], the authors proposed an advanced method to identify prominent attributes that correspond to the dataset’s most relevant attributes. A multiple clustering of data based on the attributes is then performed to spot interesting initial centroids. The method performs multiple clustering on different attributes in the original data space and uses distinct attribute values in an attribute as cluster labels. These multiple views provide new insights into the data’s hidden structures to find consistent cluster structure and aid in computing better initial centroids. Three approaches were presented to select different attribute spaces that can help in generating different clustering views from the data, namely:

  • Vanilla approach: this method considers all the attributes (m) present in the dataset.

  • Prominent attributes: only a few attributes may be useful to generate multiple clustering views.

  • Significant attributes: a set of attributes generated from the prominent attributes will be retained.

In [37], the initialization of the centroids was considered from the viewpoint of outlier detection. Two different initialization algorithms were proposed: a distance-based technique called Ini_Distance and an entropy-based outlier detection technique called Ini_Entropy, both within the RST framework. These two techniques compute the degree of outlierness of each object. The complexity of these two methods is given as follows:

  • The complexity of Ini_Distance is O(m × N²), which makes it unsuitable for large datasets.

  • The complexity of Ini_Entropy is O(KmN + m²N), which makes it unsuitable for high-dimensional datasets.

2.2.2 Cost function and distance based methods

In [7], the authors proposed an enhanced version of the k-modes by integrating a between-cluster similarity term into the optimization function. This term is defined as follows:

$$B\left(W,Z\right)=\sum_{l=1}^{K}\sum_{i=1}^{N}{\omega }_{li}S\left({Z}_{l}\right)$$
(7)

S(Zl) denotes the similarity between the lth cluster represented by zl and other clusters and \(\sum_{i=1}^{N}{\omega }_{li}\) its weight which is the number of objects in the lth cluster. Thus, it becomes possible to simultaneously minimize the within-cluster dispersion and enhance the between-cluster separation. The corresponding objective function is given as follows:

$$F\left(W,Z,\gamma \right)=\sum_{l=1}^{k}\sum_{i=1}^{N}{\omega }_{li}d\left({z}_{l},{x}_{i}\right)+\gamma \sum_{l=1}^{k}\sum_{i=1}^{N}{\omega }_{li}S\left({z}_{l}\right)$$
(8)

The parameter γ maintains a balance between the effect of the within-cluster information and that of the between-cluster information on the minimization process. In their proposal [7], the authors applied this enhancement to three methods: Huang's version of the k-modes, Ng's version of the k-modes and the weighted k-modes version.

In [38], a new dissimilarity measure is defined for the k-modes. The measure is based on the idea that the similarity between a data object and a cluster mode is directly proportional to the sum of the relative frequencies of the values they share in the mode's cluster. Formally, the new dissimilarity measure is:

$$d\left({X}_{i},{Q}_{l}\right)=\sum_{j=1}^{m}\phi \left({x}_{ij},{q}_{lj}\right)$$
(9)

where

$$\phi \left({x}_{ij},{q}_{lj}\right)=\left\{\begin{array}{ll}1-{f}_{r}\left({A}_{j}={q}_{lj}|{X}_{l}\right), & {x}_{ij}={q}_{lj}\\ 1, & {x}_{ij}\ne {q}_{lj}\end{array}\right.$$
(10)

Note that \({f}_{r}\left({A}_{j}={q}_{lj}|{X}_{l}\right)\) is the frequency of qlj in cluster Xl.
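A small Python sketch of this frequency-based dissimilarity (Eqs. 9-10) is given below, assuming the relative frequencies are precomputed from the members of cluster Xl; the helper names are illustrative.

```python
from collections import Counter

def attribute_frequencies(cluster):
    """Relative frequency of each modality in each attribute of a cluster."""
    n, d = len(cluster), len(cluster[0])
    return [{v: c / n for v, c in Counter(obs[j] for obs in cluster).items()}
            for j in range(d)]

def freq_based_dissimilarity(obs, mode, freqs):
    """Eqs. (9)-(10): phi = 1 - f_r(q_lj | X_l) on matches, 1 on mismatches."""
    return sum((1.0 - freqs[j].get(mode[j], 0.0)) if obs[j] == mode[j] else 1.0
               for j in range(len(obs)))
```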

2.2.3 Uncertainty based methods

One of the first attempts to handle uncertainty relied on fuzzy set theory. As an extension of the fuzzy k-means, the fuzzy k-modes [24] was proposed and many variants were also developed [10]. In this algorithm, each pattern or object can have membership degrees to all clusters rather than a strict membership to exactly one cluster. In the fuzzy k-modes, the objects of the universe U are put into k clusters by finding W and CEN that minimize the objective function given in Eq. (11):

$$F\left(W,CEN\right)=\sum_{l=1}^{K}\sum_{i=1}^{N}{\omega }_{li}^{\alpha }d\left({cen}_{l},{obs}_{i}\right)$$
(11)

subject to

$$\left\{\begin{array}{l}{\omega }_{li}\in \left[0,1\right],\quad 1\le l\le K,\ 1\le i\le N\\ \sum_{l=1}^{K}{\omega }_{li}=1,\quad 1\le i\le N\\ 0<\sum_{i=1}^{N}{\omega }_{li}<N,\quad 1\le l\le K\end{array}\right.$$

\(\alpha\) ∈ [1, + ∞[ is the fuzzy index; W = [wli] is a K × N real matrix, wli is the membership degree of obsi to the lth cluster; CEN = {cen1, cen2, …, cenk}, cenl = [cenl1, cenl2, …, cenlm] is the lth cluster prototype with categorical attributes a1, a2,..., am;

d(\({cen}_{l},{obs}_{i}\)) is the simple matching dissimilarity measure as defined by Huang.

The method finds fuzzy cluster modes when a simple matching dissimilarity measure is used for categorical objects.

In [10], the authors proposed a fuzzy categorical clustering algorithm where the fuzzy k-modes’ objective function was modified by adding a between-cluster information term. This consideration permitted simultaneously minimizing the within-cluster dispersion and enhancing the between-cluster separation. To obtain the modified objective function’s local optimal solutions, the corresponding update formulas of the membership matrix and the cluster prototypes were derived. In their methods, the authors integrated the within-cluster and between-cluster information to update the membership matrix and cluster prototypes, which can effectively produce clustering results with high within-cluster similarity and low between-cluster similarity.

By assigning confidence degrees to objects in different clusters, the clusters' core and boundary objects can be identified, which provides more useful information when dealing with boundary objects. However, the final fuzzy clustering outputs are still influenced by the mode initialization and the processing order of the objects in the datasets. Furthermore, these methods need to adjust a control parameter for the membership fuzziness, and in practice it is not clear how to find its optimal value. It is often selected based on the decision makers' prior knowledge of the domain, their intuition or proposed criteria.

On the other hand, RST, proposed by Pawlak in the early 1980s, has received considerable attention in the computational intelligence literature since its development and has been used to develop clustering algorithms that handle uncertainty. The main advantage of RST-based clustering methods compared to fuzzy clustering is that they do not require any domain expertise to assign membership values.

In [39], the authors proposed the information-theoretic dependency roughness (ITDR), which takes into account the information-theoretic dependency degree of the attributes of categorical-valued information systems. In [26], the authors proposed the Total Mean Distribution Precision (TMDP) to select the partitioning attribute based on probabilistic RST. Using this technique and the concept of granularity, a new hierarchical clustering algorithm for categorical data, called Maximum Total Mean Distribution Precision (MTMDP), was developed. The MTMDP searches for the best clustering attribute among the set of available features. It takes into account the mean distribution precision of all attributes and determines the next clustering node by considering the cohesion degree of all nodes, which is a more reasonable approach than previous RST clustering methods [40].

In [41], the authors proposed an algorithm that fuses rough set and fuzzy set theories. The proposed rough-fuzzy clustering method sequentially integrates different measures to enhance the clustering performance. Thus, purely classified, semi-rough and purely rough points are identified. After that, a Random Forest is used in an incremental manner to classify the semi-rough and purely rough points using the purely classified points, yielding better clustering results.

In [42], the authors addressed the issue of outlier detection as an initialization method to select the best centroids when starting the clustering process. The uncertainty regarding the clustering process is addressed by considering a soft computing approach based on rough sets. Accordingly, the modified clustering algorithm incorporates the lower and upper approximation properties of rough sets.

Most Rough Set based clustering approaches rely on one of two techniques: (i) introducing a decision attribute based on which the dataset will be divided to partition the objects [39, 40, 43, 44] or (ii) evaluating the lower and upper approximations and the approximation quality of a dataset [42, 45].

The selection of the most appropriate centroids when initializing the clustering process has been considered in many studies since it may heavily impact the final results of the partitioning procedure. However, this issue was only considered in the algorithm's first step and not in the consecutive iterations of the process. Even when updating the modes in each iteration, multiple candidate modes can appear, and it is crucial to identify the most appropriate one instead of automatically resorting to random selection.

3 The DRk-M paradigm

3.1 The clustering model of the DRk-M

Let IS = (U,A,V,f) be a categorical information system, and P a subset of the descriptive attributes A of the universe U (P \(\subseteq\) A). The objective of the clustering is to find the set of observations OBS = {obs1, obs2, …, obsN} and the set of centroids CEN = {cen1, cen2,…,cenK} that minimize the cost function given in Eq. (1). The constraints given in Eqs. (2) and (3) are also considered. The DRk-M implements the simple matching dissimilarity measure defined in Eqs. (4) and (5) during the assignment step of the observations to their closest clusters [4, 24].

The process of optimization can be described as follows:

  • Step 1. Choose K distinct objects cen1, cen2,...,cenK from the universe U as the initial set of modes (t = 1) CEN(t=1) = {cen1,cen2,...,cenK} \(\in\) UK. Determine W(1) such that \({\text{F}}\)(W(1),CEN(1)) is minimized.

  • Step 2. Determine cen(t+1) such that \({\text{F}}\)(W(t),CEN(t+1)) is minimized. If \({\text{F}}\)(W(t),CEN(t+1)) = \({\text{F}}\)(W(t),CEN(t)), then stop.

  • Step 3. Determine W(t+1) such that \({\text{F}}\)(W(t+1),CEN(t+1)) is minimized. If \({\text{F}}\)(W(t+1),CEN(t+1)) = \({\text{F}}\)(W(t),CEN(t+1)), then stop; otherwise set t = t + 1 and go to step 2.

The assignment of observations to clusters is represented by the matrix \(\bar{W}\)= [\(\bar{\omega }\)li] defined according to Theorem 1.

Theorem 1

Considering a fixed set of initial centroids \(\bar{Z}\)={cen1, cen2, …, cenh, …,cenk}, the optimization problem defined in Eq. (1) is minimized by defining the matrix \(\bar{W}\)= [\(\bar{\omega }\) li] according to the following equation:

$${\bar{\omega }}_{li}=\left\{\begin{array}{ll}1, & \text{if } D\left({\bar{cen}}_{l},{obs}_{i}\right)\le D\left({\bar{cen}}_{h},{obs}_{i}\right)\ \text{for } 1\le h\le K\\ 0, & \text{otherwise}\end{array}\right.$$

D stands for the simple matching dissimilarity measure. The \(\bar{W}\)= [\(\bar{\omega }\)li] matrix is a K × N binary matrix that reports whether an observation is part of a given cluster or not. To better illustrate the notion of this matrix, let's consider the example given in Fig. 2.

Fig. 2
figure 2

W matrix with assigning observations to clusters

In the given example, the dataset is composed of seven observations to be put into three clusters. Once the clustering is terminated, the W matrix is generated as given in Table 1. If an observation obsi belongs to the cluster Cll then ωli = 1, otherwise ωli = 0. The W matrix is used to determine to which cluster each observation belongs.

Table 1 Assignment of observations to clusters
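As a hedged illustration of such a matrix, the snippet below builds W from a hypothetical assignment of seven observations to three clusters (the actual content of Table 1 is not reproduced here):

```python
import numpy as np

# Hypothetical assignment of the seven observations to the three clusters.
# labels[i] gives the cluster index of observation obs_{i+1}.
labels = [0, 0, 1, 2, 1, 0, 2]
K, N = 3, len(labels)

W = np.zeros((K, N), dtype=int)
for i, l in enumerate(labels):
    W[l, i] = 1        # omega_{li} = 1 iff obs_i belongs to cluster Cl_l

print(W)               # each column sums to 1 (Eq. 2), each row sum lies in (0, N) (Eq. 3)
```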

Theorem 2

Let cenl = [cenl1, cenl2,...,\({cen}_{ld}\)] be the mode of the lth (1 \(\le\) l \(\le\) K) cluster and \({V}_{{a}_{j}}\)= \(\left\{{a}_{j}^{(1)},{a}_{j}^{(2)},\dots ,{a}_{j}^{({n}_{j})}\right\}\) the domain of attribute aj, where |\({V}_{{a}_{j}}\)| = nj and \(1 \le j \le d\). For a given object obsi = [obsi1, obsi2,…,\({obs}_{id}\)],

\(F\left(W,Z\right)=\sum_{{l=1}}^{\text{K}}\sum_{{i=1}}^{\text{N}}{\omega}_{\text{li}}{{\text{D}}}_{d}\left({\text{cen}}_{\text{l}},{\text{obs}}_{\text{i}}\right)\) is minimized if and only if \({\text{cen}}_{\text{lj}}={\text{a}}_{\text{j}}^{\text{(r)}}\)


where \({a}_{j}^{(r)}\in {V}_{{a}_{j}}\) satisfies:

$$\left|\left\{{\omega}_{\text{li}}\text{ | }{\text{obs}}_{\text{ij}}={\text{a}}_{\text{j}}^{\text{(t)}},{\omega}_{\text{li}}= \text{1} \right\}\right|\le \left|\left\{{\omega}_{\text{li}}\text{ | }{\text{obs}}_{\text{ij}}={\text{a}}_{\text{j}}^{\text{(r)}},{\omega}_{\text{li}}= \text{1} \right\}\right|$$

where 1 ≤ t, r ≤ nj for 1 ≤ j ≤ d

In other words, according to Theorem 2, the quantity \(\left|\left\{{\omega}_{\text{li}}\text{ | }{\text{obs}}_{\text{ij}}={\text{a}}_{\text{j}}^{\text{(r)}},{\omega}_{\text{li}}= \text{1} \right\}\right|\) should be maximized, i.e., cenlj must be one of the most frequent modalities of attribute aj within the lth cluster.

Proof of theorem 2

For a given matrix W, we have:

$$F\left(W,Z\right)=\sum_{\text{l=1}}^{\text{K}}\sum_{\text{i=1}}^{\text{N}}{\omega}_{\text{li}}{{\text{D}}}_{\text{d}}\left({\text{cen}}_{\text{l}},{\text{obs}}_{\text{i}}\right)=\sum_{\text{l=1}}^{\text{K}}\sum_{\text{i=1}}^{\text{N}}\sum_{j=1}^{d}{\omega}_{li}{\text{D}}_{{a}_{j}}\left({\text{cen}}_{\text{lj}},{\text{obs}}_{\text{ij}}\right)=\sum_{\text{l=1}}^{\text{K}}\sum_{\text{j=1}}^{d}\sum_{i=1}^{N}{\omega}_{li}{\text{D}}_{{a}_{j}}\left({\text{cen}}_{\text{lj}},{\text{obs}}_{\text{ij}}\right)=\sum_{\text{l=1}}^{\text{K}}\sum_{\text{j=1}}^{d}{\gamma }_{lj}$$

where \({\gamma }_{lj}=\sum_{i=1}^{N}{{\omega}_{li}{\text{D}}}_{{a}_{j}}\left({\text{cen}}_{\text{lj}},{\text{obs}}_{\text{ij}}\right)\). Thus minimizing \(F\left(W,Z\right)\) corresponds to minimizing \({\gamma }_{lj}\). Besides,

\({\gamma }_{lj}=\sum_{i=1}^{N}{{\omega}_{li}{\text{D}}}_{{a}_{j}}\left({\text{cen}}_{\text{lj}},{\text{obs}}_{\text{ij}}\right)={n}_{l}-\left|\left\{{\omega}_{\text{li}}\text{ | }{\text{cen}}_{\text{lj}}={\text{obs}}_{\text{ij}},{\omega}_{\text{li}}= \text{1} \right\}\right|\), where \({n}_{l}=\sum_{i=1}^{N}{\omega }_{li}\) is the number of objects assigned to the lth cluster. Thus, minimizing \({\gamma }_{lj}\) corresponds to maximizing \(\left|\left\{{\omega}_{\text{li}}\text{ | }{\text{cen}}_{\text{lj}}={\text{obs}}_{\text{ij}},{\omega}_{\text{li}}= \text{1} \right\}\right|\).

In other terms, in Theorem 2, minimizing the cost function \(F\left(W,Z\right)\) corresponds to minimizing all the inner sums of the quantity \({\gamma }_{lj}\), which are nonnegative and independent. Each inner sum is minimized iff the term \({n}_{l}-\left|\left\{{\omega}_{\text{li}}\text{ | }{\text{cen}}_{\text{lj}}={\text{obs}}_{\text{ij}},{\omega}_{\text{li}}= \text{1} \right\}\right|\) is minimal, which requires maximizing the cardinality of the sets where \({\text{cen}}_{\text{lj}}={\text{obs}}_{\text{ij}}.\)

Theorem 3

The new proposed clustering method with the considered dissimilarity measure converges in a finite number of iterations.

Proof theorem 3

Only a finite number of possible sets of cluster modes CEN = {cen1, cen2, …,cenK} can be defined. We show that each set of modes can occur at most once in the clustering process, which guarantees that the last iteration and the stop criterion of the DRk-M are reached.

If not, there exist two distinct iterations t1 \(\ne\) t2 such that the centroids are equal: \({\text{CEN}}^{\left({\text{t}}_{1}\right)}= {\text{CEN}}^{\left({\text{t}}_{2}\right)}\). According to Theorem 1, the proposed clustering algorithm using the simple matching dissimilarity measure computes the minimizers \({W}^{\left({t}_{1}\right)}\) and \({W}^{\left({t}_{2}\right)}\) for \(CEN={CEN}^{\left({t}_{1}\right)}\) and \(CEN={CEN}^{\left({t}_{2}\right)}\) at these two iterations, respectively, which implies that \(F\left({W}^{\left({t}_{1}\right)},{CEN}^{\left({t}_{1}\right)}\right)=F\left({W}^{\left({t}_{1}\right)},{CEN}^{\left({t}_{2}\right)}\right)=F\left({W}^{\left({t}_{2}\right)},{CEN}^{\left({t}_{2}\right)}\right)\). On the other hand, the sequence \(F\left({W}^{\left(t\right)},{CEN}^{\left(t\right)}\right)\) generated by the DRk-M using the simple matching dissimilarity measure is strictly decreasing, which contradicts the previous equality.

3.2 The mode of a categorical cluster

In order to illustrate the notion of the mode, an example is provided in Table 2. Let’s consider a cluster composed of seven observations {obs1, obs2, obs3, obs4, obs5, obs6, obs7} and described by four categorical attributes {a1, a2, a3, a4}:

Table 2 Illustrative example of the generation of the candidate modes in a categorical dataset

According to Table 2, the modes corresponding to that cluster are given as follows:

  • The domain DOM(ai) of each attribute, i.e. the set of modalities it takes, is given as follows: DOM(a1) = {a, c, d}, DOM(a2) = {e, f, g, h}, DOM(a3) = {l, n, k, m} and DOM(a4) = {x, z, y}

  • The most frequent value of a1 could be a or d with three occurrences.

  • The most frequent value of a2 could be e, f or g with two occurrences.

  • The most frequent value of a3 is l with four occurrences.

  • The most frequent value of a4 is y with three occurrences.

Thus, for the cluster given in Table 2, six candidate modes could be defined and are provided as follows:

$$Q_{1}=\left[a,e,l,y\right],\ Q_{2}=\left[d,f,l,y\right],\ Q_{3}=\left[a,g,l,y\right],\ Q_{4}=\left[d,e,l,y\right],\ Q_{5}=\left[a,f,l,y\right],\ Q_{6}=\left[d,g,l,y\right]$$

In the original version of the k-modes, the mode is selected randomly and, to our knowledge, no previously defined method was provided to identify the most appropriate mode which is considered as a restriction that limits the performance of the clustering method.
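The candidate modes can be enumerated as the Cartesian product of the most frequent modalities of each attribute. The sketch below illustrates this on a stand-in cluster whose per-attribute frequencies match those reported above for Table 2 (the actual observations of Table 2 are not reproduced):

```python
from collections import Counter
from itertools import product

# Stand-in cluster consistent with the frequencies reported for Table 2:
# a1: 'a' and 'd' occur 3 times each; a2: 'e', 'f', 'g' occur twice each;
# a3: 'l' occurs 4 times; a4: 'y' occurs 3 times.
cluster = [('a', 'e', 'l', 'y'), ('a', 'f', 'l', 'x'), ('a', 'g', 'l', 'y'),
           ('d', 'e', 'n', 'z'), ('d', 'f', 'k', 'y'), ('d', 'g', 'm', 'x'),
           ('c', 'h', 'l', 'z')]

def candidate_modes(cluster):
    """Enumerate every vector built from the most frequent modalities per attribute."""
    tops = []
    for j in range(len(cluster[0])):
        counts = Counter(obs[j] for obs in cluster)
        best = max(counts.values())
        tops.append([v for v, c in counts.items() if c == best])
    return [list(q) for q in product(*tops)]

print(candidate_modes(cluster))   # six candidate modes Q1..Q6 for this cluster
```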

3.3 Uncertainty using the Rough Set Theory (RST)

The RST can deal with imperfect, vague and imprecise data based on the notion of indiscernibility between observations. In this study, it is used to identify the most suitable modes among a list of candidate ones. With any rough set, a pair of precise sets, called the lower and upper approximations, is associated. The lower approximation consists of all objects which surely belong to the set and the upper approximation contains all objects which possibly belong to the set [40, 46].

Definition 1

The indiscernibility relation.

Let IS = (U,A,V,f) be a categorical information system and B \(\subseteq\) A a subset of attributes, a binary relation IND(B), called indiscernibility relation between two observations obsi and obsj of U is defined as:

IND(B) = {(obsi,obsj) \(\in\) U × U | \(\forall\) a \(\in\) B, f(obsi,a) = f(obsj,a)}.

Thus, it is possible to consider that for every subset of attributes B selected from A, an indiscernibility relation can be generated. In other words, two observations are indiscernible in the context of a set of attributes if they have the same values for those attributes.
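As an illustrative sketch, the equivalence classes (blocks) induced by IND(B) can be computed by grouping objects on their attribute-value signatures; the data layout assumed here (a dictionary f[x][a]) is ours, not from the paper.

```python
from collections import defaultdict

def equivalence_classes(U, B, f):
    """Group the objects of U into blocks of IND(B).

    U : iterable of object identifiers
    B : list of attribute names
    f : f[x][a] gives the value of attribute a for object x
    """
    blocks = defaultdict(set)
    for x in U:
        signature = tuple(f[x][a] for a in B)   # equal signatures => indiscernible objects
        blocks[signature].add(x)
    return list(blocks.values())
```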

Definition 2

The lower approximation.

The lower approximation of a subset X \(\subseteq\) U with respect to B \(\subseteq\) A, denoted \({B}_{*}(X)\) or \(\underline{B}(X)\), is defined as follows:

$${B}_{*}\left(X\right)=\bigcup_{x\in U}\left\{x:{\left[x\right]}_{B}\subseteq X\right\}$$

Definition 3

The upper approximation.

The upper approximation of a subset X \(\subseteq\) U with respect to B \(\subseteq\) A, denoted \({B}^{*}(X)\) or \(\overline{B}(X)\), is defined as follows:

$${B}^{*}\left(X\right)=\bigcup_{x\in U}\left\{x:{\left[x\right]}_{B}\cap X\ne \mathrm{\O }\right\}$$

In order to better understand the notion of indiscernibility in a dataset, an example related to the Covid-19 pandemic is provided in Table 3. Covid-19 is a strain of coronavirus that first broke out in Wuhan, China in December 2019 and has since become a global pandemic. The dataset uses seven attributes as descriptive features and symptoms depicting Covid-19. It is composed of five patients and the last column corresponds to the label of whether or not the patient is affected by Covid-19.

Table 3 Classification dataset for the Covid infection

In the example provided in Table 3, the data were collected for various patients according to symptoms that can depict either a Not Covid or a Covid-19 illness. According to the table, two classes can be identified: the first represents patients identified as infected by the Covid virus = {patient01, patient03, patient05} and the second represents patients identified as not infected by the Covid virus = {patient02, patient04}. Normally, two patients having the same profile would be expected to be classified into the same class, which is not the case here since patient01 and patient02 do not follow this rule. According to Definition 1, {patient01, patient02} are said to be indiscernible (similar) in view of the available set of attributes. According to Definitions 2 and 3, the lower and upper approximations can be defined as follows:

  • The lower approximation of the concept {Covid} = {patient03, patient05}

  • The upper approximation of the concept {Covid} = {patient01, patient02, patient03, patient05}

  • The lower approximation of the concept {not Covid} = {patient04}

  • The upper approximation of the concept {not Covid} = {patient01, patient02, patient04}
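These four sets can be reproduced with a short sketch, using stand-in equivalence classes consistent with the description above (patient01 and patient02 indiscernible, all other patients distinct):

```python
def approximations(blocks, X):
    """Lower and upper approximations of a set X given the IND(B) blocks."""
    lower = set().union(*[b for b in blocks if b <= X])   # blocks fully inside X
    upper = set().union(*[b for b in blocks if b & X])    # blocks intersecting X
    return lower, upper

# Stand-in equivalence classes consistent with the description of Table 3.
blocks = [{'patient01', 'patient02'}, {'patient03'}, {'patient04'}, {'patient05'}]
covid = {'patient01', 'patient03', 'patient05'}

print(approximations(blocks, covid))
# lower = {'patient03', 'patient05'}
# upper = {'patient01', 'patient02', 'patient03', 'patient05'}
```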

3.4 The rough modes

The DRk-M can be seen as a generalization of Huang's definition of the mode. The approach is based on defining the list of all candidate modes, then generating a sublist of the most plausible modes, called rough modes.

Definition 4

Let OBS = {obs1, obs2, …, obsN} be a set of categorical objects composed of N observations described by d categorical attributes A1, A2, …, Ad. A rough mode of OBS = {obs1, obs2, …, obsN} is a set of vectors Q = [q1, q2, …, qd] that minimize the quantity:

$$D\left(OBS,Q\right)=\sum_{i=1}^{N}d\left({obs}_{i},Q\right)$$
(12)

d is the simple matching dissimilarity measure.

The rough mode is the closest element to all the observations of the cluster. Minimizing the previous quantity is a key issue to determine the rough modes.

Theorem 4

The function D(OBS,Q) is minimized if and only if:

$${f}_{r}\left({A}_{j}={q}_{j}|{Cl}_{j}\right)\ge {f}_{r}\left({A}_{j}={c}_{kj}|{Cl}_{j}\right)$$

for qj ≠ ckj for all j = 1, …, d where fr(Aj = ckj|\({Cl}_{j}\))= \(\frac{{n}_{{c}_{kj}}}{N}\) corresponds to the relative frequency of the kth category ckj in attribute Aj and \({n}_{{c}_{kj}}\) is the number of objects having the kth category ckj in attribute Aj.

In other words, the D(OBS,Q) quantity is minimized by considering the most frequent modalities in each attribute to compose the rough mode.

Proof of Theorem 4

let fr(Aj = ckj|\({Cl}_{j}\))= \(\frac{{n}_{{c}_{kj}}}{N}\) be the relative frequency of the kth category ckj in attribute Aj, where N is the total number of observations of the dataset and \({n}_{{c}_{kj}}\) the number of objects having the category ckj.

we have

$$\sum_{i=1}^{N}d\left({obs}_{i},Q\right)=\sum_{i=1}^{N}\sum_{j=1}^{d}\delta \left({x}_{ij},{q}_{j}\right)= \sum_{j=1}^{d}\left(\sum_{i=1}^{N}\delta \left({x}_{ij},{q}_{j}\right)\right)=\sum_{j=1}^{d}\left(N-{n}_{{q}_{j}}\right)$$

where \(\delta \left({x}_{ij},{q}_{j}\right)\) is the simple matching dissimilarity measure and thus takes the value 0 or 1, so each inner sum is at most N. \({n}_{{q}_{j}}\) represents the number of observations for which xij = qj. Thus, minimizing \(\sum_{i=1}^{N}d\left({obs}_{i},Q\right)\) corresponds to minimizing

$$\sum_{j=1}^{d}\left(N-{n}_{{q}_{j}}\right)=\sum_{j=1}^{d}N\left(1-\frac{{n}_{{q}_{j}}}{N}\right)=\sum_{j=1}^{d}N\left(1-{f}_{r}\left({A}_{j}={q}_{j}|{Cl}_{j}\right)\right)$$

Because \(N\left(1-{f}_{r}\left({A}_{j}={q}_{j}|{Cl}_{j}\right)\right)\ge 0\) for \(1\le j\le d\), \(\sum_{i=1}^{N}d\left({obs}_{i},Q\right)\) is minimized if and only if every term \(N\left(1-{f}_{r}\left({A}_{j}={q}_{j}|{Cl}_{j}\right)\right)\) is minimal. Thus, \({f}_{r}\left({A}_{j}={q}_{j}|{Cl}_{j}\right)\) must be maximal.

Theorem 4 is used to compute the rough modes that correspond to the list of all possible modes within the cluster.

Definition 5

The rough upper and lower approximations

Let S = (U, A, V, f) be an information system, let B be any subset of attributes A and let X be any subset of observations U. The B rough-upper approximation of X, denoted by \(\bar{{B}_{R}\left(X\right)}\) and B rough lower approximation \(\underset{\_}{{B}_{R}\left(X\right)}\), are defined respectively as follows:

$$\bar{{B}_{R}\left(X\right)}= \bigcup_{x\in Q}\left\{B\left(x\right):B(x)\cap X\ne \mathrm{\O }\right\}\quad \mathrm{and}\quad \underset{\_}{{B}_{R}\left(X\right)}=\bigcup_{x\in Q}\left\{B\left(x\right):B(x)\subseteq X\right\}$$

In all cases, the rough mode is a vector that contains the most frequent modalities in each attribute of the cluster observations. It may be an element of the cluster or a synthetic one generated during the process.

To select the best mode from the set of potential centroids, we consider not only the distance between objects, but also the average density of the modes. If the distance between the object and the already existing cluster centers is the only factor considered, an outlier may be taken as a new cluster center. Similarly, if only the density of the object is taken into account, it is quite possible that many cluster centers end up located in the surroundings of a single center. To avoid these problems, the distance between objects and the density of the object are combined to measure the possibility of an object being a cluster center.

To better illustrate the notion of rough mode, Fig. 3 is given.

Fig. 3 The DRk-M clustering process

Figure 3 shows the clustering process of the DRk-M. The DRk-M proposes to investigate the step where the modes are updated in each iteration, which corresponds to the third step of the process. The simple matching dissimilarity measure is used as a distance metric and no centroid initialization method is incorporated in the first step. In step 3, the DRk-M considers that more than one candidate mode can be identified; this number can vary from one cluster to another. The mode with the highest density value has a high probability of being selected as the centroid of that cluster and is thus put in the upper approximation.

3.5 The proposed algorithm

The algorithm is described as follows:

figure a: the DRk-M algorithm

In the first step of the DRk-M, K initial observations are randomly selected as cluster modes. This initial random selection is the same approach used in the k-modes. However, many previous works proposed initialization methods to select the most appropriate initial modes [6, 8, 35,36,37], and such initialization methods could also be integrated into the DRk-M in future research. In the second step, the simple matching dissimilarity measure is used to assign the observations to their closest clusters; the focus of this step is to minimize the cost function defined in Eq. (1). In the third step, all possible candidate modes are computed for each obtained cluster, considering the modality frequency of each attribute, and placed either in the lower or the upper approximation.
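A minimal sketch of these three steps is given below. The density score used here, the average number of attribute matches between a candidate mode and the cluster members, is our reading of the density criterion and not the authors' exact formula; initialization and stopping details are also simplified.

```python
import random
from collections import Counter
from itertools import product

def simple_matching(x, y):
    return sum(a != b for a, b in zip(x, y))

def candidate_modes(cluster):
    tops = []
    for j in range(len(cluster[0])):
        counts = Counter(obs[j] for obs in cluster)
        best = max(counts.values())
        tops.append([v for v, c in counts.items() if c == best])
    return [tuple(q) for q in product(*tops)]

def density(mode, cluster):
    # Illustrative density score: average number of attribute matches between
    # the candidate mode and the cluster members (an assumption, not the
    # authors' exact definition).
    d = len(mode)
    return sum(d - simple_matching(mode, obs) for obs in cluster) / len(cluster)

def drkm(X, K, max_iter=100):
    """Sketch of the DRk-M loop: assign observations, enumerate candidate
    (rough) modes, keep the densest candidate as the new mode of each cluster."""
    modes = random.sample(X, K)                      # Step 1: random initial modes
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(K), key=lambda l: simple_matching(obs, modes[l]))
                      for obs in X]                  # Step 2: assignment
        if new_labels == labels:
            break
        labels = new_labels
        for l in range(K):                           # Step 3: rough mode selection
            members = [X[i] for i in range(len(X)) if labels[i] == l]
            if members:
                modes[l] = max(candidate_modes(members),
                               key=lambda q: density(q, members))
    return labels, modes
```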

3.6 Evaluating the complexity of the DRk-M

The DRk-M is scalable compared to the standard k-modes since it does not change the clustering paradigm but only introduces a new approximation step towards identifying the most adapted centroid in each cluster. In order to assess the scalability of the DRk-M, it is necessary to analyze the computational cost of all the steps involved in the clustering process. As in the standard k-modes, the N observations are assigned to K clusters over t iterations, and thus the complexity of this step is O(NKdt). Then, in each iteration, the computational complexity required to compute the modes is O(NKtd\({n}_{{c}_{kj}}\)), where \({n}_{{c}_{kj}}\) is the number of objects having the category ckj. Finally, a time complexity of O(|Ck|pKtd), where |Ck| is the cardinality of the considered cluster, is required to assign the identified rough modes either to the upper or lower approximation of the rough sets. As a conclusion, the overall complexity of the DRk-M is O(NKdt) + O(NKtd\({n}_{{c}_{kj}}\)) + O(|Ck|pKtd). Considering that K, t, d, |Ck|, p and \({n}_{{c}_{kj}}\) are all much smaller than N and can be treated as constants, the overall time complexity of the algorithm is O(N).

4 Experimentations

In this section, it is proposed to evaluate the clustering performance and scalability of the DRk-M. The algorithm is compared to many state-of-the-art algorithms including Huang's k-modes (1998) [4], Ng's k-modes (2007) [50], Cao's dissimilarity (2012) [11], the improved Huang's k-modes (2014) [7], the Weighted k-modes (2014) [7], the Improved Weighted k-modes (2014) [7], the Improved Ng's k-modes (2014) [7], Bai's fuzzy k-modes (2013) [10], Khan's initialization method (2013) [36] and the Fuzzy k-modes (2013) [10]. Various experimental datasets are used with several testing configurations in terms of the number of observations N, clusters K and dimensions d. The efficiency of the DRk-M is validated using several well-known evaluation metrics. The algorithms were coded in Java and the experiments were executed on an Intel Core i7 machine at 3.5 GHz with 16 GB of memory.

4.1 Evaluation metrics

4.1.1 The accuracy

The accuracy quantifies the proportion of correctly classified cases. To compute it, each cluster is assigned to the class that is most frequent in the cluster. The accuracy of this assignment is then measured by counting the number of correctly assigned observations and dividing it by the total number of observations N. The accuracy is computed according to Eq. 13:

$$AC=purity\left(\mathcal{C},{\mathbb{C}}\right)=\frac{1}{N}\sum_{k}\underset{j}{\mathit{max}}\left|{\omega }_{k}\bigcap {c}_{j}\right|$$
(13)

\(\mathcal{C}=\left\{{\omega }_{1},{\omega }_{2},\dots ,{\omega }_{K}\right\}\) is the set of clusters and \({\mathbb{C}}=\left\{{c}_{1},{c}_{2},\dots ,{c}_{J}\right\}\) is the set of classes identified from the ground truth labels. The accuracy ranges from 0 to 1, and a higher value indicates a better clustering.

4.1.2 The entropy

The entropy measures the disorder in a distribution of objects. Its smallest value is 0, and higher values indicate a worse clustering. This metric, denoted H, is defined in Eq. 14:

$$H\left(\mathcal{C}\right)=-\sum_{k}P\left({\omega }_{k}\right)logP\left({\omega }_{k}\right)=-\sum_{k}\frac{\left|{\omega }_{k}\right|}{N}log\left(\frac{\left|{\omega }_{k}\right|}{N}\right)$$
(14)

\(P\left({\omega }_{k}\right)\) and \(P\left({c}_{j}\right)\) are the probabilities of an observation being in cluster \({\omega }_{k}\) and class \({c}_{j}\) respectively.

4.1.3 Normalized mutual information

The mutual information (MI) of two random variables is a measure of the mutual dependence between them. The Normalized Mutual Information (NMI) metric ranges from 0 to 1, and higher values indicate a better clustering. The NMI is defined according to Eq. 15:

$$NMI\left(\mathcal{C},{\mathbb{C}}\right)=\frac{2\times I\left(\mathcal{C},{\mathbb{C}}\right)}{H\left(\mathcal{C}\right)+H\left({\mathbb{C}}\right)}$$
(15)

I is the mutual information defined in Eq. 16:

$$I\left(\mathcal{C},{\mathbb{C}}\right)=\sum_{k}\sum_{j}P\left({\omega }_{k}\bigcap {c}_{j}\right)\times log\left(\frac{P\left({\omega }_{k}\bigcap {c}_{j}\right)}{P\left({\omega }_{k}\right)P\left({c}_{j}\right)}\right)=\sum_{k}\sum_{j}\frac{\left|{\omega }_{k}\bigcap {c}_{j}\right|}{N}\times log\left(\frac{N\left|{\omega }_{k}\bigcap {c}_{j}\right|}{\left|{c}_{j}\right|\left|{\omega }_{k}\right|}\right)$$
(16)

\(P\left({\omega }_{k}\bigcap {c}_{j}\right)\) corresponds to the probability of an observation being in the intersection of \({\omega }_{k}\) and \({c}_{j}\).
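A small sketch implementing Eqs. (14)-(16) directly from two label lists (cluster assignments and ground-truth classes); the helper names are ours:

```python
from math import log
from collections import Counter

def entropy(labels):
    """Eq. (14): entropy of a partition given as a list of labels."""
    N = len(labels)
    return -sum((c / N) * log(c / N) for c in Counter(labels).values())

def nmi(clusters, classes):
    """Eqs. (15)-(16): normalized mutual information between two labelings."""
    N = len(clusters)
    joint = Counter(zip(clusters, classes))
    ck, cj = Counter(clusters), Counter(classes)
    mi = sum((n / N) * log(N * n / (ck[k] * cj[j])) for (k, j), n in joint.items())
    return 2 * mi / (entropy(clusters) + entropy(classes))

# usage on a toy labeling
print(nmi([0, 0, 1, 1], ['A', 'A', 'B', 'B']))   # -> 1.0 (perfect agreement)
```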

Three other evaluation metrics will also be used in this study which are precision, recall and the F1-score. These metrics can be directly computed from the confusion matrix. In order to better understand how these metrics are computed, let’s consider the example given in Table 4:

Table 4 Confusion matrix for two classes

The considered example concerns the classification of a dataset composed of two classes: Cancer = Yes and Cancer = No. The goal is to predict whether a patient has cancer or not for a total of 100 patients. The confusion matrix reports the predicted (returned by the model) and the actual (real) results for each of the two classes. Based on this matrix, it is possible to identify the observations that were correctly classified and those that were not.

The confusion matrix obtained can be interpreted as follows:

  • 25 patients that have cancer were correctly classified by the system as True Positives (TP), i.e. they represent patients who truly have cancer and were predicted as having cancer by the system.

  • 65 patients that do not have cancer were also correctly classified by the system as True Negatives (TN).

  • Only 10 (5 + 5) patients were wrongly classified by the system where either the patients have cancer and were spotted as not having cancer or vice-versa. These two groups correspond respectively to False Negatives (FN) and False Positives (FP)

4.1.4 The precision

The precision is a measure of the correctly classified positive cases among all the predicted positive cases. Thus, it is useful when the cost of False Positives is high. The precision can be directly computed from the confusion matrix as follows:

$$\text{PR} = \frac{\text{True Positive (TP)}}{\text{True Positive }\left({\text{TP}}\right)\text{+False Positive (FP)}}$$

4.1.5 The recall

The recall is a measure of the correctly identified positive cases from all the actual positive cases. It is important when the cost of False Negatives is high. This metric is defined as follows:

$$\text{RE} = \frac{\text{True Positive (TP)}}{\text{True Positive} \left({\text{TP}}\right)\text{+False Negative (FN)}}$$

One other way to compute the accuracy using the confusion metric is to apply the formula:

$$\text{AC} = \frac{\text{True Positive }\left({\text{TP}}\right)\text{+True Negative (TN)}}{\text{True Positive }\left({\text{TP}}\right)+ \text{False Positive } \left({\text{FP}}\right)+ \text{False Negative } \left({\text{FN}}\right)\text{+True Negative (TN)}}$$

4.1.6 The F1-score

The F1-score is the harmonic mean of the precision and recall and gives a better measure of the incorrectly classified cases than the accuracy metric.

$$F1-score=2\times \frac{PR\times RE}{PR+RE}$$

For example, in the considered cancer dataset, according to the values presented in the confusion matrix, the values of the precision, recall, accuracy and F1-score can be given as follows:

$$\text{PR} = \frac{25}{\text{25+5}}=\frac{25}{{30}}\text{=0.83}$$
$$\text{RE} = \frac{25}{\text{25+5}}\text{=0.83}$$
$$\text{AC} = \frac{25+65}{\text{25+5+5+65}}=\frac{90}{{100}}\text{=0.9}$$
$$\text{F1-score} =2 \times \frac{0.83 \times 0.83} {\text{0.83+0.83}}{=0.83}$$
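These values can be checked with a few lines applied to the confusion matrix of the example:

```python
TP, FP, FN, TN = 25, 5, 5, 65          # confusion matrix of the cancer example

precision = TP / (TP + FP)                                   # ~0.83
recall    = TP / (TP + FN)                                   # ~0.83
accuracy  = (TP + TN) / (TP + FP + FN + TN)                  # 0.9
f1        = 2 * precision * recall / (precision + recall)    # ~0.83

print(round(precision, 2), round(recall, 2), round(accuracy, 2), round(f1, 2))
```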

4.1.7 The Silhouette score

The Silhouette score is a statistical interpretation and validation of clustering results that provides a measure of how well a data point is classified when it is assigned to a cluster. Thus, this metric can potentially be used as a quality measure to validate the clustering results according to the number of clusters in our case. The silhouette ranges from − 1 to + 1, where a high value indicates that the object is well matched to its cluster and poorly matched to neighboring clusters.

Considering a data observation obsi \(\in\) Cj that was classified in cluster Cj, it is possible to measure the mean distance between obsi and all the other data points in the same cluster as follows:

$${\text{a}}\left({\text{obs}}_{\text{i}}\right)=\frac{1}{\left|{\text{C}}_{\text{i}}\right|-1}\sum_{{\text{j}}\in {\text{C}}_{\text{i}},\ j\ne i}{\text{d}}\left({\text{obs}}_{\text{i}},{\text{obs}}_{\text{j}}\right)$$

where d is the distance used such as the Euclidean distance. This metric can also be interpreted to qualify how well obsi is assigned to a cluster: the smaller the value of a, the better the assignment is.

It is also possible to define the mean dissimilarity of obsi to another cluster Ck as the mean of the distance from obsi to all the points in Ck (where Ck \(\ne\) Ci). For each data point obsi, the mean dissimilarity to the closest other cluster is computed as follows:

$${\text{b}}\left({\text{obs}}_{\text{i}}\right)=\underset{k\ne i}{\text{min}}\frac{1}{\left|{\text{C}}_{\text{k}}\right|}\sum_{{\text{j}}\in {\text{C}}_{\text{k}}}{\text{d}}\left({\text{obs}}_{\text{i}},{\text{obs}}_{\text{j}}\right)$$

This distance should be the smallest mean distance of obsi to all the points in any other cluster of which obsi is not a member. The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of obsi because it is the next best fit cluster for point obsi. The silhouette value of one data point obsi is then given as follows:

$${\text{s}}\left({\text{obs}}_{\text{i}}\right)=\left\{\begin{array}{ll}\frac{{\text{b}}\left({\text{obs}}_{\text{i}}\right)-{\text{a}}\left({\text{obs}}_{\text{i}}\right)}{{\text{max}}\left\{{\text{b}}\left({\text{obs}}_{\text{i}}\right),{\text{a}}\left({\text{obs}}_{\text{i}}\right)\right\}}, & {\text{if}}\ \left|{\text{C}}_{\text{i}}\right|>1\\ 0, & {\text{if}}\ \left|{\text{C}}_{\text{i}}\right|=1\end{array}\right.$$

For \(s\left({obs}_{i}\right)\) to be close to 1, it is required that a(obsi) << b(obsi). As a(obsi) is a measure of how dissimilar obsi is to its own cluster, a small value means it is well matched. Furthermore, a large b(obsi) implies that obsi is badly matched to its neighboring clusters. Thus, if \(s\left({obs}_{i}\right)\) is close to one, the data point is appropriately clustered. If \(s\left({obs}_{i}\right)\) is close to negative one, then by the same logic it would be better to assign obsi to a neighboring cluster. An \(s\left({obs}_{i}\right)\) near zero means that the data point is on the border of two natural clusters.
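A sketch of the silhouette computation for an arbitrary dissimilarity d (in the categorical setting, the simple matching dissimilarity can be plugged in); the function name and data layout are ours:

```python
def silhouette(X, labels, d):
    """Mean silhouette score of a partition; d(x, y) is any dissimilarity."""
    idx = {l: [i for i, lab in enumerate(labels) if lab == l] for l in set(labels)}
    scores = []
    for i, l in enumerate(labels):
        own = idx[l]
        if len(own) == 1:
            scores.append(0.0)                       # singleton cluster: s(obs_i) = 0
            continue
        a = sum(d(X[i], X[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(d(X[i], X[j]) for j in other) / len(other)
                for k, other in idx.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```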

4.2 Experiments using the UCI datasets

In order to assess the efficiency of the DRk-M, the algorithm was tested on five datasets extracted from the UCI Machine Learning Repository. These datasets have been widely used in the literature to evaluate state-of-the-art methods and are described as follows:

  • Mushroom data: The data set includes descriptions of hypothetical samples corresponding to 22 species of gilled mushrooms in the Agaricus and Lepiota Family. It consists of 8124 objects and 23 categorical attributes. Each object belongs to one of the two classes, edible (4208 objects) and poisonous (3916 objects).

  • Breast cancer data: The data set was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. It consists of 699 data objects and 9 categorical attributes. It has two clusters: Benign (458 data objects) and Malignant (241 data objects).

  • Credit approval data: The data set contains data from credit card organization, where customers are divided into two classes. It is a mixed data set with eight categorical and six numeric features. It contains 690 data objects belonging to two classes: negative (383 data objects) and positive (307 data objects). In the test, we only consider the categorical attributes on the data set.

  • Zoo data: Zoo data set contains 101 elements described by 17 Boolean-valued attributes classified into seven classes.

  • Lung cancer data: The data set was used by Hong and Young to illustrate the power of the optimal discriminant plane even in ill-posed settings. This data has 32 instances described by 56 categorical attributes. It contains three classes.

In the experiments, the DRk-M is first tested using the Mushroom dataset in terms of its dimensionality. Three evaluation metrics are used: the accuracy, the entropy and the NMI. The experiments are conducted by varying the number of dimensions d (4 → 24) and the obtained results are given in Fig. 4.

Fig. 4
figure 4

Experiments for the Mushroom dataset with various dimensions (N = 8124, K = 2)

According to the results, the DRk-M provided better results in 86% of the total cases, which makes it more accurate than the k-modes. For the accuracy, the values range from a = 0.5437 for d = 9 to a = 0.7368 for d = 14. In these cases, the DRk-M provided better results in 16 cases (80% of the total cases). The lines representing the accuracy are given at the bottom of Fig. 4. For the NMI, represented by the lines in the middle of Fig. 4, the DRk-M provided values ranging from NMI = 1.1288 for d = 7 to NMI = 2.4387 for d = 12. In these cases, the DRk-M provided better results in terms of the NMI in 14 cases (70% of the total cases). The last metric used is the entropy; the values computed for the DRk-M range from e = 2.3259 for d = 6 to e = 4.2498 for d = 15. In terms of the entropy, the DRk-M provided better results than the k-modes in 16 cases (80% of the total cases). These experiments are interesting since they permitted studying the effect of various dimensionalities on the performance of the DRk-M and comparing it with the k-modes. It is important to mention that in this case the number of clusters is K = 2, which is in concordance with the ground truth of the Mushroom dataset since this dataset is composed of two classes as mentioned in the dataset description. Different results would have been obtained if another value of K had been selected.

The breast cancer dataset was also used to assess the efficiency of the DRk-M. In the experiments, the DRk-M and k-modes were compared for various numbers of clusters K (6 → 10). The corresponding results are provided in Table 5 where the accuracy, entropy and NMI are used as evaluation metrics.

Table 5 Experimental results computed for the Breast cancer dataset for K (6 → 10), N = 644, d = 4

For the breast cancer dataset, the DRk-M provided better clustering results in 12 of the 15 cases considered (five values of K, three metrics per value).

In this second part of the experiments, more UCI datasets are used to compare the DRk-M with several state-of-the-art enhancements of the k-modes. Since most of these methods suffer from stability issues, 100 runs of the DRk-M were carried out with various initial modes; this technique was also used in many previous studies to ensure stable results. The comparison results of the DRk-M with each of the methods are given in Tables 6, 7, 8, 9, 10, where each value is the average over 100 experiments. In the tables, a comparison of the DRk-M with some fuzzy k-modes methods as reported in [10] is also provided. In this case, the fuzziness parameter was fixed to α = 1.1: according to the explanation provided in [10], several values of the fuzziness parameter were tested and α = 1.1 yielded the lowest value of the cost function to be minimized, i.e. the best results. Moreover, in all the experiments, the number of clusters is set equal to the number of classes of each data set in order to respect the ground truth conditions. Two metrics are used: the accuracy and the F1-score.

Table 6 The accuracy (AC) and F1-score computed for 100 runs for the Mushroom dataset
Table 7 The accuracy (AC) and F1-score computed for 100 runs for the lung cancer dataset
Table 8 The accuracy (AC) and F1-score computed for 100 runs for the breast cancer dataset
Table 9 The accuracy (AC) and F1-score computed for 100 runs for the credit approval dataset
Table 10 The accuracy (AC) and F1-score computed for 100 runs for the soybean dataset

According to Tables 6, 7, 8, 9, 10, the DRk-M provided better clustering results in terms of the accuracy and F1-score for all the considered datasets, except for the Mushroom dataset when testing the algorithm with Khan's initialization method [36]. The state-of-the-art methods include many variants of the k-modes, obtained either by improving the simple matching dissimilarity measure, by using fuzzy methods or by implementing an initialization method to select the most accurate initial centroids. In all these cases, the DRk-M with the proposed rough mode selection provided more accurate results. These results confirm the advantage of the DRk-M for categorical clustering and of implementing the RST to update the modes during the segmentation process.
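
The 100-run averaging protocol used above can be sketched as follows. The open-source kmodes Python package is assumed here only as a stand-in clustering backend (the DRk-M implementation itself is not reproduced), and clustering_accuracy refers to the helper sketched earlier in this section.

```python
# Sketch of the 100-run averaging protocol; the open-source `kmodes` package
# (KModes) is assumed as a stand-in clustering backend, and clustering_accuracy
# is the helper sketched earlier in this section.
import numpy as np
from kmodes.kmodes import KModes
from sklearn.metrics import f1_score

def averaged_scores(X, y_true, n_clusters, n_runs=100):
    y_true = np.asarray(y_true)
    accs, f1s = [], []
    for seed in range(n_runs):
        km = KModes(n_clusters=n_clusters, init="Huang", n_init=1,
                    random_state=seed)
        cluster_ids = km.fit_predict(X)
        # Map each cluster to its majority true label before computing the F1.
        y_pred = np.empty_like(y_true)
        for c in np.unique(cluster_ids):
            mask = cluster_ids == c
            values, counts = np.unique(y_true[mask], return_counts=True)
            y_pred[mask] = values[np.argmax(counts)]
        accs.append(clustering_accuracy(y_true, cluster_ids))
        f1s.append(f1_score(y_true, y_pred, average="macro"))
    return float(np.mean(accs)), float(np.mean(f1s))
```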

4.3 Experiments using the Twitter datasets

In this section, two datasets collected from Twitter using Python were considered. The targeted Twitter accounts correspond to profiles related to terrorist groups, and the datasets are described as follows:

  • Dataset 1: this dataset contains 1803 instances described by 13 categorical attributes [“month of the tweet”, “tweet_id”, “source”, “device”, “in_reply_to_status_id”, “in_reply_to_user_id”, “in_reply_to_screen_name”, “user_tweet_id”, “user_tweet_name”, “user_tweet_screen_name”, “user_tweet_location” and “language”]. The collected tweets correspond to specific keywords related to cyber terrorism, given as follows: Islamic State, caliphate editions, state of the caliphate, daesh, Battalion Okba-Ibn-Nafaa, African media.

  • Dataset 2: this second dataset contains 284 instances described by 10 categorical attributes [“tweet_date”, “screen name”, “tweet_id”, “in_reply_to_status_id”, “retweeted_status_id”, “reply_to_user_ID”, “user_verified”, “retweeted”, “user_tweet_location”, “hashtags”]. The tweets were collected from two specific user accounts: “Gzrawi” (131 tweets), for tweets posted by terrorist groups affiliated with “daech”, and “Daesh_Online_01” (153 tweets).

In this study, a categorical clustering algorithm is evaluated. Table 11 gives an example of how to evaluate the clustering results using the accuracy. The number of clusters K is initially defined and the clustering process is launched. In this step, it is required to define the class label of the groups, which was set to the attribute location, i.e. the place where the tweet was posted. A total of 15 labels was then identified, corresponding to various locations: Tunisia, Egypt, Algeria, Libya, Morocco, Yemen, London, Iraq, KSA, Lebanon, Turkey, Kuwait, Syria, Yemen and NULL if no country is identified. As an example, let us consider Table 11 where K = 5 groups are used:

Table 11 Clusters resulting from the segmentation process for K = 5

In Table 11, the most frequent label in each cluster is identified in the last line (max). These values are then summed and used to compute the accuracy: a = 0.87.
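
As an illustration of this computation, with hypothetical per-cluster counts (the cell values of Table 11 are not reproduced here) and assuming the N = 1803 tweets of dataset 1: if the majority labels of the five clusters occur 360, 330, 300, 290 and 289 times respectively, then

$$a = \frac{360 + 330 + 300 + 290 + 289}{1803} = \frac{1569}{1803} \approx 0.87$$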

Using the two datasets described above, the accuracy of the DRk-M and the k-modes was computed for various numbers of clusters K (5 → 10) and the obtained results are reported in Table 12.

According to the results given in Table 12, the DRk-M provided better results than the k-modes for the two datasets. For dataset 1, better results were obtained in all the experiments except for K = 7, where the same accuracy was computed, which can also be considered an acceptable result. For dataset 2, better results were obtained in all cases except for K = 5 and K = 10. Out of 24 experiments, the DRk-M provided better results in 18 cases (75%); in the other cases, the same accuracy was computed.

Table 12 Accuracy computed for the DRk-M and the k-modes for various N and K

In order to test the efficiency of the DRk-M for large datasets, the cardinality of dataset 2 (initially composed of 284 observations) was increased using several data copies. Thus, two datasets were generated with cardinalities N = 10³ and N = 15 × 10³. In this part of the experiments, the entropy and the NMI were used as evaluation metrics to compare the DRk-M with the k-modes. These experiments were conducted for various K (3 → 15) and the results are reported in Figs. 5 and 6:

Fig. 5 Entropy computed for the two algorithms for N = 10³ and K (3 → 15)

Fig. 6 NMI computed for the two algorithms for N = 15 × 10³ and K (3 → 15)

  • For N = 10³, the DRk-M provided a higher entropy than the k-modes in almost 46% of the cases.

  • For N = 15 × 10³, the DRk-M provided better clustering results, with higher NMI values than the k-modes, in 52% of the cases.
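
The enlarged datasets used in these experiments were obtained by replication; a minimal sketch of this up-scaling step, assuming dataset 2 is held in a pandas DataFrame, is given below.

```python
# Sketch (assumption): enlarge dataset 2 by replication until a target
# cardinality is reached, to build the N = 10^3 and N = 15 x 10^3 variants.
import pandas as pd

def upscale(df: pd.DataFrame, target_n: int) -> pd.DataFrame:
    copies = -(-target_n // len(df))  # ceiling division
    enlarged = pd.concat([df] * copies, ignore_index=True)
    return enlarged.iloc[:target_n].reset_index(drop=True)

# Hypothetical usage, with dataset2 holding the 284 original tweets:
# df_1k  = upscale(dataset2, 1_000)
# df_15k = upscale(dataset2, 15_000)
```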

In order to statistically validate the obtained results, additional experiments were conducted on the same datasets using the silhouette evaluation metric. The silhouette is a method for interpreting and validating the consistency within clusters of data: it indicates how well each object has been classified by measuring how similar an object is to its own cluster (cohesion) compared to other clusters (separation). In Table 13, the average accuracy computed over 50 runs of the DRk-M and the k-modes is reported. Two datasets were considered with various numbers of clusters K (5 → 10). Moreover, the standard deviation was computed for each set of 50 runs to quantify the confidence in the calculated average accuracy.

Table 13 Average accuracy, STD and average_Silhouette computed for the DRk-M and the k-modes for the two Twitter datasets for 50 runs of the algorithms

A higher value of the silhouette indicates a more compact and separated cluster. Thus, according to the results provided in Table 13, the clusters generated using the DRk-M for all the experiments are more compact and isolated than those generated using the k-modes.

In Fig. 7, the silhouette score is given on the y axis and each red point corresponds to the silhouette score computed for a given clustering. The k-modes and the DRk-M were executed 50 times using the Twitter dataset (N = 10³ and K = 5 → 10). The silhouette mean SC was also computed and is given as follows:

Fig. 7 Distribution of silhouette scores for various clusterings according to the number of clusters for the DRk-M and the k-modes (N = 10³, K: 5 → 10 and 50 runs)

$${\rm SC} = \max\limits_{k} \tilde{s}{\rm (k)}$$

In the comparison between the resulting clusters, the focus is on identifying the highest average silhouette score over the 50 runs conducted. For this purpose, a box plot representation was used: box plots make it possible to study the distribution of a group of scores as well as their level, and the mean silhouette of each set of experiments is easy to identify since it corresponds to the (+) sign. According to the results, the DRk-M provided a higher average silhouette than the k-modes, closer to one, which indicates that it performs better than the k-modes at producing compact and isolated clusters.
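
A box plot of this kind can be produced, for instance, with matplotlib. The sketch below assumes a dictionary scores[alg][k] holding the 50 silhouette values per algorithm and per number of clusters (both names are hypothetical), and uses showmeans=True to draw a mean marker analogous to the (+) sign mentioned above.

```python
# Sketch of the box-plot comparison; `scores[alg][k]` is assumed to hold the
# 50 silhouette values of algorithm `alg` ("kmodes" or "drkm") for k clusters.
import matplotlib.pyplot as plt

def plot_silhouette_boxes(scores, k_values=range(5, 11)):
    data, labels = [], []
    for k in k_values:
        for alg in ("kmodes", "drkm"):
            data.append(scores[alg][k])
            labels.append(f"{alg}\nK={k}")
    fig, ax = plt.subplots(figsize=(9, 4))
    # showmeans=True draws a mean marker, analogous to the (+) sign in Fig. 7.
    ax.boxplot(data, labels=labels, showmeans=True)
    ax.set_ylabel("Silhouette score")
    fig.tight_layout()
    plt.show()
```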

4.4 Experiments using the GTD datasets

In this section, the Global Terrorism Database (GTD) was used to assess the scalability of the DRk-M in terms of the execution time and the accuracy. The tests were conducted for several dataset cardinalities N (500 → 25 × 10³) and two numbers of clusters, K = 8 and K = 10. The execution time was computed and is reported in Fig. 8.

Fig. 8 Execution time computed for N (500 → 25 × 10³) and K = 8 and 10

According to the results given in Fig. 8, the DRk-M required more computational time than the k-modes for large datasets, while for small datasets the two algorithms provided almost the same running performance. This overhead is due to the time required to compute all the candidate modes in each cluster for the DRk-M, which implies scanning the whole dataset multiple times for each attribute. The run time could be reduced, and faster results obtained, by using more powerful machines in terms of memory or CPU.
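
Such an execution-time comparison can be reproduced with a simple wall-clock benchmark. The sketch below again uses the kmodes package as a stand-in for both algorithms, and gtd_subset is a hypothetical helper returning the first n rows of the GTD data.

```python
# Sketch of the wall-clock benchmark; `kmodes` is again assumed as a stand-in
# backend and gtd_subset(n) is a hypothetical helper returning n GTD rows.
import time
from kmodes.kmodes import KModes

def time_clustering(X, n_clusters, seed=0):
    start = time.perf_counter()
    KModes(n_clusters=n_clusters, init="Huang", n_init=1,
           random_state=seed).fit(X)
    return time.perf_counter() - start

# Hypothetical usage over increasing cardinalities (500 -> 25,000 rows):
# for n in (500, 1_000, 5_000, 10_000, 25_000):
#     print(n, time_clustering(gtd_subset(n), n_clusters=8))
```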

On the other hand, to statistically validate the obtained results, the clustering outcomes were evaluated using the average accuracy over 50 runs of the two algorithms. Two large datasets (N = 2 × 10⁴ and N = 25 × 10³) were considered for this purpose. The experiments were conducted for various numbers of clusters K (3 → 8), and each algorithm was executed 50 times with various initial centroids in order to deal with stability issues. The average accuracy, STD and silhouette were then computed over these 50 runs. The obtained results are reported in Table 14.

Table 14 Accuracy, STD and average_Silhouette computed for the DRk-M and the k-modes for 50 runs using the GTD dataset

Table 14 provides the average accuracy computed over 50 runs of the DRk-M and the k-modes for several numbers of clusters. The accuracy reports how well the observations are arranged in their corresponding clusters. According to the computed accuracy values, the DRk-M provided better results than the k-modes in all cases, which makes it more effective and efficient. Moreover, in order to statistically characterize the accuracy values, the standard deviation (STD) is used to evaluate the overall spread of the 50 values calculated for each case: a lower STD indicates values closer to the mean and thus more stable results. According to Table 14, the DRk-M provides better results in almost all cases, except for K = 5 and K = 7 where the values computed for the k-modes were better. The silhouette was also used to characterize the compactness and density of the generated clusters by measuring the distances between the observations arranged in them; a silhouette value closer to 1 indicates more compact and dense clusters. Once again, based on the results reported in Table 14, the DRk-M provided more accurate results than the standard k-modes.

5 Conclusion

Categorical clustering has gained great interest since the development of the k-modes algorithm. This algorithm has a major limitation related to the update of the modes in the last step of the process. In this paper, the RST was used as an uncertainty-based model to identify the most accurate modes in a list of candidate ones when performing categorical clustering. This makes it possible to avoid the random selection of the modes used in all state-of-the-art k-modes based methods. The proposed DRk-M is based on computing the density of each candidate mode, which characterizes the number of observations that are closest to it; modes with a high density have a higher probability of being selected as centroids of their cluster. In the experiments, multiple datasets with various configurations were used to assess the efficiency of the DRk-M, and it was experimentally demonstrated that the DRk-M provides promising results. However, one main limitation of the DRk-M is its computational time, which is considerable compared with the k-modes because more arithmetic computations are necessary to build the list of all the candidate modes in each cluster. Since the DRk-M is more flexible and less exclusive than the k-modes in the selection of the centroids of each cluster, the algorithm provides more efficient results and thus boosts the accuracy.