1 Introduction

Clustering is an unsupervised learning task that groups similar data points into the same category. Various clustering algorithms with different capabilities and structures have been proposed [1,2,3,4,5,6]. In high-dimensional data, each object has a large number of features. Examples of high-dimensional data can be found in computer vision applications, pattern recognition [7], and molecular biology [8].

Scalability is one of the major issues in clustering algorithms, and it often causes difficulties when dealing with high-dimensional datasets [3, 5, 9]. High-dimensional data suffer from the curse of dimensionality: the distances between data points become practically indistinguishable, so it is hard to separate similar data points from dissimilar ones. Some clustering algorithms suffer from high time and space complexity when clustering high-dimensional datasets, which usually causes them to fail. In addition, high-dimensional data contain many irrelevant features, meaning that clusters are embedded in subspaces of the entire feature space. Traditional clustering algorithms, such as K-means, hierarchical clustering, DBSCAN, UALM [10], and FUALM [11], were not originally designed for high-dimensional data, and they often fail when applied to such datasets due to the “curse of dimensionality”. Therefore, many concepts have been proposed for clustering high-dimensional data, such as projected clustering, subspace clustering, multi-view clustering (MVC), ensemble clustering, and hierarchical clustering for high-dimensional data.

Subspace-clustering algorithms, projected clustering algorithms [12,13,14,15], and MVC were introduced to tackle the difficulties of high-dimensional clustering. The main goal of these algorithms is to find clusters within subspaces of the entire feature space. Subspace clustering aims to find all clusters in all subspaces, which means that a data point may belong to multiple clusters [16]; this results in overlapping clusters. Projected clustering, on the other hand, allocates each point to a unique cluster, thus producing non-overlapping clusters. Both families have drawbacks. Because subspace-clustering algorithms produce large numbers of overlapping clusters, interpreting their results is complicated. Despite producing non-overlapping clusters, projected clustering algorithms still have two limitations: first, the resulting subspace clusters have different dimensionality; second, they have difficulty finding clusters of different shapes and densities [17]. MVC is a form of subspace clustering that aims to group similar subjects and separate dissimilar subjects by exploiting multiple sources of feature information; the goal is to find a clustering that is consistent across the different views. Existing MVC methods fall into two main categories: generative (or model-based) approaches and discriminative (or similarity-based) approaches. Generative approaches focus on learning the data distribution and exploit generative models to represent each cluster. Discriminative approaches optimize an objective function over pairwise similarities to maximize the average similarity within clusters and minimize the average similarity between clusters [18]. In addition, the time complexity of some of these algorithms grows as the dimensionality of the dataset increases. Hence, subspace, projected, and MVC algorithms all have restrictions in dealing with high-dimensional data, and these restrictions show that a flexible high-dimensional clustering algorithm with a reasonable degree of generality is needed.

Ensemble clustering was developed as an important extension of the classical clustering problem. It addresses the challenges posed by high-dimensional data and achieves high performance on various datasets. When facing high-dimensional data, it divides the space into a series of subspaces and clusters each subspace separately; clustering on subspaces reduces the complexity of the problem. Ensemble clustering combines the results of different clusterings of a particular dataset and finds a single (consensus) clustering that is, in some sense, better than any of the existing clusterings. In other words, ensemble clustering integrates clustering results on the same dataset from different sources. In ensemble clustering, finding the final cluster for each point is an NP-complete problem [16].

Hierarchical clustering is a method for clustering high-dimensional data that builds a hierarchy of clusters. There are two strategies: agglomerative and divisive. The agglomerative strategy is a “bottom-up” approach in which each data point starts in its own cluster, and clusters are merged hierarchically until all data fall into one cluster. The divisive strategy is a “top-down” approach in which all data start in one cluster, and clusters are split hierarchically until each data point falls into a separate cluster. The results of hierarchical clustering are usually presented in a dendrogram.

The majority of the mentioned clustering algorithms require the number of clusters to be known in advance, which is not compatible with the nature of many clustering problems. In this paper, we combine ensemble clustering with hierarchical clustering to overcome the problems associated with high-dimensional datasets. The proposed algorithm does not need the number of clusters as an input parameter. It is developed based on the concepts of the Active Learning Method (ALM) [19] and improves the UALM and FUALM clustering algorithms, which were developed for low-dimensional datasets.

In this paper, a hierarchical clustering algorithm with two phases, a divisive phase and an agglomerative phase, called HiDUALM (High-Dimensional Unsupervised Active Learning Method), is proposed. In the divisive phase, a zooming-in process is applied to each already-found cluster to find sub-clusters hierarchically. At each level of the hierarchy, an ensemble of projected clusterings is performed, which breaks the feature space into several single-feature spaces. In each dimension, the data points are blurred into one-dimensional fuzzy membership functions (1-D MFs), called ink-drop patterns, and mapped onto the corresponding feature axis. The ink-drop patterns of the samples are aggregated on each feature axis; this process is called “ink drop spread” (IDS). As individual ink-drop patterns aggregate, the intensity of overlapping portions becomes higher than that of non-overlapping portions. The intensity values of each feature are stored in a one-dimensional vector, named the IDS-vector. After the IDS-vectors are formed, a projected clustering algorithm is executed on each IDS-vector separately. The bar graph of an IDS-vector has some maxima and minima; the data between two adjacent minima are labeled as one cluster in the projected feature space. Finally, a novel ensemble method is used to find the clusters in the entire feature space. This ensemble clustering is performed at each level of the hierarchy to find sub-clusters divisively. The second phase of the algorithm is the agglomerative phase, where the formed clusters are combined based on a new distance metric called the \({K}^{2}\)-nearest neighbor distance. The proposed two-phase hierarchical algorithm shows acceptable evaluation measures compared with PROCLUS, CLIQUE, DOC, and k-means projective clustering, which are well-known high-dimensional clustering algorithms. It also outperforms the MV-RTSC and IRFLLRR clustering algorithms, which are state-of-the-art clustering algorithms. Furthermore, unlike subspace clustering, it produces non-overlapping clusters at each level of the hierarchy; hence, interpreting the results is straightforward. It also overcomes the limitation of the UALM [10] clustering algorithm, which cannot cluster high-dimensional datasets. Combining divisive and agglomerative hierarchical clustering allows the proposed algorithm to find clusters with different shapes and densities. The innovations of the paper are:

  1.

    The method presented in this paper offers different hierarchical solutions for clustering at different hierarchy levels. Moreover, the algorithm combines agglomerative and divisive hierarchical clustering, and the distance metrics used in these two phases differ from each other, which leads to acceptable hierarchical clustering results.

  2.

    The UALM and FUALM algorithms (ALM-based clustering algorithms) are efficient and fast clustering algorithms for datasets with many samples, but they face memory limitations when clustering high-dimensional data. The hierarchical method HiDUALM proposed in this paper improves these algorithms and overcomes this limitation.

  3.

    Unlike many other high-dimensional clustering algorithms, the proposed algorithm does not need iterative procedures to find optimal parameter values, nor does it require prior information about the clusters (such as the number of clusters). Furthermore, unlike almost all other algorithms, the training data are entered into the system only once (one epoch).

  4.

    A novel distance metric between two clusters, the \({K}^{2}\)-nearest neighbor distance, is proposed in this paper; it leads to better clustering results than other distance metrics. The advantage of the proposed distance metric is illustrated in Sect. 4, where Fig. 8 shows the clustering results on the Spiral dataset with different distance metrics.

The rest of the paper is organized as follows. Sect. 2 discusses related works. Sect. 3 reviews ALM concepts. The HiDUALM algorithm is described in detail in Sect. 4. In Sect. 5, experimental results on various datasets are compared with related clustering algorithms, and a parameter sensitivity analysis and a noise-robustness analysis of the algorithm are performed. Finally, Sect. 6 summarizes and concludes the paper.

2 Related Works

In this section, related clustering algorithms such as ensemble clustering, subspace clustering and projected clustering algorithms are reviewed. In addition, various cluster distance measures are reviewed.

2.1 Subspace vs. Projected Clustering

Subspace and projected clustering are clustering approaches designed to overcome the difficulties of traditional clustering algorithms when confronting high-dimensional datasets. In subspace-clustering algorithms, a data point may belong to several clusters in different subspace projections; in contrast, in projected clustering algorithms, each data point is assigned to only one cluster in one subspace. Aggarwal et al. introduced subspace clustering with the CLIQUE approach [12]. LatLRR [20] is also a subspace-clustering algorithm, which performs clustering by iteratively minimizing a reweighted Frobenius norm while removing sparse noise and redundant information. IRFLLRR [20] improves the robustness of learning the subspace structure while keeping the effectiveness of the LatLRR model; it introduces the iterative reweighted Frobenius norm (IRFN) into the LatLRR framework. Projected clustering algorithms, on the other hand, are clustering methods that recognize separate clusters in subspace projections. PROCLUS (PROjected CLUStering) [14] is a well-known projected clustering algorithm that modifies k-Medoids. The PROCLUS algorithm discovers the subspace dimensions of each cluster by finding the space adjacent to that cluster. The objective function that PROCLUS tries to satisfy is a function of the number of clusters to be detected and the average dimensionality of the clusters. DPC [21] is a parameter-free projective clustering model; it is a constrained regression model that aims to find a transformation matrix and a binary indicator matrix that minimize the sum-of-squares error. APCGR [22] is an adaptive projected clustering method with graph regularization, in which the similarity-matrix computation and the clustering process are conducted simultaneously. LPFCM [23] is another projected clustering algorithm, based on FCM, that preserves the locality structure. MV-RTSC [24] is a projected clustering algorithm that computes a common subspace among all views of a multi-view dataset, optimizing the contribution of each view to the common subspace.

The CLIQUE algorithm discovers clusters by dividing each dimension into equal-width intervals and considering the intervals with a density greater than a threshold as clusters. Next, each pair of dimensions is examined: if there are intersecting intervals in the two-dimensional space and the density in the intersection is greater than the threshold, the intersection is again considered a cluster. This procedure is repeated for all dimensions; after every step, the joint cluster is saved instead of the adjacent lower-dimensional clusters. The procedure continues until no higher-dimensional cluster can be formed, and all saved clusters are then reported as the final clusters.

The PROCLUS algorithm works similarly to k-Medoids. Initially, a set of k medoids is chosen. Then, the subspace spanned by the attributes with low variance is determined for each medoid. After that, medoids that are most likely outliers, and medoids that belong to a cluster better represented by another medoid, are removed until k medoids are left. The clusters are then assumed to lie around these medoids.

In DOC (Density-based Optimal Projective Clustering), a Monte Carlo algorithm is used to compute approximations of the best projective clusters. The algorithm discovers the best projective cluster from the remaining points by guessing points belonging to the optimal cluster (via random sampling) and then computing the best dimensions associated with the cluster. A projective cluster is defined as a pair (C, D), where C is a subset of the data set and D is a subset of the features (dimensions) of the data space. DOC has two parameters, ω and α. In the DOC algorithm, a projective cluster (C, D) is optimal if C contains more than α% of the data set and the projection of C onto the subspace defined by D lies within a hyper-cube of width ω; the projection of C onto the remaining dimensions d ∉ D need not lie within a hyper-cube of width ω. β is a third parameter of the algorithm, which balances the number of points in C against the number of dimensions in D.

A new objective function for projective clustering is used in k-means projective clustering, which considers a trade-off between the induced clustering error and the dimension of a subspace. K-means projective clustering also uses an extension of the k-means algorithm for projective clustering in arbitrary subspaces, where the dimension of each cluster is chosen independently and automatically by the algorithm. Given a point set P in \({R}^{d}\), the number of projective clusters k, and a sequence \(Q=\langle {q}_{1},\dots ,{q}_{k}\rangle \) of required flat dimensions, the algorithm attempts to find a k-partition that minimizes a new distance measure using local improvement steps.

The SC-SRGF algorithm utilizes subspace randomization and graph fusion to perform spectral clustering on high-dimensional data. Initially, random subspaces are generated by randomly sampling the original feature space. Next, multiple K-nearest neighbor (K-nn) affinity graphs are created to capture the local structures in the generated subspaces. To combine the affinity graphs from multiple subspaces, an iterative similarity network fusion scheme produces a unified graph for the final spectral clustering.

MV-RTSC is a type of projected clustering that aims to identify a common subspace among all views in multi-view datasets while optimizing the contribution of each view. To achieve this, the approach constructs a 3-mode tensor using normalized adjacency matrices representing the different views. The tensor is then decomposed into self-representation and error components, with the self-representation tensor being used to detect the community structure of the multi-view network. Additionally, a common subspace is computed among all views, with the contribution of each view to the common subspace being optimized.

LatLRR is a subspace-clustering algorithm that performs clustering by iteratively minimizing the reweighted Frobenius norm while eliminating sparse noise and redundant information. IRFLLRR enhances the robustness of the subspace-structure learning while maintaining the effectiveness of the LatLRR model. It achieves this by incorporating the iterative reweighted Frobenius norm (IRFN) into the LatLRR framework; in particular, the reweighting strategy preserves the desired structural information while eliminating sparse noise and redundant information.

2.2 Ensemble Clustering

Ensemble clustering was developed to overcome problems of classical clustering algorithms [25] by combining several different clustering results into a single (consensus) clustering solution [26]. This procedure improves the robustness, accuracy, and quality of the final clustering result. Consensus clustering, clustering aggregation, and clustering combination [26,27,28,29,30] are other names for this procedure. Ensemble clustering overcomes the computational challenges associated with high-dimensional data and achieves high performance on real-world datasets [31,32,33].

Every ensemble clustering algorithm comprises two steps: cluster generation and cluster ensemble. The first step takes a dataset as input and outputs various clustering solutions. Once a collection of clustering results has been generated, the second step applies an appropriate integration function to combine them and produce a final clustering result.

The second step, the cluster ensemble, is the challenging part [26, 34] for the following reasons. First, the data points (objects) have no associated labels. Second, each base clustering algorithm may produce a different number of clusters. Third, the cluster labels are symbolic. To combine the results of the base clusterings, an ensemble clustering algorithm has to overcome these problems [35].

Many methods have been used to generate the ensemble members, such as using different clustering algorithms, changing the parameters of a clustering algorithm, and using different sets of features. Notably, the method of projecting objects onto different feature spaces is related to subspace clustering and has been used in several ensemble clustering algorithms [27, 34, 36,37,38,39].

The key step in any ensemble clustering algorithm is combining the generated partitions with a consensus function to produce the final result. Defining corresponding labels across different partitions is not simple. Despite this difficulty, several consensus functions have been proposed; the best-known ones are based on co-association [40], graphs [41,42,43], mixture models [44], mutual information [45], and voting [46, 47]. In this paper, a novel data-labeling procedure for combining multiple clustering solutions is proposed, which is described in Sect. 4.

2.3 Cluster Distance Measures

In agglomerative high-dimensional clustering algorithms, a distance measure between clusters is needed to decide which clusters to combine. Several such measures have been used, including the following:

Single Linkage: In a single linkage, the distance between two clusters is defined as the minimum distance between any single data point in one cluster and any single data point in another.

Let \({X}_{1},{X}_{2}, \dots ,{X}_{m}\) be the samples of cluster 1, \({Y}_{1},{Y}_{2},\dots ,{Y}_{l}\) be the samples of cluster 2, and \(d\left({\varvec{x}},{\varvec{y}}\right)\) be the Euclidean distance between a subject with observation vector \({\varvec{x}}\) and a subject with observation vector \({\varvec{y}}\). The single linkage is \(d = {\text{min}}_{i,j} d\left( {X_{i} ,Y_{j} } \right)\).

Complete Linkage: In complete linkage, the distance between two clusters is defined as the maximum distance between any single data point in one cluster and any single data point in another cluster. The complete linkage is \(d = {\text{max}}_{i,j} d\left( {X_{i} ,Y_{j} } \right)\).

Average Linkage: In average linkage, the distance between two clusters is defined as the average distance between data points in one cluster and data points in another. The average linkage is \(d=\frac{1}{ml}\sum_{i=1}^{m}\sum_{j=1}^{l}d({X}_{i},{Y}_{j})\).

Centroid Method: In the centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. The centroid distance is defined as \(d=d\left(\overline{x },\overline{y }\right),\) where \(\overline{x }\) and \(\overline{y }\) are the mean vectors of cluster 1 and cluster 2, respectively.

k-Centroid Link Method: In the k-centroid-link method, the distance between two clusters is defined as the average distance between all pairs of the k data objects in each cluster that are closest to the centroid of that cluster [48]. Let \({X}_{1},\dots ,{X}_{k}\) be the \(k\) samples of the first cluster that are closest to the center of the first cluster, and \({Y}_{1},\dots ,{Y}_{k}\) be the \(k\) samples of the second cluster that are closest to the center of the second cluster. The k-centroid-link distance is defined as \(d=\frac{1}{{K}^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k}d({X}_{i},{Y}_{j})\).

\({{\varvec{K}}}^{2}\)-nearest neighbor method: In this paper, a new distance measure between two clusters, named the \({K}^{2}\)-nearest neighbor distance, is proposed. Here, the distance between two clusters is defined as the average distance between all pairs of the k data objects in each cluster that are closest to the other cluster. Let \({X}_{1},\dots ,{X}_{k}\) be the \(k\) samples of the first cluster that are closest to the samples of the second cluster, and \({Y}_{1},\dots ,{Y}_{k}\) be the \(k\) samples of the second cluster that are closest to the samples of the first cluster. The \({K}^{2}\)-nn distance is defined as \(d=\frac{1}{{K}^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k}d({X}_{i},{Y}_{j})\).
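To make the definition concrete, the following Python sketch (our illustration, not code from the paper; the function name k2nn_distance, the use of NumPy, and the reading of "closest to the other cluster" as "smallest distance to any sample of the other cluster" are assumptions) computes the \({K}^{2}\)-nearest-neighbor distance between two clusters:

```python
import numpy as np

def k2nn_distance(X, Y, k):
    """K^2-nearest-neighbor distance between clusters X (b x n) and Y (g x n).

    The k samples of X closest to cluster Y and the k samples of Y closest
    to cluster X are selected; the average of the resulting k*k pairwise
    Euclidean distances is returned.
    """
    # full pairwise Euclidean distance matrix, shape (b, g)
    A = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    k = min(k, *A.shape)                      # guard against small clusters
    x_sel = np.argsort(A.min(axis=1))[:k]     # k samples of X closest to Y
    y_sel = np.argsort(A.min(axis=0))[:k]     # k samples of Y closest to X
    return A[np.ix_(x_sel, y_sel)].mean()     # average of the k*k distances

# toy usage: two small 2-D clusters
rng = np.random.default_rng(0)
C1 = rng.normal(loc=0.0, size=(20, 2))
C2 = rng.normal(loc=5.0, size=(20, 2))
print(k2nn_distance(C1, C2, k=5))
```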

3 Active Learning Method (ALM)

As the proposed algorithm is conceptually based on the ALM, the basic concepts of the ALM are briefly reviewed in this section. For more information about ALM, the reader is referred to [49, 50].

One of the unique algorithms in fuzzy logic is the ALM [19]. ALM breaks multi-input single-output (MISO) problems into several single-input single-output (SISO) problems and then aggregates the results of these SISO problems, which allows it to solve complex problems with relatively low computational complexity. It has been used in many applications, such as modeling [51,52,53,54,55,56], control [57,58,59,60], classification [61,62,63,64,65], and clustering [10, 11, 66, 67]. This paper proposes a novel high-dimensional fuzzy clustering algorithm based on the main concepts of the ALM. It is worth mentioning that active learning, which is a type of semi-supervised learning, is conceptually different from the ALM.

ALM was introduced based on the learning function of the human brain as a fuzzy analyzer. The idea is to consider a fuzzy membership function for each data point instead of its exact value. ALM also analyzes a complex problem as several simple problems and then combines the results, which again resembles the activity of the human brain. In ALM, a multiple-inputs-single-output (MISO) system is broken into several single-input-single-output (SISO) subsystems; an aggregation mechanism then produces the final output (Fig. 1).

Fig. 1
figure 1

A multiple-inputs-single-output (MISO) system is broken into several single-input–single-output (SISO) subsystems in ALM; then, the aggregation mechanism produces the final output

Each SISO subsystem is expressed by projecting the data onto a two-dimensional input–output plane called the IDS-plane. ALM considers the “ink” as a fuzzy membership function that simulates the effect of each data point on its neighborhood (Fig. 2a). Aggregating the “ink” patterns of a SISO subsystem on a plane forms an IDS-plane. The intensity of the aggregated ink on the IDS-plane at coordinates \(\left(x,y\right)\) is called the darkness value \(d\left(x,y\right)\) (Fig. 2b). Each IDS plane provides two informative features, called the Narrow-Path and the Spread, which are used by the inference engine to generate the final result. Consider \({IDS}_{i}=\left\{\left(x,y\right), x\in {X}_{i}, y\in Y\right\}\) as the IDS plane for the XiY plane, where \({X}_{i}\) and \(Y\) are the intervals \([{{x}_{i}}_{\text{min}},{{x}_{i}}_{\text{max}}]\) and \([{y}_{\text{min}},{y}_{\text{max}}]\), respectively. Spreading the ink pattern related to the \(j\)th sample \(p({x}_{i,k,j},{y}_{h,j})\) changes the darkness value as in (1), where \(k,h\in \left[\text{1,2},\dots ,m\right]\) and \(m\) is the resolution of the grid in every input dimension and the output.

Fig. 2
figure 2

a Two inks, related to samples 2 and 3, are diffused on the X1Y plane. b A simple IDS plane. After applying the IDS-operator, two important pieces of information are extracted from this IDS plane: the narrow-path and the spread

$$\Delta d\left({x}_{i,k,j}+u,{y}_{h,j}+v\right)=h\left(u,v\right), -Ir\le u,v\le Ir$$
(1)

\(\Delta d\) is the change in the darkness of the coordinate \(\left({x}_{i,k,j}+u,{y}_{h,j}+v\right)\). \(Ir\) determines the radius of the ink effect on its neighborhood, and h is a function that describes the shape of the ink drop, which could be triangular or Gaussian.

The Narrow-Path can be extracted by various methods, such as the maximum, weighted-average, or median method. Equation (2) shows how the Narrow-Path is obtained with the weighted-average method.

$$\phi \left({x}_{i,k}\right)=\left\{b\in Y\,\middle|\,\sum_{h=1}^{b}d\left({x}_{i,k},{y}_{h}\right)\approx \sum_{h=b}^{m}d\left({x}_{i,k},{y}_{h}\right)\right\},$$
(2)

where \({x}_{i,k}\in {X}_{i}\) and \(\left({x}_{i,k},\phi \left({x}_{i,k}\right)\right)\) denotes the Narrow-Path, \(d({x}_{i,k},{y}_{h})\) is the darkness value of coordinate \(({x}_{i,k},{y}_{h})\). On the other hand, the Spread can be computed as in (3).

$$\sigma \left({x}_{i,k}\right)=max\left\{h\in \left[\text{1,2},\dots ,m\right]|d\left({x}_{i,k},{y}_{h}\right)>Th\right\}-min\left\{ h\in \left[\text{1,2},\dots ,m\right]|d\left({x}_{i,k},{y}_{h}\right)>Th\right\}$$
(3)

where \(Th\) is a threshold value set by the user (usually \(Th=0\) for modeling purposes).

Figure 2a shows an IDS plane after applying the IDS-operator to two data samples. Figure 2b shows the Narrow-Path and the Spread on the XiY plane, where \({X}_{i}\) is the \(i\)th dimension of the dataset. The Narrow-Path shows the overall relationship between \({x}_{i}\) and \(y\). The effectiveness of the \(i\)th input in predicting the output is described by a function of the Spread along the Narrow-Path (usually \(\frac{1}{Spread}\)), called the degree of certainty \(\beta (x)\). A lower Spread in the XiY plane indicates a strong relation between \({x}_{i}\) and \(y\); a wider Spread indicates a weak relation. In this paper, we take the inverse of the Spread as the degree of certainty \(\beta \left(x\right)\), which the inference engine uses as the weight of the corresponding Narrow-Path when predicting the final output \(y\). In addition, the ALM algorithm partitions the domain of the inputs to obtain sparser subspaces, which reduces the Spread values (i.e., increases the degrees of certainty) and therefore yields more accurate estimates of \(y\) (Fig. 3); this is the fuzzy-partitioning concept.
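As an illustration of Eqs. (1)-(3), the following Python sketch builds a small IDS plane from triangular ink drops and extracts the Narrow-Path and the Spread; the grid resolution, the triangular ink shape, and the function names are our assumptions, not the reference implementation of the ALM:

```python
import numpy as np

def ids_plane(x, y, m=64, ink_radius=3):
    """Darkness matrix d(x, y) for one SISO (x_i, y) pair: each sample drops a
    triangular ink of radius ink_radius on the m x m grid (Eq. 1); inks are summed."""
    gx = np.round((x - x.min()) / (x.max() - x.min() + 1e-12) * (m - 1)).astype(int)
    gy = np.round((y - y.min()) / (y.max() - y.min() + 1e-12) * (m - 1)).astype(int)
    d = np.zeros((m, m))
    offs = np.arange(-ink_radius, ink_radius + 1)
    tri = 1.0 - np.abs(offs) / (ink_radius + 1)        # triangular ink shape h(u, v)
    for cx, cy in zip(gx, gy):
        for du, wu in zip(offs, tri):
            for dv, wv in zip(offs, tri):
                u, v = cx + du, cy + dv
                if 0 <= u < m and 0 <= v < m:
                    d[u, v] += wu * wv
    return d

def narrow_path_and_spread(d, th=0.0):
    """Narrow-Path per the balance condition of Eq. (2) and Spread per Eq. (3)."""
    phi = np.zeros(d.shape[0], dtype=int)
    sigma = np.zeros(d.shape[0], dtype=int)
    for k in range(d.shape[0]):
        col = d[k]
        cum = np.cumsum(col)
        # grid index b where the darkness mass below and above are (roughly) equal
        phi[k] = np.argmin(np.abs(cum - (cum[-1] - cum + col)))
        lit = np.where(col > th)[0]
        sigma[k] = lit.max() - lit.min() if lit.size else 0
    return phi, sigma

# e.g. a noisy SISO relation y = 2x: the narrow-path tracks it, the spread stays small
rng = np.random.default_rng(1)
x = rng.uniform(size=200)
phi, sigma = narrow_path_and_spread(ids_plane(x, 2 * x + 0.1 * rng.normal(size=200)))
```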

Fig. 3
figure 3

Structure of a two-input one-output ALM. The input layer breaks the input space into different subspaces by fuzzy membership functions. The modeling layer (IDS units) models the overall behavior of each SISO subspace; its outputs are the narrow-path and the spread. The inference layer combines the results of each IDS plane by the fuzzy inference Eq. (4). \(\beta (x)\) is the degree of certainty, \(\phi \) is the narrow-path, and \(\sigma \) is the spread

The ALM algorithm has three layers, as shown in Fig. 3: the input layer, the modeling layer, and the inference layer. Fuzzy partitioning is done in the input layer, where the membership degrees of each point to the partitions are obtained. The IDS-planes are generated in the second layer, called the modeling layer. Two essential features, the Narrow-Path (\(\phi \) in Fig. 3) and the Spread value (\(\sigma \) in Fig. 3), are extracted from each IDS-plane. In the inference unit of the ALM, a rule base is generated according to the fuzzy partitions. Finally, the overall input–output modeling surface is produced based on the Narrow-Paths, Spreads, and membership degrees of the fuzzy rules. Equation (4) shows how the inference layer combines this partial knowledge.

$$y\left(x\right)={\beta }_{11}{\phi }_{11}+\dots +{\beta }_{ik}{\phi }_{ik}+\dots +{\beta }_{n{l}_{n}}{\phi }_{n{l}_{n}}$$
(4)

where \({\beta }_{ik}\) is proportional to the inverse of the Spread and denotes the degree of confidence in the Narrow-Path \({\phi }_{ik}\), and \({l}_{n}\) is the number of IDS planes of the \(n\)th input.
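A minimal sketch of the inference step in Eq. (4) is shown below; normalizing the \(\beta \) weights so that they sum to one is our assumption (the paper only states that \(\beta \) is proportional to the inverse of the Spread):

```python
import numpy as np

def alm_inference(narrow_paths, spreads):
    """Eq. (4): y = sum_i beta_i * phi_i, with beta taken as the normalized
    inverse spread of the rule that fires on each IDS plane."""
    beta = 1.0 / (np.asarray(spreads, dtype=float) + 1e-12)
    beta /= beta.sum()                         # normalized confidence degrees
    return float(np.dot(beta, narrow_paths))

# two IDS planes predict y = 3.0 and y = 5.0 with spreads 2 and 8:
print(alm_inference([3.0, 5.0], [2, 8]))       # result is pulled toward 3.0
```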

4 Proposed Algorithm (HiDUALM)

In this section, a novel hierarchical clustering algorithm called HiDUALM is proposed, which has two hierarchical phases, a divisive phase and an agglomerative phase; Fig. 4 shows the two phases of the proposed algorithm. The fundamental concepts of the proposed clustering algorithm are derived from the ALM. HiDUALM is developed to deal with high-dimensional datasets. In the divisive phase of the algorithm, a zooming-in process is performed by finding sub-clusters of already-found clusters. At each hierarchy level of the divisive phase, an ensemble clustering method is used: the high-dimensional data are broken into several one-dimensional datasets, and the final clustering result of this phase is obtained by combining the results of the one-dimensional clusterings. The Clustering Unit (C.U.) of the first phase is shown in more detail in Fig. 5. As Fig. 5 shows, each dimension is clustered separately by an IDS unit; then, all 1-D clusterings are merged in the Data-Labelling-Unit (DLU), which groups data points with equal label products into the same cluster. The clusters formed by the divisive phase suffer from two significant problems: first, only convex-shaped clusters can be found; second, the number of clusters may become larger than the actual number of clusters. Therefore, a merging process is used to combine the clusters produced in this phase. This merging is done in the second phase of the algorithm (the agglomerative phase). In the agglomerative phase of HiDUALM, the clusters produced in the divisive phase are hierarchically merged based on a novel distance metric called the \({K}^{2}\)-nearest neighbor distance. To prevent the distance from being biased by one or two points, we compute the average distance of the \(K\) nearest samples of a cluster to the \(K\) nearest samples of the other cluster, which we call the \({K}^{2}\)-nearest neighbor distance. Clusters whose average distance is less than a threshold are then combined. The different distance metrics used in the divisive and agglomerative phases of HiDUALM make the algorithm better able to cluster datasets containing hidden clusters of different shapes and densities. Each phase of HiDUALM also includes several novelties, which are described in the following paragraphs.

Fig. 4
figure 4

Two phases of the proposed hierarchical clustering algorithm (with L1 and L2 levels). The algorithm breaks each cluster in a hierarchical divisive process over L1 levels and then combines the resulting clusters in a hierarchical agglomerative process over L2 levels

Fig. 5
figure 5

The clustering unit (C.U.) structure. For each dimension, the inks of the data points are spread, resulting in one IDS-vector per dimension. Then, each IDS-vector is clustered separately. The results of each 1-D clustering are fed into the Data-Labelling-Unit (DLU), where they are combined. The outputs of the Data-Labelling-Unit are the sub-clusters found at each level of the divisive phase

One of the novelties of the first phase of the algorithm is the use of one-dimensional IDS-vectors, which are conceptually borrowed from the ALM algorithm. The 1-D IDS-vectors are formed by considering an ink for each data point. Aggregating the inks on the IDS-vectors results in a smooth diagram, in contrast to clustering algorithms that consider the raw frequency of data (MAFIA, EPCH, etc.). Using a fuzzy membership function for each data point acts like running a low-pass filter over the data frequency (Lemma 1). The smoothness of the resulting IDS-vectors makes it easy to find the clusters located between every two minima of the IDS-vectors by simply finding those minima.

The second novelty of the first phase of the algorithm is the use of prime numbers to ensemble the clusters. This ensemble method is based on multiplying prime numbers and is implemented in two steps. In the first step, the dense partitions of the 1-D IDS-vectors are found, and unique prime numbers are assigned as labels to the data points of each dense partition (Fig. 6). After clustering each 1-D IDS-vector, each data point has \(n\) labels, where \(n\) is the number of dimensions of the dataset. In the second step, to ensemble the clusters, the labels (which are prime numbers) are multiplied; the data points with the same product belong to the same cluster. Lemma 2 proves that, to obtain a correct consensus clustering result, the labels of each initial clustering should be unique prime numbers. The ensemble method is also illustrated with an example in Fig. 6 and in the sketch below.
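The labeling step can be sketched as follows (a toy illustration with hypothetical helper names; any source of distinct prime numbers would do):

```python
def ensemble_by_primes(per_dim_labels):
    """Combine n one-dimensional clusterings into a consensus clustering.

    per_dim_labels: list of n label arrays, one per feature. The partitions of
    every 1-D clustering are relabeled with globally unique primes, and the
    labels of each point are multiplied; equal products mean the points share
    the same partition in every dimension, i.e. the same consensus cluster.
    """
    primes = iter([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37])  # enough for this toy case
    products = [1] * len(per_dim_labels[0])
    for labels in per_dim_labels:
        mapping = {part: next(primes) for part in sorted(set(labels))}
        for j, part in enumerate(labels):
            products[j] *= mapping[part]
    return products

# two features, four points: feature 1 splits {0,1 | 2,3}, feature 2 splits {0,2 | 1,3}
print(ensemble_by_primes([[1, 1, 2, 2], [1, 2, 1, 2]]))
# -> [10, 14, 15, 21]: four distinct products, hence four consensus clusters
```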

Fig. 6
figure 6

The proposed data-labeling method to ensemble clusters was applied to the Aggregation dataset. Unique prime numbers are assigned to each dense partition of 1-D IDS-vectors as labels. Then, the labels (prime numbers) are multiplied. Consequently, the data points with the same multiplied results belong to the same clusters

The other characteristic of the first phase of the algorithm is the use of a zooming process to find sub-clusters of already-found clusters, which gives this phase its divisive hierarchical property. In this process, the data points of each already-found cluster are separated further and further, hierarchically. The zooming process continues until no sub-cluster can be extracted from a cluster, or until the predefined number of hierarchy levels (iterations) is exceeded. In our experience, in most cases no more than three levels of hierarchy are required to find the optimal result. The zooming process of the algorithm is shown in Fig. 7 on the Aggregation dataset. As Fig. 7 shows, the HiDUALM algorithm clusters the Aggregation dataset after three levels, which demonstrates that the first phase of the algorithm works well for clusters with convex shapes. As Fig. 7a shows, at the first level four clusters are formed. At the second level, two of these clusters are split into two sub-clusters each (Fig. 7b and e). At the third level, one cluster is divided into two sub-clusters (Fig. 7g).

Fig. 7
figure 7

The three levels of the first phase of the HiDUALM algorithm on the Aggregation dataset, showing the zooming process of the algorithm. Each figure has four parts: the upper-right part shows the data distribution of the un-clustered data; the upper-left and lower-right parts show the IDS-vectors of the first and second features, respectively; the lower-left part shows the clustering result. At zooming-in Level 0, four clusters are found. At zooming-in Level 1, two of them are split into two sub-clusters each, while no sub-clusters are found for the other two clusters. At zooming-in Level 2, one of the clusters generated at the previous level is split into two sub-clusters. Consequently, the algorithm divisively finds 7 clusters for the Aggregation dataset after three hierarchy levels

We also use a novel distance metric, called the \({K}^{2}\)-nearest neighbor distance, in the second phase of the algorithm, which is a hierarchical agglomerative phase. As a criterion for combining two separate clusters, the average distance between the \(K\) nearest samples of one cluster and the \(K\) nearest samples of the other is computed. If this distance is below a threshold (\(D\)), the two clusters are combined. The proposed distance metric gives the algorithm two advantages. First, it prevents the algorithm from being biased by one or two samples of a cluster that happen to be very close to many samples of another cluster: the \({K}^{2}\)-nn distance measure allows such close points to contribute only \(K\) nearest neighbors to the average distance, so the distance is not dominated by them, while the other points still participate in computing the average. Second, it allows clusters of complex shapes to be found (such as in the Spiral dataset, Fig. 8). Since it uses an overall distance metric, clusters that are close in the full-dimensional space are combined, whereas outliers and sparse parts of the clusters have little effect on the combination. Therefore, clusters are combined based on their dense parts in the full-dimensional space, which enables the discovery of clusters with complex shapes.

Fig. 8
figure 8

The result of clustering after a one level of the divisive phase, b two levels of the divisive phase, c two levels of the divisive phase plus one level of the agglomerative phase with the proposed \({K}^{2}\)-nearest neighbor distance measure. The final result of the algorithm after two levels of the divisive phase and two levels of the agglomerative phase with d the proposed \({K}^{2}\)-nearest neighbor distance measure, e the single-link distance measure, f the full-link distance measure, g the centroid-link distance measure, h the k-centroid-link distance measure

Unlike most agglomerative hierarchical clustering algorithms, which merge only one pair of clusters at each level of the hierarchy, at each agglomerative level of the proposed algorithm every pair of clusters with a distance below the threshold \(D\) is merged. This characteristic of the agglomerative phase helps the algorithm converge to the optimal result after at most three levels of hierarchy. This advantage is achieved by using the novel \({K}^{2}\)-nearest neighbor distance measure and the distance threshold (\(D\)).

Figure 8 shows the result of the proposed algorithm on the Spiral dataset. For this dataset, the algorithm reaches its highest accuracy after two divisive levels (Fig. 8a and b). However, as the figures show, the number of generated clusters is larger than the real number. The second phase of the algorithm combines these clusters through two agglomerative levels (Fig. 8c and d). Three separated clusters are the final result of the proposed algorithm, and all the clustering measures reach unity.

To show the effectiveness of the proposed \({K}^{2}\)-nearest neighbor distance measure, the agglomerative part of the clustering on the Spiral dataset is also performed with other distance measures: single-link (Fig. 8e), full-link (Fig. 8f), centroid-link (Fig. 8g), and k-centroid-link (Fig. 8h). Agglomeration with the single-link distance measure combines the closest adjacent data; therefore, some parts of adjacent spirals are merged. The full-link distance measure behaves almost the same as the single-link on the Spiral dataset. The centroid-link measure fails on the Spiral dataset because the centroids of the spirals fall outside the clusters' zones. Although the k-centroid-link distance measure improves on the centroid-link measure, it also fails on the Spiral dataset due to its particular shape. As the simulation on the Spiral dataset shows, the proposed distance measure separates the Spiral dataset more accurately than the other distance measures (Fig. 8d).

4.1 Algorithm Phases

Having described the general functionality of the algorithm and its novelties, we now describe the pseudo-code of the HiDUALM algorithm (Fig. 9) in more detail. Since the algorithm has two major phases, each is described in a separate subsection.

Fig. 9
figure 9

a The pseudo-code of the first phase of the HiDUALM algorithm. b The pseudo-code of the second phase of the algorithm

4.1.1 Divisive Phase of the Algorithm

Step 1 in Fig. 9a: The first step of the divisive phase is quantization. The range of each of the \(n\) features is partitioned into \(m\) levels (\(m\) is the resolution), which requires less hardware for implementation and reduces simulation time. Therefore, each IDS unit is a vector with \(m\) elements, where the \(i\)th IDS-vector is represented by \(\left({d}_{i,1},\dots ,{d}_{i,m}\right)\) and \({d}_{i,k}\) is called the darkness value of the \(k\)th element of the vector. As described in Sect. 3, the darkness value is the result of aggregating the inks (fuzzy membership functions) of all data points on the IDS-vector. We also define the \(i\)th gridding space as \({G}_{i}=\{{g}_{i,1},{g}_{i,2},\dots ,{g}_{i,m}\}\), where the darkness value of \({g}_{i,k}\) is \({d}_{i,k}\), i.e., \(d\left({g}_{i,k}\right)={d}_{i,k}\).

To reduce the computational burden of the clustering problem, the algorithm quantizes the dataset into \(m\) quantization levels. The quantization is done by a function called the gridding function. For a dataset \(D\) with \(N\) samples and \(n\) features, the gridding function is defined as \(F=\left({F}_{1},\dots ,{F}_{n}\right),\) where

$${F}_{i}\left({x}_{i,j}\right)=\left[\frac{\left({x}_{i,j}-{\text{Min}}_{\text{i}}\right)\times \left(m-1\right)}{{\text{Max}}_{i}-{\text{Min}}_{i}}\right]+1,$$
(5)

where \({x}_{i,j}\) is the \(i\)th feature of the \(j\)th sample, and \({\text{Min}}_{i}\) is the minimum of the \(i\)th feature of the dataset \(D\), i.e., \({\text{Min}}_{i} = \min_{j = 1,...,N} x_{i,j}\). Similarly, the \({\text{Max}}_{i}\) is defined as: \({\text{Max}}_{i} = \max_{j = 1,...,N} x_{i,j}\).

After gridding, the \(j\)th sample of dataset \(D\), \({S}_{j}^{D}={\left({x}_{1,j},\dots ,{x}_{n,j}\right)}^{D}\) with \({x}_{i,j}\in {X}_{i}\), is quantized and represented by \({S}_{j}^{Q}={\left({q}_{1,j},\dots ,{q}_{n,j}\right)}^{Q}\), where \({q}_{i,j}={F}_{i}\left({x}_{i,j}\right)\), \({q}_{i,j}\in {G}_{i}\), \(j=1,\dots ,N\), and \(i=1,\dots ,n\). Therefore, the gridding function maps the \(D\) multiset to the \(Q\) multiset.

$${\left({x}_{1,j},\dots ,{x}_{n,j}\right)}^{D}\stackrel{F}{\to }{\left({q}_{1,j},\dots ,{q}_{n,j}\right)}^{Q}$$
(6)
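A minimal sketch of the gridding function in (5)-(6) is shown below; reading the bracket in (5) as the floor operation and the function name gridding are our assumptions:

```python
import numpy as np

def gridding(D, m):
    """Quantize each feature of dataset D (N x n) into m levels per Eq. (5).

    Returns an integer array Q of the same shape with values in 1..m."""
    D = np.asarray(D, dtype=float)
    mins, maxs = D.min(axis=0), D.max(axis=0)        # Min_i and Max_i per feature
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
    return (np.floor((D - mins) * (m - 1) / span) + 1).astype(int)

# a 4-sample, 2-feature toy dataset quantized into m = 5 levels
print(gridding([[0.0, 10.0], [2.5, 20.0], [5.0, 30.0], [10.0, 40.0]], m=5))
```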

The \(Q\) multiset can also be represented by \(n\) sub-multisets, each related to one feature. The sub-multiset related to \(i\)th feature is presented as:

$${Q}_{i}=\left\{{q}_{i,1},\dots ,{q}_{i,N}\right\}=<{Q}_{i}^{sup.},{f}_{{Q}_{i}}>$$
(7)

where \({Q}_{i}^{sup.}\) is its support set and \({f}_{{Q}_{i}}\) is the set of frequencies of the support-set elements (see (15)).

Step 2 in Fig. 9a: The next step is the parameter initialization of the algorithm. The first phase of the algorithm has one parameter, the ink radius (\(Ir\)), which determines the influence of a sample on its neighborhood. The second phase has three parameters: the maximum distance or threshold (\(D\)), \(K\), and \(MinCS\). The parameter \(D\) determines the maximum allowable distance between two clusters for them to be combined. The parameter \(K\) determines the number of selected nearest neighbors (the \(K\) nearest neighbors of one cluster to the \(K\) nearest samples of the other cluster are selected to compute the average distance). \(MinCS\) determines the minimum size of the clusters. The efficiency and accuracy of the algorithm depend highly on choosing proper values for these parameters; we analyze the algorithm's sensitivity to its parameters in Sect. 5.

Step 3 in Fig. 9a: In the next step, all sample labels are set to one, just as in other divisive hierarchical clustering algorithms, because there is no information about the actual classes at the beginning.

Step 4 in Fig. 9a: Initializing the iterator parameter.

Steps 5–20 in Fig. 9a: The divisive part of the algorithm continues until the number of clusters in two subsequent levels does not change, or the number of hierarchical levels exceeds the maximum number of levels.

Steps 6–17 in Fig. 9a: Iteration over all found clusters, to find sub-clusters.

Step 7 in Fig. 9a: Finding the members of each cluster. The data points with similar assigned labels belong to the same cluster.

Steps 8–15 in Fig. 9a: Iteration over all dimensions, to partition each dimension based on the pattern formed on its 1-D IDS-vector.

Steps 9–14 in Fig. 9a: These steps are the main part of the divisive phase, which performs the one-dimensional clustering. First, the ink drops associated with the \(i\)th dimension are allowed to spread on the IDS unit. The ink-drop pattern is a fuzzy membership function, which can be Gaussian, triangular, trapezoidal, or of another type; we use the Gaussian membership function in this paper. Aggregating the effects of all data points results in a vector that represents the density distribution of the samples in one feature space. Figure 10 shows an example of an IDS-vector as a bar graph.

Fig. 10
figure 10

An IDS-vector, which has some maxima and minima. The data points which are between two minima are assigned to a cluster

Step 9 in Fig. 9a: Spreading an ink drop at the coordinate \({q}_{i,j}\) can be expressed as in (8): the grids within the \(Ir\)-radius neighborhood are affected according to a distance function from the ink-drop center.

$$Ink\left({q}_{i,j}\right)=\left\{\left({g}_{i,k},{d}^{In{k}_{i,j}}\left({g}_{i,k}\right)\right)|{d}^{In{k}_{i,j}}\left({g}_{i,k}\right)=f\left(dist\left({g}_{i,k},{q}_{i,j}\right)\right),dist\left({g}_{i,k},{q}_{i,j}\right)\le \text{Ir}\right\},$$
(8)

where dist() is a distance function, \(dist:{G}_{i}\times {G}_{i}\to {R}^{+}\). For example, (9) describes the Euclidean distance.

$$dist\left({g}_{i,k},{q}_{i,j}\right)=\left|{g}_{i,k}-{q}_{i,j}\right|$$
(9)

Function \(f()\), \(f:{R}^{+}\to [\text{0,1}]\), computes a darkness value for each grid based on the distance to the center of the ink.

Step 10 in Fig. 9a: Finally, the darkness values resulting from all ink drops are aggregated by the IDS-operator. Different forms of IDS-operator can be used, such as max, sum, saturating-sum, or a combination of those. Aggregation with the sum operator for M data samples is obtained by (10).

$$\begin{aligned}{IDS\_vector}_{i}&=\sum_{j=1}^{M}Ink{\left({q}_{i,j}\right)}^{Q} \\ &=\left\{\left({g}_{i,k},d\left({g}_{i,k}\right)\right)| d\left({g}_{i,k}\right)=\sum_{j=1}^{M}d{\left({g}_{i,k}\right)}^{{Ink}_{j}} ,k=1,\dots ,m\right\}\end{aligned}$$
(10)

Step 11 in Fig. 9a: To simplify the following steps, the IDS-vector is normalized. In the aggregation of (10), the contribution of the \(j\)th ink drop to the darkness of grid \({g}_{i,k}\) is given by (11):

$$d{\left({g}_{i,k}\right)}^{{Ink}_{j}}=\left\{\begin{array}{c}{d}^{In{k}_{i,j}}\left({g}_{i,k}\right)\text{ if }\left({g}_{i,k},{d}^{In{k}_{i,j}}\left({g}_{i,k}\right)\right)\in Ink\left({q}_{i,j}\right) \\ 0\text{ Otherwise}\end{array}\right.$$
(11)

Step 12 in Fig. 9a: A thresholding operation is applied to all grids of the IDS-vector. In this step, to eliminate outliers, the darkness values of the grids that are less than the threshold are set to zero. After applying the threshold, the IDS-vector can be represented as in (12).

$${IDS\_vector}_{i}=\left\{\left({g}_{i,k},d\left({g}_{i,k}\right)\right)|Th\le d\left({g}_{i,k}\right)\le 1,k=1,\dots ,m\right\}$$
(12)

Steps 13 and 14 in Fig. 9a: The next step is partitioning the IDS-vector by finding its local minima. The data samples between two minima are considered one partition and are labeled with the same prime number. The first derivative of the vector (i.e., \(x^{\prime}\left[ n \right] = x\left[ n \right] - x\left[ {n - 1} \right]\)) is used to find the local minima of an IDS-vector: the local minima are the points where the derivative turns from negative to positive, or from negative to zero and then to positive. A sketch of this one-dimensional partitioning is given below.
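The sketch below illustrates Steps 9-14 for a single feature: Gaussian inks are spread and summed (Eqs. (8)-(10)), the resulting IDS-vector is normalized and thresholded (Steps 11-12), and the grid is partitioned at its local minima. The function names and the simplified minimum test are our assumptions:

```python
import numpy as np

def ids_vector_1d(q, m, ink_radius, th=0.05):
    """Steps 9-12 for one feature: spread a Gaussian ink for every quantized
    value in q on an m-element grid, sum, normalize, and threshold."""
    grid = np.arange(1, m + 1)
    ids = np.zeros(m)
    for center in q:
        dist = np.abs(grid - center)
        ink = np.exp(-0.5 * (dist / ink_radius) ** 2)    # Gaussian ink drop
        ink[dist > ink_radius] = 0.0                     # limit the ink to radius Ir
        ids += ink                                       # IDS-operator: sum
    ids /= ids.max() + 1e-12                             # normalization (Step 11)
    ids[ids < th] = 0.0                                  # thresholding (Step 12)
    return ids

def partition_by_minima(ids):
    """Steps 13-14: grid cells between two adjacent local minima form one partition."""
    deriv = np.diff(ids)                                 # x'[k] = x[k] - x[k-1]
    minima = [k for k in range(1, len(deriv))            # slope turns from <= 0 to > 0
              if deriv[k - 1] <= 0 < deriv[k]]
    cuts = [0] + minima + [len(ids)]
    labels = np.zeros(len(ids), dtype=int)
    for part, (lo, hi) in enumerate(zip(cuts[:-1], cuts[1:]), start=1):
        labels[lo:hi] = part
    return labels

# a data point with quantized value q gets the 1-D partition label labels[q - 1];
# each partition is then relabeled with a globally unique prime number (Fig. 6)
```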

Step 15 in Fig. 9a: End of iteration on all dimensions.

Step 16 in Fig. 9a: After clustering all of the one-dimensional IDS-vectors, the algorithm finds the final clusters by combining the one-dimensional clustering results. As a result of the one-dimensional clustering, each data point has at most \(n\) labels, one prime number per feature; however, due to the thresholding process, some labels may remain equal to one (the initial value). The final label of a sample is determined by multiplying its assigned labels, which are \(n\) prime numbers. Samples with the same final label are considered to belong to the same cluster in the entire feature space, because they are in the same partition in every dimension.

Step 17 in Fig. 9a: End of the iteration over all clusters.

Step 18 in Fig. 9a: Incrementing the level iterator for the hierarchical divisive-clustering levels.

Step 19 in Fig. 9a: Counting the number of clusters found after each level of the divisive phase of the algorithm.

Step 20 in Fig. 9a: The termination condition of the divisive phase of the algorithm.

4.1.2 Agglomerative Phase of the Algorithm

In the second phase of the algorithm, we first compute the \({K}^{2}\)-nn distance metric between every two clusters. To prevent the distance from being biased by one or two points, we select the \(K\) nearest neighbors of the \(K\) nearest-neighbor points for every two clusters and then compute the average of these \({K}^{2}\) distances. If the average distance between two clusters is smaller than or equal to \(D\), the two clusters are merged. This merging process is repeated over several agglomerative hierarchical levels. The last level of this phase is different: in it, the algorithm merges every cluster whose size (number of members) is less than \(MinCS\) into its nearest cluster, so the minimum allowable cluster size is \(MinCS\). A sketch of one agglomerative level is given below.
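The following sketch reuses a compact version of the \({K}^{2}\)-nn distance and merges clusters in connected-component fashion, which is one possible reading of "every two clusters with a distance below \(D\) are merged"; the helper names are our assumptions:

```python
import numpy as np

def k2nn_distance(X, Y, k):
    # K^2-nn distance (see the sketch in Sect. 2.3)
    A = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    k = min(k, *A.shape)
    xs = np.argsort(A.min(axis=1))[:k]
    ys = np.argsort(A.min(axis=0))[:k]
    return A[np.ix_(xs, ys)].mean()

def agglomerative_level(clusters, k, D):
    """One level of the second phase: merge every pair of clusters whose
    K^2-nn distance is <= D. clusters is a list of (points x features) arrays."""
    parent = list(range(len(clusters)))       # union-find style group labels
    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if k2nn_distance(clusters[i], clusters[j], k) <= D:
                parent[root(j)] = root(i)     # mark the two groups for merging
    groups = {}
    for i, c in enumerate(clusters):
        groups.setdefault(root(i), []).append(c)
    return [np.vstack(g) for g in groups.values()]
```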

Step 1 in Fig. 9b: Initializing the levels of the agglomerative phase.

Steps 2, 7 in Fig. 9b: Repetition of levels for the agglomerative phase of hierarchical clustering.

Step 3 in Fig. 9b: To compute the \({K}^{2}\)-nearest-neighbor distance measure of two clusters, we first compute the distance of every sample of the first cluster to every sample of the second cluster. Let \({X}_{1},{X}_{2}, \dots ,{X}_{b}\) be the samples of cluster 1 and \({Y}_{1},{Y}_{2},\dots ,{Y}_{g}\) be the samples of cluster 2, in the n-dimensional space; the distance matrix A is defined as:

$$A=\left[\begin{array}{cccc}{a}_{11}& {a}_{12}& \dots & {a}_{1g}\\ {a}_{21}& {a}_{22}& \dots & {a}_{2g}\\ \vdots & \vdots & \ddots & \vdots \\ {a}_{b1}& {a}_{b2}& \cdots & {a}_{bg}\end{array}\right]$$
(13)

where \(||.||\) denotes the Euclidean norm on \({R}^{n}\), and:

$${a}_{i,j}=d\left({X}_{i},{Y}_{j}\right)={\left|\left|{X}_{i}-{Y}_{j}\right|\right|}^{2}$$
(14)

Then, the K nearest neighbors of every sample of the first cluster to the second cluster are computed by sorting matrix A along its rows, which results in a distance matrix of size \(b\times k.\)

$${A}_{1}=\underset{\text{rows }1\text{ to }b}{\text{sort}}\left(A\right)$$
(15)

Then, the columns of the distance matrix are sorted in ascending order.

$${A}_{2}=\underset{\text{columns }1\text{ to }k}{\text{sort}}\left({A}_{1}\right)$$
(16)

Therefore, the first \(k\) rows of the matrix contain the K-nearest neighbors of K-nearest neighbors for two clusters.


Step 4 in Fig. 9b: After finding the \({K}^{2}\)-nearest-neighbor distances for every two clusters, the average value of these \({K}^{2}\) elements is computed as the distance measure between the two clusters.

The \({K}^{2}\)-nn distance is defined as:

$$d=\frac{1}{{K}^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k}d\left({X}_{i},{Y}_{j}\right).$$
(17)

where \({X}_{1},\dots ,{X}_{k}\) are the \(k\) samples of the first cluster that are closest to the samples of the second cluster, and \({Y}_{1},\dots ,{Y}_{k}\) are the \(k\) samples of the second cluster that are closest to the samples of the first cluster.
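The matrix-based computation of Eqs. (13)-(17) can be sketched as follows; plain Euclidean distances are used (as in Sect. 2.3), and NumPy sorts play the role of the row and column sorts in (15)-(16):

```python
import numpy as np

def k2nn_distance_by_sorting(X, Y, k):
    """K^2-nn distance following Eqs. (13)-(17)."""
    A = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)   # Eqs. (13)-(14)
    k = min(k, *A.shape)
    A1 = np.sort(A, axis=1)[:, :k]     # Eq. (15): k smallest distances in each row
    A2 = np.sort(A1, axis=0)[:k, :]    # Eq. (16): keep the k smallest rows per column
    return A2.mean()                   # Eq. (17): average of the K^2 selected distances
```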

Step 5 in Fig. 9b: Every two clusters with an average distance less than the threshold \(D\) are merged at a hierarchy level. Therefore, at each level of the agglomerative phase of the hierarchy, more than two clusters can be merged.

Step 6 in Fig. 9b: Incrementing the counter of the hierarchy levels.

Step 7 in Fig. 9b: The hierarchy levels continue until no merging happens in the last level or the number of levels exceeds the maximum.

Steps 8–13 in Fig. 9b: Iteration over the found clusters to count their members.

Steps 9–12 in Fig. 9b: Finding the clusters with fewer members than \(MinCS\), to merge them with their closest clusters.

Step 10 in Fig. 9b: Finding, by the \({K}^{2}\)-nn distance, the nearest cluster to a cluster whose number of members is less than \(MinCS\).

Step 11 in Fig. 9b: Merging the low-population cluster with its nearest cluster.

Step 12 in Fig. 9b: End of “if” condition.

Step 13 in Fig. 9b: End of iteration on clusters.

In summary, the proposed algorithm uses, in its first phase, an ensemble of one-dimensional clusterings based on the density of the data points, computed with IDS units; in its second phase, it uses full-dimensional agglomerative clustering with the Euclidean distance metric.

4.2 Time Complexity

In this section, the complexity of the proposed-clustering algorithm is computed by considering the complexity of each phase of the algorithm. In the first phase of the algorithm, at each hierarchy level, there are three computing parts. In the first part, ink drops for \(N\) data points on n dimensions are spread, which takes \(O(N\times n)\) time. In the second part m elements of the IDS-vectors are scanned and cutting points are found, which takes at most \(O(m\times n\times {c}_{L})\) time, where \({c}_{L}\) is the number of clusters at level \(L\). In the third part data labels are multiplied, which takes \(O(N\times (n-1))\) time. Therefore, the algorithm's complexity for each hierarchy level is \(O\left(2Nn+mn{c}_{\text{l}}-N\right),\) which can be approximated as \(O(N\times n)\). Considering L1 as the number of hierarchical levels of the first phase of the algorithm, the final complexity of the first phase of the algorithm will be in the order of \(O(N\times n\times L1)\), which indicates that the proposed algorithm has a linear relation concerning dataset size, feature size, and the number of levels at the first phase. In the second phase of the algorithm, we first compute the distance between every two points, which takes \(O\left({N}^{2}\right)\) time. Then, to aggregate the clusters, we should determine the distances between every two clusters. Suppose we have \(\text{c}\) clusters with the same number of data points \(\frac{N}{c}\). Therefore, for every two clusters, we have a distance matrix, M with size \(\frac{N}{c}\times \frac{N}{c}\). To find the K-nearest neighbors for each column of this matrix we need \(K\left(\frac{N}{c}-1-\frac{K}{2}\right)\approx \frac{KN}{c}\) time. Therefore, we need an overall complexity of \(O\left(\frac{{KN}^{2}}{{c}^{2}}\right)\) time for all \(\frac{N}{c}\) columns. To find the \({K}^{2}\)-nearest neighbors, we should sort the rows of the matrix \(\text{M}\) as well. For each row we need \(O\left(\frac{KN}{c}\right)\); therefore, the overall time complexity for \(K\) rows is \(O\left(\frac{{K}^{2}N}{c}\right)\). Finally, we need \({K}^{2}\) summations. Therefore, the overall time complexity of finding the distance between two clusters is \(O\left(\frac{{KN}^{2}}{{c}^{2}}+\frac{{K}^{2}N}{c}+{K}^{2}\right)\). Since we have \(\text{c}\) clusters, we need \(\frac{c(c-1)}{2}\) of the above computations. Consequently, the overall time complexity of one level of this algorithm phase is \(O\left(\frac{c(c-1)}{2}\times \left(\frac{{KN}^{2}}{{c}^{2}}+\frac{{K}^{2}N}{c}+{K}^{2}\right)\right)\). However, when \(K\) is bigger than the number of data points in clusters \(K>\frac{N}{c}\), there is no need to sort the distances. Therefore, the amount of required time decreases to \(O\left(\frac{c(c-1)}{2}\times \frac{{N}^{2}}{{c}^{2}}\right)\). As a result, the overall time complexity of the second phase of the algorithm is \(O\left({N}^{2}+\sum_{l=1}^{L2}\left(\frac{{c}_{\text{l}}\left({c}_{\text{l}}-1\right)}{2}\times \left(\frac{{KN}^{2}}{{c}_{l}^{2}}+\frac{{K}^{2}N}{{c}_{\text{l}}}+{K}^{2}\right)\right)\right)\), where \({c}_{\text{l}}\) is the number of clusters at each level and \(L2\) is the maximum number of levels of the second phase of the algorithm. 
The overall complexity of the algorithm is the sum of the two phases' complexities, i.e., \(O\left(N\times n\times L1\right)+O\left({N}^{2}+\sum_{l=1}^{L2}\left(\frac{{c}_{l}\left({c}_{l}-1\right)}{2}\times \left(\frac{K{N}^{2}}{{c}_{l}^{2}}+\frac{{K}^{2}N}{{c}_{l}}+{K}^{2}\right)\right)\right)\).
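To make the second-phase cost concrete, the sketch below gives one plausible Python reading of the \(K^2\)-nearest-neighbor inter-cluster distance analyzed above: the pairwise distance matrix between two clusters is formed, the \(K\) smallest entries of each column are kept, and the mean of the \(K^2\) smallest surviving entries is returned. The function name and the use of the mean (rather than a plain sum) over the \(K^2\) values are our own illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def k2_nn_cluster_distance(A, B, K):
    """Illustrative K^2-nearest-neighbor distance between clusters A and B.

    A, B : arrays of shape (n_a, d) and (n_b, d).
    K    : neighborhood size; if K exceeds the cluster sizes, all
           pairwise distances are used (no selection is needed).
    """
    # Full pairwise Euclidean distance matrix M (size n_a x n_b).
    M = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

    k_col = min(K, M.shape[0])          # neighbors kept per column
    # Keep the k_col smallest distances in every column of M.
    nearest_per_col = np.partition(M, k_col - 1, axis=0)[:k_col, :]

    k2 = min(K * K, nearest_per_col.size)
    # Among the surviving entries, keep the K^2 smallest and average them.
    smallest_k2 = np.partition(nearest_per_col.ravel(), k2 - 1)[:k2]
    return smallest_k2.mean()
```

Under this reading, each pair of clusters touches every entry of \(M\) once, which matches the \(O\left(\frac{{N}^{2}}{{c}^{2}}\right)\)-per-pair term in the analysis above.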

5 Experiments

In this section, the parameter sensitivity, noise immunity, and clustering quality of the algorithm are analyzed through several experiments. The algorithm’s sensitivity to its parameters is examined on a dataset under different parameter configurations. To analyze clustering quality, five clustering quality measures are used: Rand Index (RI), Adjusted Rand Index (ARI), F-measure, Adjusted Mutual Information (AMI), and Accuracy. The algorithm is also compared with seven efficient high-dimensional clustering algorithms described in the related works: PROCLUS, CLIQUE, DOC, kMeans-projection, SC-SRGF, MV_RTSC, and IRFLLRR. We selected these algorithms because their source code is publicly available. The first three algorithms are run in the WEKA software, and the other four are simulated in MATLAB. Table 1 shows the characteristics of the real-world datasets used in this section.

Table 1 The characteristics of datasets

The accuracy measure used in our experiments is the same as the definition of precision in [11], where the homogeneity of clusters is measured with respect to the ground truth of the datasets.
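As a rough illustration of this homogeneity-based accuracy, the sketch below computes, for each predicted cluster, the number of its points that belong to the cluster's majority ground-truth class and divides the total by the dataset size. This purity-style reading is our assumption about the definition in [11], not a verbatim reproduction of it.

```python
import numpy as np

def cluster_homogeneity_accuracy(labels_pred, labels_true):
    """Purity-style accuracy: size-weighted majority-class fraction per cluster."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    correct = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        # Count how many members belong to the cluster's dominant true class.
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / len(labels_true)
```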

5.1 Parameter Sensitivity Analysis

In this section, we analyze the parameter sensitivity of the proposed algorithm. Since the algorithm has two phases, we first analyze the sensitivity of the first phase, which has one parameter, \(Ir\), and then the second phase, which has three parameters: \(K\), \(D\), and \(MinCS\). We use the Seeds dataset, which has 210 data points, seven dimensions, and three classes, and record the ARI validation index and the number of clusters to describe the behavior of the algorithm.

Figure 11a and b shows the ARI evaluation index and the number of clusters vs. the \(Ir\) parameter for four levels of the first phase of the algorithm, run on the Seeds dataset. As Fig. 11a shows, the ARI value increases with the \(Ir\) parameter until it reaches its maximum; after that, increasing \(Ir\) decreases the ARI value. The reason is that when \(Ir\) is very small, a large number of clusters are created (Fig. 11b), so the ARI value drops. On the other hand, when \(Ir\) is increased to large values, the clusters are no longer divided into sub-clusters, so the ARI value decreases again. As Fig. 11a and b shows, the results of Level 3 and Level 4 are almost the same, which happens for most datasets; therefore, three levels are usually sufficient for the first phase of the algorithm, and for some datasets two levels are enough.

Fig. 11 The sensitivity of four levels of the first phase of the algorithm to the parameter \(Ir\) for the Seeds dataset. a The ARI vs. Ir. b The number of clusters vs. Ir
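A sweep of the kind behind Fig. 11 can be reproduced in a few lines of Python. In the sketch below, `run_phase1` is a hypothetical stand-in for the first phase of HiDUALM (it is not part of any published package); the loop simply records the ARI and the number of clusters for a grid of \(Ir\) values.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def sweep_ir(X, y_true, run_phase1, ir_values=np.linspace(0.05, 0.5, 10)):
    """Record ARI and cluster count for each Ir value.

    Assumes run_phase1(X, Ir=...) -> array of cluster labels (hypothetical).
    """
    results = []
    for ir in ir_values:
        labels = run_phase1(X, Ir=ir)          # hypothetical phase-1 call
        results.append((ir,
                        adjusted_rand_score(y_true, labels),
                        len(np.unique(labels))))
    return results
```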

The second phase of the algorithm has three parameters: \(K\), \(D\), and \(MinCS\). Our simulations show that two or three levels are needed for this phase. Figure 12a and b shows the sensitivity of the algorithm to the parameter \(K\) after running three levels of phase 2, where \(D=0.15\). Figure 12a shows that the ARI value is very small when \(K\) is small because, as Fig. 12b shows, most of the clusters are merged at small values of \(K\). Selecting a proper value for \(K\) depends crucially on the sizes of the dataset and its clusters. However, for most datasets, selecting “K = size of the dataset/2” leads to good clustering results.

Fig. 12 The sensitivity of the algorithm to the parameter K after running three levels of phase 1 and three levels of phase 2 on the Seeds dataset, while D = 0.15 and MinCS = 0. a The ARI vs. K. b The number of clusters vs. K

Figure 13a and b shows the algorithm’s sensitivity to the \(D\) parameter for different pairs of \(K\) and \(Ir\) parameters. Figure 13a shows that increasing \(D\) increases the ARI validation index, because for small \(D\) values the clusters cannot be merged; therefore, as Fig. 13b shows, the number of clusters is large. As \(D\) increases, clusters can be combined, so the ARI value increases. However, increasing \(D\) too much causes too many clusters to merge, and, as Fig. 13b shows, the number of clusters eventually drops to one. Our experiments on different datasets show that selecting the \(D\) parameter around \(0.15\pm 0.05\) is a good choice for most datasets.

Fig. 13 Sensitivity of the algorithm to the D parameter for different pairs of K and Ir parameters after running three levels of phase 1 and three levels of phase 2 on the Seeds dataset, while MinCS = 0. a The ARI vs. D. b The number of clusters vs. D

Finally, Fig. 14 shows the algorithm’s sensitivity to the parameter \(MinCS\). As Fig. 14b shows, there are plenty of clusters containing fewer than ten data points. The parameter \(MinCS\) therefore helps the algorithm force these sparsely populated clusters to merge in the last level of the agglomerative phase. Our simulations show that a value in the range of 5% to 10% of the dataset size is a good choice; however, if some information about the cluster sizes is available, this parameter can be selected more judiciously. In addition, this parameter can be used in the first phase of the algorithm, in which clusters smaller than \(MinCS\) are not split further.

Fig. 14 The sensitivity of the algorithm to the parameter MinCS after running three levels of phase 1 and three levels of phase 2 on the Seeds dataset. a The ARI vs. MinCS. b The number of clusters vs. MinCS

In summary, the \(Ir\) parameter should be less than \(0.3\); for the \(K\) parameter, selecting “K = size of the dataset/2” leads to good clustering results on most datasets; for the \(D\) parameter, a value around \(0.15\pm 0.05\) is a good choice; and the \(MinCS\) parameter should be 5–10% of the dataset size.
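These heuristics can be collected in one place. The helper below is only a convenience sketch encoding the rules of thumb above; the parameter names mirror those of the paper, but the function itself is our own illustration, not part of the algorithm.

```python
def default_parameters(n_points, ir=0.2):
    """Heuristic HiDUALM defaults derived from the sensitivity study above."""
    assert ir < 0.3, "Ir should stay below 0.3"
    return {
        "Ir": ir,                        # spread radius, kept below 0.3
        "K": n_points // 2,              # K = dataset size / 2
        "D": 0.15,                       # merge threshold, 0.15 +/- 0.05
        "MinCS": int(0.05 * n_points),   # 5-10% of the dataset size
    }

# Example: defaults for the Seeds dataset (210 points).
print(default_parameters(210))
```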

5.2 Noise and Outlier Immunity

Defining a threshold parameter in the first phase of the algorithm helps specify cluster boundaries, reject outliers, and eliminate noise. Data points detected as outliers in all features (i.e., in all IDS vectors) are reported as outliers. To investigate the ability of the proposed algorithm to detect outliers and noise, we ran the algorithm on different datasets into which white noise equal to 100% of the dataset size was injected. Moreover, the sensitivity of the proposed algorithm to the threshold parameter is analyzed.
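The outlier rule above (a point is an outlier only if it looks like one in every IDS vector) can be sketched as follows. Here a simple histogram plays the role of the per-dimension IDS density, so the snippet is an approximation of the idea rather than the algorithm's own implementation, and `flag_outliers` is a name we introduce for illustration.

```python
import numpy as np

def flag_outliers(X, threshold, bins=32):
    """Flag points whose 1-D density falls below `threshold` in ALL dimensions."""
    N, n = X.shape
    low_density = np.zeros((N, n), dtype=bool)
    for j in range(n):
        # Histogram as a crude stand-in for the IDS density along dimension j.
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)
        low_density[:, j] = hist[idx] < threshold
    # Outliers must look sparse in every single dimension.
    return low_density.all(axis=1)
```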

Figure 15 shows the sensitivity of the proposed algorithm to the threshold parameter. Figure 15a shows that as the threshold increases, the AMI index gradually increases until it reaches its maximum value and then falls rapidly. To study the reason, the percentage of detected noise (True Negatives) and the percentage of False Negatives are plotted against the threshold value. Figure 15b shows that increasing the threshold increases the percentage of detected noise. Figure 15c shows that as the threshold increases, the percentage of False Negatives remains almost steady; however, beyond a certain point, it increases rapidly. Consequently, the increase in False Negatives results in a decreased AMI evaluation index.

Fig. 15 The sensitivity of HiDUALM to the noise threshold, after one level of the first phase, while Ir = 0.2. a The ARI vs. threshold value. b The percentage of detected noise vs. threshold value. c The percentage of false negatives vs. threshold value

5.3 Clustering Quality

In this section, we examine the clustering quality of HiDUALM in terms of F-measure, Adjusted Mutual Information (AMI), Rand Index (RI), Adjusted Rand Index (ARI), and Accuracy [68,69,70]. Since internal clustering validation indices are biased toward the assumptions of the clustering method underlying each index, validation by these indices is not fair. Therefore, we use datasets with known ground truth and external validation indices for a fair comparison between clustering algorithms. The proposed algorithm is compared with the PROCLUS, DOC, CLIQUE, kMeans-projection, SC-SRGF, MV_RTSC, and IRFLLRR high-dimensional clustering algorithms. The characteristics of the real-world datasets used in this part are summarized in Table 1.
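For reference, most of these external indices are available off the shelf; the short snippet below computes RI, ARI, and AMI with scikit-learn (the F-measure and the homogeneity-based accuracy are omitted here for brevity).

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score, rand_score)

def external_indices(labels_true, labels_pred):
    """External validation indices used in the comparison (subset shown)."""
    return {
        "RI":  rand_score(labels_true, labels_pred),
        "ARI": adjusted_rand_score(labels_true, labels_pred),
        "AMI": adjusted_mutual_info_score(labels_true, labels_pred),
    }
```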

The results of the experiments are summarized in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11, which report the five clustering quality measures and the parameter settings of the algorithms. In our experiments, each algorithm is run with different parameter settings, and the best resulting evaluation indices are reported. The Proclus, DOC, and CLIQUE algorithms produce different results in different runs even with constant parameters, so we ran each of them more than 50 times on each dataset and reported the best resulting quality measures.

Table 2 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC, and IRFLLRR on Urban_Land_Cover dataset
Table 3 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC, and IRFLLRR on Penbased dataset
Table 4 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on Wdbc dataset
Table 5 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on Madelon dataset
Table 6 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on Satimage dataset
Table 7 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on Facial_expression dataset
Table 8 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on waveform dataset
Table 9 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on Yale dataset
Table 10 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC and IRFLLRR on MNIST dataset
Table 11 Comparison of HiDUALM, Proclus, CLIQUE, DOC, kMeansProj clustering, SC-SRGF, MV_RTSC, and IRFLLRR on COIL100 dataset

The parameter settings of the algorithms for each dataset are shown in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11. For the proposed algorithm, the first phase has two levels in all experiments, and the second phase has four. The value of \(K\) is half of the total number of data points of the datasets. The value of \(MinCS\) is 20 for all datasets except for the waveform, which is 500.

To summarize Tables 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11, two methods have been used: the first calculates the average value of each quality measure over all datasets (Table 12), and the second reports the percentage of cases in which the proposed algorithm provides a better answer than each of the other algorithms (Table 13).

Table 12 Average parameters over all datasets for each algorithm
Table 13 The percentage of cases where the proposed algorithm has provided a better answer than others

From Table 12, it is observed that, on average, our algorithm achieves the best performance among the compared algorithms. For example, with respect to the accuracy measure, the proposed algorithm performs 15% better than Proclus, 26% better than CLIQUE, 6% better than DOC, 16% better than kMeansProj, 20% better than SC-SRGF, 29% better than MV_RTSC, and 14% better than IRFLLRR.

As Table 13 shows, the proposed clustering algorithm attains better clustering quality measures in most cases. For example, based on the ARI index, the proposed algorithm shows better clustering results than Proclus in 80% of cases, than CLIQUE in 100% of cases, than DOC in 90% of cases, than kMeansProj in 100% of cases, than SC-SRGF in 90% of cases, than MV_RTSC in 90% of cases, and than IRFLLRR in 70% of cases.

6 Conclusion

In this paper, a novel hierarchical high-dimensional clustering algorithm, HiDUALM, which has both divisive and agglomerative hierarchical clustering phases, was introduced. The divisive phase is based on an ensemble of projected clustering algorithms, and a novel method based on labeling data points with prime numbers is proposed to combine the results of the ensembles. The divisive phase is carried out by a hierarchical zooming process, which searches for sub-clusters within already-found clusters until no further sub-clusters are found. The second phase of the algorithm is agglomerative: the clusters composed in the first phase are combined based on the novel \({K}^{2}\)-nearest neighbor algorithm.

The ability and efficiency of the proposed algorithm are confirmed by the simulation results, and the parameter sensitivity analysis is likewise carried out by simulation. As our simulations show, HiDUALM can eliminate noise and outliers as well.

The agglomerative phase of the algorithm combines clusters that are near each other in full-dimensional space. This phase has two main advantages. First, since the algorithm uses one-dimensional clustering in the first phase, there may be clusters that are near each other in full-dimensional space but not in all of the one-dimensional spaces; these clusters are combined in this phase of the algorithm. Second, the number of clusters formed by the first phase may vary depending on the \(Ir\) parameter and the distribution of data points; sometimes many clusters are formed, each containing only a few data points. The agglomerative phase combines these low-populated clusters using the Euclidean distance metric. Consequently, the first hierarchical phase of the algorithm composes the clusters with a unique approach and distance measure, and the second hierarchical phase then combines some of the clusters with another approach and distance measure to improve the clustering quality.

The first phase of the proposed algorithm is an ensemble clustering algorithm that uses 1-D clustering to generate initial clusters and produce diversity. Using 1-D clustering reduces the computational burden and time complexity of clustering high-dimensional datasets. HiDUALM also proposes a novel consensus function: data points are labeled with prime numbers in each 1-D clustering, and the final clusters are found by multiplying the labels, as illustrated below. This method resolves the label-correspondence difficulties of ensemble clustering.
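The following is a minimal sketch of this prime-number consensus idea, assuming each dimension contributes one 1-D cluster label per point: every 1-D cluster is assigned a distinct prime, each point's per-dimension primes are multiplied, and points sharing the same product end up in the same final cluster. The helper name and the use of `sympy.prime` are illustrative choices, not the paper's implementation.

```python
from sympy import prime  # prime(n) returns the n-th prime; any generator would do

def prime_consensus(one_d_labels):
    """Combine 1-D clusterings by multiplying prime labels.

    one_d_labels : list of per-dimension label sequences, each of length N.
    Returns final cluster labels of length N.
    """
    n_points = len(one_d_labels[0])
    products = [1] * n_points
    next_prime_index = 1
    for labels in one_d_labels:                 # one entry per dimension
        # Give every distinct 1-D cluster in this dimension its own prime.
        prime_of = {}
        for c in sorted(set(labels)):
            prime_of[c] = prime(next_prime_index)
            next_prime_index += 1
        for i, c in enumerate(labels):
            products[i] *= prime_of[c]
    # Equal products <=> identical 1-D cluster membership in every dimension.
    product_to_cluster = {p: k for k, p in enumerate(sorted(set(products)))}
    return [product_to_cluster[p] for p in products]

# Example: two dimensions, four points -> four distinct final clusters.
print(prime_consensus([[0, 0, 1, 1], [0, 1, 1, 0]]))  # e.g. [0, 1, 3, 2]
```

Because prime factorization is unique, two points receive the same product exactly when they share the same 1-D cluster in every dimension, which is what makes this consensus function work without any explicit label alignment.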

According to the simulation results on different datasets and various high-dimensional clustering algorithms, on average, the proposed algorithm:

  • improves the accuracy of the clustering results by 6% compared to the best competing algorithm;

  • improves the F-measure by 14% compared to the best competing algorithm;

  • improves the AMI index by 5% compared to the best competing algorithm;

  • improves the ARI by 14% compared to the best competing algorithm;

  • improves the RI by 7% compared to the best competing algorithm.

In future work, we will focus on the hardware implementation of the proposed algorithm. The proposed algorithm can also be used to improve the ALM algorithm for solving regression problems by producing better initial partitions.