1 Introduction

In 2021, the size of information that was “created, captured, copied, and consumed worldwide” accounted for 79 zettabytes and will grow to 181 zettabytes in 2025.Footnote 1 “Information visualization can accelerate perception, provide insight and control, and harness this flood of valuable data to gain a competitive advantage in making business decisions” (Al-Kassab et al., 2014). One way to visualize information is to enrich an analysis model by assigning colors to different elements of the model where the basic idea originates from cartography (Coulson, 1987). An example for such analysis models is process models that describe the process logic of business processes, e.g., a patient treatment or manufacturing process. Then, for example, the service time or the frequency of an activity representing a node can be mapped to different colors (van der Aalst, 2016). In order to assign the colors, often clustering is employed as pre-processing step, i.e., the data is clustered and colors the area assigned to the resulting clusters (Jiang, 2013).

Process models can be designed by domain experts or discovered from process execution data stored in process event logs, based on process mining techniques (van der Aalst, 2016). Process mining constitutes one of the key technologies for digital transformation (Reinkemeyer, 2022). One example use case is to mine process models from an information system, containing a keyword-based search functionality. Each search term appears several times in the event log and represents a node in the finally mined process model. The resulting process model is depicted in Fig. 1 by using the process mining software DISCOFootnote 2 on Data Set 1 which is further described in Appendix 1.

Fig. 1
figure 1

Mined process model using DISCO on Data Set 1 containing frequencies of activities (search terms)

The search term * is represented by one node in the discovered process model and is marked as the most frequent term in the process model with a frequency of 2.887. The second most frequent search term has a frequency of 90. Both frequencies 2.887 and 90 are unique for the data set. By contrast, 7 different search terms have a frequency of 3 and 6 different search terms have a frequency of 4 where the frequencies of 3 and 4 account for the lowest frequencies in the data set. For all 33 search terms, 17 different frequencies have been observed. Overall, the data set, as for many real-world applications, contains huge gaps between data points, i.e., the frequencies of the search terms. As depicted in Fig. 1, the process mining software DISCO maps the frequencies to colors of the process model nodes which range from dark blue to light gray. As stated in van der Aalst (2016), a process model can become more meaningful by mapping colors to its nodes. However, in this particular process model, the dominating search term * has such a high frequency that as a consequence, less frequent search terms appear all in the same color, i.e., light gray. This prevents differentiation between nodes such as “sport” with a frequency of 48 and “sport familie” with a frequency of 3. Therefore, not much information gain can be achieved through the usage of assigned colors.

Hence, it is of utmost importance to think about how to assign colors to the nodes in order to maximize the information gain of the visualization, even in the presence of “dominating” data points. Of particular interest is also the analysis of paths through the process model, including the happy flow of the process, which denotes the path taken by an average trace (Leemans et al., 2014) or more “exceptional” paths with particularly high/low frequencies of their nodes.

For the visualization of analysis models, often clustering is employed as pre-processing step (Jiang, 2013). Colors are then assigned to the clusters. If, for example in the process mining case (cf. Fig. 1), 10 colors are available for assignment to the frequencies and each frequency is assigned to one of these 10 colors, each color can be considered as a cluster. In order to meet these requirements, the clustering should be performed by considering a low variance inside each cluster and a wide distance between nearby clusters and at the same time an even distribution of data over all clusters. Without the latter, especially for real-world data sets, many clusters will be sparse due to outliers, tailed or non-uniform distributions. As a consequence, results can become uniform and colors assigned to clusters do not show much additional information when, e.g., investigating a particular path. However, there is a trade-off between a low variance inside each cluster with a wide distance of nearby clusters, and an even distribution of all clusters. This trade-off is investigated in this paper. In the following, possible approaches for tackling the problem are discussed.

At first, one could argue to classify the frequencies connected to nodes with unclassed colors. Doing so for each frequency a different color is assigned by using a color gradient which has the advantage of a “raw accuracy.” Exactly these discussions between unclassed and classed colors arose already in cartography when generating choropleth maps (Tobler, 1973), which is quite similar to the visualization problem as discussed here on process models. While unclassed colors have the advantage, as mentioned before, of a “raw accuracy,” classed colors are easier to process for humans. This is due to the few number of distinct colors to recognize, which helps to reduce the cognitive load by using a legend that lists the ranges of values corresponding to each color (Dobson, 1973; 1980). A naive pre-processing approach to generate classed colors would be to slice the frequencies into m equal intervals, where m is the targeted number of clusters. This approach is also used when creating histograms, where each class interval has the same width. While being simple and transparent, the approach might result in sparse or even empty classes, if the data set is not uniformly distributed or contains huge gaps, leading to show just two colors in extreme cases.

Another approach would be to use quantiles (n/m, where n is the number of data points) as classification method where the number of data points (aka frequencies) in each class is roughly equal, which overcomes the problem of equal intervals. However, this approach has many disadvantages. While it maximizes an even distribution of data into clusters, it does not consider any clustering by minimizing the variance inside each cluster and maximizing distances between clusters which may lead to problems of interpretation of colors in models. Another problem arises when there are many identical data points in the data set, which can lead to ambiguous classes.Footnote 3

A more sophisticated approach on pre-processing, which comes from cartography and is well established there, is to use the algorithm called Jenks natural breaks, which produces a classification for a pre-defined number of classes that can be mapped to colors, by minimizing the variation within each class (Jenks, 1967). A result by using this algorithm on Data Set 1 is shown in Fig. 2. Here the blue frame shows the same activities as shown in Fig. 1, but in contrast, it becomes obvious that the three nodes in the process path indicate different frequencies.

Fig. 2
figure 2

Mined process model from Data Set 1 containing frequencies of activities (search terms) assigned to colors through pre-processing with Jenks natural breaks

The process model depicted in Fig. 2 conveys more information than the process model depicted in Fig. 1. Nevertheless, Jenks natural breaks is not optimized for considering an even distribution of the data points over all clusters, which helps to show differences between frequent elements in the process model, for example between frequencies 3 and 4. When discovering particular paths as shown in the red frames, such a distinction is useful to recognize differences in nodes and in further consequence, subpaths with splits at first glance. Also, a happy flow is hard to discover and “exceptional” paths with particular high/low frequencies of their nodes can be discovered, but not all related paths are highlighted in the process model. Another problem can be discovered from the legend in Fig. 2: Jenks natural breaks produces sparse classes (cf. classes with number 4–6, count only one data point), while putting the main outlier (i.e., the search term with a frequency 2.887) into the same class (cf. class with number 1) as the second most frequent search term (frequency 90) which both occurred just once in the whole process model.

Another use case is to divide students into equal learning groups in relation to an equal number of students and an equal level of knowledge in each group. Consider, for example, a previously executed test with 300 students, where 0–200 points were possible to reach. The test was performed at the beginning of a semester and the goal is to bring all students to a uniform level of knowledge. There are 6 teachers available to support the students during the semester, where each teacher teaches in one of 6 classes. Ideally, an equal number of students and an equal level of knowledge of students in each group is aspired. This problem can be solved by an approach that favors an even distribution in contrast to an equal level of knowledge.

In the following, the mentioned requirements and desired characteristics of the clustering algorithm for pre-processing to be designed and realized in this work are summarized. The algorithm is supposed to

1. Handle one-dimensional data sets

2. Handle negative and floating point data points

3. Handle non-unique data points in data sets

4. Define a targeted number of clusters (which can also be less), e.g., to map them with

predefined colors

5. Avoid overlapping cluster ranges

6. Avoid filtering/omitting of elements by the cluster algorithm because it already

performed through pre-processing steps like outlier detection if desired for a

particular use case

7. Reproduce the same clusters after each execution on the same data set, i.e., stable

results of clusters

8. Work on real-world data sets, e.g., data sets containing gaps between data points

9. Produce fair results with classical clustering metrics such as low sum of variances

(SV) from all clusters and a good score of mean silhouette coefficient (MSC)

10. Result in an even distribution of data points into clusters

The requirements and characteristics can be formulated into the following problem:

Problem 1

Given a set D \(\subseteq \mathbb {R}\) of (one-dimensional) data points and a targeted number of clusters M ≥ 1, where m is the number of actually built clusters, 1 ≤ mM.

Find clustering Cluster:\(D \mapsto \mathcal {C} =<C_{1}, ..., C_{m}>\) with \(C_{i}\subseteq D\), \(\dot \bigcup _{i} C_{i} = D\) for i = 1,...,m with \(\forall C_{i} \in \mathcal {C}\):

  • |Ci| converges to \(\frac {|D|}{m}\) (Requirement 1) and ideally M clusters are built

  • Minimize variance of Ci (Requirement 2)

  • \(\forall C_{j} \in \mathcal {C}, C_{i} \neq C_{j}:\) maximize distance between Ci and Cj (Requirement 3)

  • Requirement 1 is prioritized over Requirement 2 and Requirement 3.

Problem 1 requires a given number of M clusters to be populated with data points under Requirements 1, 2, and 3. M is assumed to be determined based on the application, e.g., number of colors to be used for data visualization. 1 is measured with the metric SED (score even distribution) which is introduced in this paper. Similar to existing algorithms such as Jenks natural breaks (Jenks, 1967), Requirements 2 and 3 aim at finding low variance clusters which are separated from each other. And basically, cluster algorithms which produce low variance clusters do not intend to produce an even distribution of data points into clusters (Requirement 1), see Shapiro (2005). Unlike Jenks natural breaks, we do not consider the maximization of distances between the clusters (Requirement 3), in favor of achieving an equal distribution of the data points over the clusters (Requirement 1). This “forces” every cluster to be populated and avoiding the over-representation of frequent data or outliers at the same time. As discussed before, there might be a trade-off between Requirement 1 on the one side and Requirements 2 and 3 on the other side as filling all clusters might take a toll on cluster variance. This work addresses Problem 1 based on the following research questions:

  1. RQ1

    How to design a (heuristic) clustering algorithm to evenly distribute 1d data points into a maximum number of low variance clusters?

  2. RQ2

    On which underlying data distributions and real-world data sets does the algorithm perform most effectively?

  3. RQ3

    How does the algorithm support data visualization?

To tackle RQ1–RQ3, this paper presents the heuristic 1d distribution cluster algorithm (DDCAL) that aims at balancing Requirements 1, 2, and 3 for Problem 1. The idea of DDCAL in a nutshell is to use an iterative approach by using the feature scaling method min-max normalization or also known as rescaling, for normalizing a sorted list of one-dimensional data points and to compare the results against defined boundaries from a set list. From each boundary, outliers from the upper or lower bound are considered as new cluster if the quantity of elements of a potential cluster is inside a given tolerance factor. Otherwise, the next boundary from the list is tested or if all boundaries were tested, the tolerance factor is increased and the testing of boundaries starts again. In other words, in every iteration step, the lower or upper quantity of outliers from a boundary, which is inside a given tolerance factor, that is used to support an even distribution of elements over all clusters, is chosen for building a new cluster.

In order to evaluate the effectiveness of the DDCAL algorithms and to compare them to existing algorithms, we use four quality metrics. We first analyze the results of the DDCAL algorithm in comparison to k-means++, Jenks natural breaks, head/tail breaks, DBSCAN, kernel density estimation (KDE), Gaussian mixture model (GMM), mean shift, largest gaps, Gaussian kernel k-means, k-medoids, and trimmed k-means on five artificial data sets, reflecting a selection of common distributions such as the normal and uniform distribution. The DDCAL algorithm is then compared to the before mentioned cluster algorithms based on 4 real-world data sets, i.e., process mining, weather, star distances, and population data, with respect to quality metrics. Finally, the applicability of the DDCAL algorithm for data visualization is demonstrated for process mining, US population, and Corona pandemic data sets. The real-world data sets were chosen because of their different data structures according to different distributions and different number of data points. Based on this, they illustrate the application possibilities of the DDCAL algorithm on different use cases.

Overall, DDCAL achieves better results than existing algorithms with respect to quality metrics and data visualizations on data sets with gaps, i.e., outliers, and tails. Moreover, the DDCAL algorithm yields promising results for data which contain just one peak with a distribution that looks like a bell curve (cf. normal/gumbel distribution). The results are even good if the number of peaks is equal or higher than M. The latter is reflected, for example, by the results for data with a uniform distribution.

The paper is structured as follows: Section 2 presents the DDCAL algorithm as well as the quality metrics. Section 3 comments on the implementation of the different algorithms and evaluates DDCAL on different data sets, which are real-world and synthetic data sets. The evaluation is compared with existing algorithms. Furthermore, different parameter setups on DDCAL are evaluated to demonstrate their implications on, e.g., different distributions. Section 4 shows applications from visualization of process mining results and maps. In Section 5, we discuss related work. Section 6 concludes the paper and discusses ongoing and future work. Appendix 1 details the data sets and Appendix 2 the implementation of the algorithms.

2 1D Distribution Cluster Algorithm (DDCAL)

This section provides the DDCAL algorithm, and metrics for assessing the quality of the clustering with respect to Problem 1.


Given a set of 1d data points, the idea of DDCAL to tackle Problem 1 is as follows:

In the first step, the data points are sorted in ascending order. In a second step, the list of data points is copied and normalized by using the feature scaling method min-max normalization as described in Section 1. A list of boundaries to test outliers of clusters is defined. Boundaries from this list, starting with the lowest, are tested dynamically in each iteration where outliers from the upper and lower boundary are added to potential clusters. On these potential clusters, neighboring elements are added if the standard deviation inside the potential cluster decreases. If both potential clusters are above a set minimum quantity of elements — which is calculated from the combination of elements to even distribute over all remaining clusters and a set tolerance factor — the cluster with the least difference to the minimum quantity is actually built. If only one potential cluster or even no potential cluster is above the set minimum quantity of elements, the next boundary is tested and if all boundaries were tested without success of a new built cluster, the tolerance factor is increased and the set boundaries from the list, starting first with the lowest, are tested again as described before. Once a cluster is built, only the remaining empty clusters can be filled. The algorithm terminates if no empty cluster is left (i.e., of the maximum number of clusters M) or all data points are already assigned to clusters.

The pseudo code of DDCAL is provided in Algorithm 1 and explained in the following: The input data is an unsorted one-dimensional array of data points dD. The output clustered_results is a list containing all data points from the input where each element consists of a data point d with an assigned cluster number. The algorithm has six parameters: (1) M contains the targeted number of clusters, e.g., 10 for 10 clusters to build. (2–3) boundary_min and boundary_max contain two parameters which define the minimum and maximum boundary for calculating outliers in each iteration to assign the data points from data to potential clusters. The parameters contain a value, ranging from > 0 to < 1, where, e.g., 0.1 stands for 10%, which means, that for a sorted normalized list of data points, outliers are the data points which are in the first 10% and in the last 10% of the list. (4) num_simulation defines the number of boundaries, ranging from boundary_min to boundary_max which are evenly spaced and stored in a list for finding the best boundary in each iteration step. (5) q_tolerance sets the quantity tolerance factor of elements for building a new cluster, which aims to produce only evenly distributed elements over all clusters within the defined factor. (6) q_tolerance_increase_step contains a growth factor which is responsible for increasing the q_tolerance factor, if all simulation steps for the list of different boundaries (cf. input parameters 2–3) did not satisfy the quantity tolerance factor.

Algorithm 1
figure a

DDCAL (O(n log n).

As the first step (line #1), data is sorted in ascending order. In line #2, a list of clusters is generated from M, which contains M elements with ascending numbers, ranging from 0 to M-1. Next in #3, the temporary value q_aim_lo is calculated which determines the minimum quantity, a cluster must have to be created. The value is built by dividing the size of data which is not yet assigned to a cluster through the current q_tolerance factor. In the next step (#4), the data to cluster is normalized and stored as norm_data. For data points D, the feature scaling method min-max normalization is used with the following formula (Milligan and Cooper, 1988) to calculate relative results norm(d) of data point dD depending on the respective minimum min(D) and maximum value max(D):

$$ \text{norm(\textit{d})}=\frac{\text{\textit{d}}-\min(D)}{\max(D)-\min(D)} $$

Continuing with step #5, a list containing num_simulation evenly spaces boundaries is created which comprises a sorted list of decimal numbers, ranging from boundary_min to boundary_max. From #6 to #15, boundaries from the list of boundaries, which was created in #5 and starting with the lowest boundary (i.e., first element in the list), are treated. On step #7, all outlier elements from data are determined by using the current boundary from the list of boundaries and comparing it with norm_data which contains the same data points on the same position as data. The lower outliers are stored in q_lo and the upper outliers are stored in q_up. Step #8 adds neighboring elements from q_lo/up from not yet added data, as long as the standard deviation inside q_lo/up decreases. From line #9 to #14, there is a check with handling defined, if the number of elements from q_up and q_lo is above q_aim_lo. If yes, the differences from the number of elements from q_up/lo and q_aim_lo are calculated and stored in diff_q_aim_up/lo and the algorithm continues with line #19. If no, a next current boundary will be tested by continuing with line #6. The lines #16–#19 are only reached if no boundary was found (#6–#15) which produced both, diff_q_aim_lo/up. Thus, diff_q_aim_up or diff_q_aim_lo is not set and the q_tolerance factor is increased by the factor of q_tolerance_increase_step (#17). The algorithm continues with line #3 by calculating the q_aim_lo again for all boundaries to test. In steps #20–#28, a cluster is built from the outliers q_lo or q_up. q_lo is chosen, if the associated diff_q_aim_lo is smaller or equal diff_q_aim_up. Then, on line #21, a cluster is built from the data points of q_lo where each data point is assigned with the first element from the list of clusters. Next, the first element is removed from the list of clusters (#22) and the data points from q_lo are removed from data (#23). Otherwise, if the diff_q_aim_up is smaller than diff_q_aim_lo, a cluster is built from the data points of q_up where each data point is assigned with the last element from the list of clusters (#25). Then, the last element is removed from the list of clusters (#26) and the data points from q_up are removed as well from data (#27). In both cases, the build cluster is added to the list of clustered_results (#29). Line #30–#34 handles if the list of clusters contains only one element. Then, a cluster is built with all remaining data points from data which are assigned to the element from the list of clusters (#31). Next on steps #32–#33, the cluster is added to the list of clustered_results and the algorithm continues with step #39. If there exist no remaining data to cluster, the algorithm continues as well with step #39 (#35–#37). Otherwise, after a new cluster was built and added to the list of clustered_results and there exists data to cluster and as well a list of clusters containing more than one element, the algorithm starts the next iteration for building a cluster and continues with step #3 (#38). Finally, on #39, the built list of clustered_results is returned as output of the algorithm.

DDCAL produces stable results, i.e., the output of the algorithm remains the same after each execution if the data set and the parameters do not change. Furthermore, the algorithm does not contain nested loops of the input list data which results basically in O(n) with n = |D| in terms of runtime or space requirements if the number of clusters is considered as constant and the sorting step (#1) is left out, where all general sorting functions are, at best, O(n log n).

The algorithm uses the feature scaling method min-max normalization with boundaries in an iterative approach. The concept is inspired from statistical tests based on data distributions and their significance threshold which is in the case of this algorithm defined through a boundary. DDCAL compares in every iteration the whole remaining data set for building a new cluster. The term remaining means in this case that data points are assigned to a cluster, which are typically outliers, and then these data points are removed from the data set of data to cluster. By addressing Problem 1 (cf. Section 1), DDCAL aims to increase the number of elements in each cluster which is reflected in evenly distributed elements over all clusters (cf. Requirement 1). The even distribution can be influenced with the parameter q_tolerance, which sets the minimum amount of elements for building a cluster on a given range of boundaries to test. And if these boundaries do not produce the minimum amount of elements as outliers, the parameter q_tolerance_increase_step increases the q_tolerance factor and thus, influences the even distribution of the results as well.

By using this iterative approach on the remaining data for building clusters with outliers, it becomes more likely to be able to add more data points to clusters in subsequent iteration steps on lower boundaries, because they contain less outliers. To add too many data points into a cluster is hampered through the logic, that on each iteration, clusters are built starting with the lowest boundary to test and a cluster can actually be built, if the minimum amount of elements on both sides, the upper and lower bound, are reached or exceeded where the side with the fewer data points is used for building a cluster. This benefits of course Requirement 1, but through the iterative approach of removing outliers in every iteration, the variance of the whole remaining data set decreases. Therefore, clusters with outliers have lower variances in their clusters in later iterations which benefits Requirement 2 and also indirectly Requirement 3.

2.2 Clustering Quality Metrics

We use four quality metrics, i.e., number of used clusters (NUC), sum even distribution (SED), sum variances (SV), and mean silhouette coefficient (MSC) that capture the coverage of a clustering result with respect to Problem 1.

How evenly the data is distributed over built clusters is measured with the metric SED. The aim of DDCAL, which is described in detail in Algorithm 1, is to produce clusters with high SED values (Requirement 1), as top priority in addressing Problem 1, but also to consider Requirements 2 and 3 which are measured through the quality metrics SV and MSC. Section 3.4 employs the four quality metrics in order to compare DDCAL with other clustering algorithms.

Note, while DDCAL produces high SED values, we performed different simulations of parameters with existing algorithms if available and picked the results with the highest SED values in order to foster a fair comparison.

Definition 1 (Quality Metrics)

Consider Problem 1 with D \(\subseteq \mathbb {R}\) being a set of (one-dimensional) data points. Then, quality metrics NUC, SED, SV, and MSC measure the quality of a clustering C =< C1,...,Cm > (mM), where M denotes the targeted number of clusters. With respect to Requirements 1 and 2, the following quality metrics are defined:

  • NUC := \(\frac {m}{M}\)

  • SED := \({\prod }_{i=1}^{m} \mid C_{i} \mid \)

  • SV := \({\sum }_{i=1}^{m}{\sum }_{j=1}^{\mid C_{i} \mid } \mid \mid x_{ij} - z_{i} \mid \mid \) where m is the number of the actually built clusters, xij denotes the j th data point in cluster Ci, and zi the mean value of cluster Ci (see (Faber, 1994)).

  • MSC : = \(\frac {1}{\mid D \mid } * {\sum }_{d \in D} s(d)\) where s(d) denotes the silhouette coefficient for data point dD with \(s(d)=\frac {b(d)-a(d)}{\max \limits \{(a(d),b(d))\}}\) where a(d) is the average dissimilarity of d to all other objects of Ci, b(d) is the minimum dis(d,Cj) with ij, and dis(d,Cj) is the average dissimilarity of d to all objects of Cj (see (Rousseeuw, 1987)).

NUC ∈ [0;1] assesses the number of used clusters, in particular the implicit requirement to fill M clusters. NUC= 1 denotes the best result and NUC= 0 the worst. When comparing the different algorithms as described in Section 3, we compare just results with NUC= 1 which indicates that all targeted clusters were actually built and a fair comparison with the other clustering quality metrics is therefore valid. SED, the score of even distribution, evaluates how evenly the data is distributed over the clusters (↦ Requirement 1); the product of the cluster cardinalities is maximized for clusters of equal size and if NUC= 1. It is minimized for clusters, where every cluster contains just one element with the exception of one cluster which contains all remaining elements of a data set. Higher results of are better, where the best result appears, if all clusters contain the same number of elements and the worst result will be if all clusters contain just one element with the exception of one cluster which contains the remaining elements. SV, the sum of variances, measures the homogeneity of the clusters, which considers intra-cluster distances. Thus, it assesses low variance clusters which are compact (↦ Requirement 2) and is defined according to literature (e.g., (Faber, 1994)). A low value is therefore better than a higher one. MSC, the mean silhouette coefficient, considers compact (cf. SV) and clearly separated clusters (↦ Requirements 2 and 3). The possible results range from − 1 to 1 where 1 signals the optimum and thus clusters are built well apart from each other and are clearly distinguished. 0 shows that clusters are indifferent and − 1 is the worst result, where clusters are wrong assigned (e.g., many overlapping cluster ranges).

3 Assessment Based on Quality Metrics

3.1 Implementation of Algorithms

In order to evaluate the quality of Algorithm 1 (DDCAL), we compare it to a set of existing algorithms such as kmeans++ based on the quality metrics defined in Section 2.2. Furthermore, the parameters of DDCAL are evaluated.

In Section 3.2, the comparison is based on artificial data sets with different distributions and on these distributions, the evaluation of the DDCAL parameters is performed in Section 3.3. Finally, Section 3.4 compares the algorithms on real-world data sets. For this, we implemented Algorithm 1.Footnote 4 as well as a selection of existing algorithms for comparison.

We used Python 3.9 for implementing all algorithms. For data operations, the numpy framework (version 1.21.2) was used. Because every algorithm produces different outputs, like clustered data points, centroids, breaks, or extrema on a curve, we converted each output to a uniform list of resulting elements, where each element contains a cluster number and a data point. Through these conversion steps, the performance of an algorithm may differ and a comparison of runtime should be considered with caution. Nevertheless, long runtimes when comparing different algorithms on huge data sets are pointed out in this work. Further details on the implementation of all algorithms can be found in Appendix B.

3.2 Evaluation Based on Artificial Data with Different Distributions

We first want to understand how the results produced by Algorithm 1 relate to the distribution of the underlying data, also in comparison with k-means++, Jenks natural breaks, head/tail breaks, DBSCAN, KDE, GMM, mean shift, largest gaps, Gaussian kernel k-means, k-medoids, and trimmed k-means.

We generate artificial data sets of 1000 data points for different distributions. One thousand data points seem to result in meaningful data distributions to be distinguished from each other. Each data set contains at least one-third non-unique data points (cf. Section 1) to represent roughly a real-world data set. Moreover, when clustering the data sets by using M = 10 targeted clusters, each cluster contains about 100 data points if the results are evenly distributed over the clusters. Therefore, built clusters are small enough to be still be observed manually but are big enough to represent meaningful results. For representing different distributions, we have chosen well-known statistical distributions, to cover a broad range of possible data sets and to show how each algorithm, especially DDCAL, performs differently on different distributions. The following data sets are also accessible with details atFootnote 5 normal distribution with 369 unique data points, gumbel distribution with 393 unique data points, uniform distribution with 558 unique data points, exponential distribution with 587 unique data points, and two peaks distribution with 362 unique data points which consists for two randomly generated normal distributions where the first has 300 and the second has 700 data points (cf. Table 1).

Table 1 Algorithm results for SED if NUC= 1 from different artificially generated distributions

We compare existing clustering algorithms with DDCAL by using M = 10 on the artificial data sets with different distributions. The results are shown in Tables 12 and 3 by comparing based on metrics SED, SV, and MSC (cf. Section 2.2, and more details of the results can be found here.)Footnote 6 The parameters set for the different algorithms for distributions to produce these results are shown in Table 4. The default parameters, which are not changed regarding a particular distribution, except for DDCAL which is described in the following, are shown in Appendix 2, for each particular algorithm. Also for stochastic algorithms, where the results depend on trial because of random methods in the particular algorithm, the variances from 10 trials are shown. The results show one execution. k-medoids and Gaussian kernel k-means result in the highest variances and thus the worst results. Furthermore, for k-medoids, metric SED showed the highest variance on all distributions, with exception of the exponential distribution, where Gaussian kernel k-means has the highest variance. The trimmed k-means algorithms lead to the lowest variances, with the exception of metric SV on uniform distribution with trimmed k-means O+. Thus, the trimmed k-means algorithms perform best and also k-means++ shows low variances over all data sets.

Table 2 Algorithm results for SV, if NUC = 1, from different artificially generated distributions
Table 3 Algorithm results for MSC if NUC= 1, from different artificially generated distributions
Table 4 Algorithm parameters which were set for the different distributions as shown in Tables 12, and 3

For DDCAL (cf. Algorithm 1), we use the following input parameter values, i.e., boundary_min= 0.1, boundary_max= 0.49, num_simulation= 20, and q_tolerance_increase_step= 0.5. The parameter q_tolerance is set based on the distribution: for normal, gumbel, and two peaks distribution q_tolerance= 0.45 and for uniform and exponential with q_tolerance= 0.1. The setting of the input parameters is discussed in Section 3.3.

For other algorithms which need input parameters, we implemented simulation methods which aim to maximize the result of the metric SED if the precondition of NUC= 1 is fulfilled. Further details on each algorithm are described in Appendix 2.

The algorithm containing the best result is highlighted by a bold value in each table and on each distribution. We omit results and consequently algorithms which produce less than 10 clusters (i.e., NUC< 1) due to a fair comparison of the different algorithms. Also, results from trimmed k-means are excluded in the evaluation of the results because the algorithm filters outliers. The parameter of the trim factor is set to 0.1 which means that for 10% of the data to cluster, a separate cluster is built which contains identified outliers of the data set. This behavior violates 2 of 10 requirements set out in Section 1: (6) to avoid filtering/omitting of elements by the cluster algorithm and (5) to avoid overlapping cluster ranges. The latter is violated, because outliers are thrown in a separate cluster which contains likely elements from all areas of the data set. For this reason, bold values from trimmed k-means are highlighted in color blue if the result is best among the compared algorithms. We executed trimmed k-means twice on each data set, where the postfix O- indicates that only results without outliers are considered for metrics SED, SV, and MSC in order to avoid results with overlapping clusters as described before. Therefore, if 10% of the data points in the data set are trimmed, only the remaining 90% of the data points with their built clusters are evaluated. Hence, when M = 10 clusters are targeted, as described before, the actual targeted number on the algorithm is set to M+ 1 (i.e., 11). From the actually built clusters, only m-1 clusters, without the 1 outlier cluster are considered with the evaluation metrics. With this approach, the trimming of outlier data to cluster is performed, but no overlapping cluster ranges are produced. In contrast, if the postfix O+ is shown for trimmed k-means, also the cluster which contained the filtered outliers is considered based on metrics SED, SV, and MSC. However, using this approach, the appearance of overlapping cluster ranges is likely and thus it might not be useful for certain use cases (cf. Data Set 1 in Appendix 1).

In Table 1, the SED results are shown where the highest values represent the best ones. DDCAL performs best on the normal, gumbel, uniform, and exponential distribution and is slightly behind Gaussian kernel k-means (average rank of 2.2) on the two peaks distribution, resulting in an average rank of 1.2. trimmed k-means O+ performs also good on all distributions and best on normal and two peaks distribution. However, as mentioned before, overlapping clusters are produced on all distributions due to the outlier cluster, which contained 10% of the whole data set to cluster.

Table 2 shows the result for SV, where the lowest results indicate the best performance of an algorithm. DDCAL performs below average, except on the normal (rank 2) and gumbel (rank 3) distribution. In comparison, when calculating the average rank of all distributions, KDE performed best with 2.4, followed by k-means++ with 3.4 and GMM with 3.6. DDCAL had an average of 5.2 and was ranked as “6th” best algorithm, out of 8.

Table 3 shows the results of metric MSC, where the highest value shows the best result. k-means++, Jenks natural breaks, and GMM perform best on all distributions. These algorithms achieve always the first or second place, with the exception of GMM on the uniform distribution with only rank 7. Note that equal results have the same rank. DDCAL ranks 5 on uniform distribution, 6 on normal, gumbel, and two peaks distribution and 7 on the exponential distribution. By calculating the average rank of all distributions, k-means++ and Jenks natural breaks share the first place as best algorithms with 1.4, followed by GMM with 2.6 and k-medoids with 4. DDCAL scores below the average with an average rank of 6. Only KDE achieves a lower result with 6.6. MSC ranges from − 1 to + 1 and in most cases, except for the exponential distribution, the results for DDCAL differ from the results of the other algorithms on the second decimal place. Hence, we can still say that DDCAL performs close to the best performing algorithms.

When putting the ranks of all metrics on each algorithm together, by calculating the average rank on each distribution, DDCAL and Gaussian kernel k-means perform best on the normal distribution with an average order value of 3.0 (e.g., for DDCAL: \(\frac {1+2+6}{3}\)) from SED, SV, and MSC. On the gumbel distribution, DDCAL and GMM perform best with 3.3. For the other distributions, DDCAL performs worse than on the normal and gumbel distribution, where, for example, on the uniform distribution, k-means++ and Jenks natural breaks have the best average rank with 1.7 and DDCAL is on “5th” place. When looking at the exponential distribution, GMM scores best with 3.0 and DDCAL performs poorly with the “7th” place and an average rank of 5.3. Finally, on the two peaks distribution, k-means++ and GMM perform best with 2.7. DDCAL ranks 6 with an average score of 4.7.

Additionally to the ranking techniques as shown above and to get an even better picture of the effectiveness of the algorithms, we consider the differences between the scores for metrics SED, SV, and MSC by each algorithm. First, we normalize the results for each of the three metrics for each distribution by using the feature scaling method min-max normalization with the equation from Section 2 to obtain scores, ranging between 0 and 1, where 1 shows the best result, e.g., on SED for normal distribution DDCAL= 1.0 (indicates the best rank), KDE = 0.59 (indicates that the result is about less the half than the best ranking result), and largest gaps = 0 (indicates worst rank). On SV, we have to do an extra step after normalization, because the best result has the lowest score. Thus, we subtract each normalized result from 1, e.g., for KDE: 1 − 0 = 1. For metrics SED on the normal distribution, we discover that, for example, DDCAL has a normalized result of 1 and the second best is Gaussian kernel k-means with only 0.40. Nearly the same picture is shown when we look at the gumbel distribution, where the DDCAL has a normalized result of 1 and the second is Gaussian kernel k-means with 0.20. With the exception of the two peaks distribution, where DDCAL ranks second best 0.68 behind Gaussian kernel k-means, DDCAL performs best on all tested distributions. When DDCAL is compared based on metric SV, the results are about average, i.e., 0.91 on normal distribution, 0.69 on gumbel distribution, 0.98 on uniform distribution, and 0.71 on two peaks distribution, except for the exponential distribution with a result of 0. The normalized results for MSC show nearly the same picture as shown for the previous metric SV: 0.89 on normal distribution, 0.6 on gumbel distribution, 0.92 on uniform distribution, and 0.74 on two peaks distribution. Similarly to SV, DDCAL shows an below-average performance with with 0.15 for MSC with the exponential distribution. By averaging the normalized results from all metrics, where each metric is considered with equal weight, on each distribution, DDCAL performs best on normal distribution with 0.94 (\(\frac {1+0.92+0.89}{3}\)), followed by Gaussian kernel k-means with 0.73. On gumbel distribution, DDCAL performs best with an average of 0.77, followed by Gaussian kernel k-means with 0.53 which was close to k-means++ with 0.51. For the uniform distribution, the first three algorithms lead to similar results, i.e., 0.984 for k-means++, 0.981 for Jenks natural breaks, and 0.97 for DDCAL. On the exponential distribution, KDE performs best with an average of 0.51 and DDCAL ranks 7 with a score of 0.38. Finally, on the two peaks distribution, Gaussian kernel k-means has the highest average score with 0.92 and DDCAL ranks 4th place with a score of 0.71. Based on the sum of the normalized results over all distributions from all metrics the algorithms can be ranked as follows: DDCAL with 3.77 (0.94 + 0.77 + 0.97 + 0.38 + 0.71), Gaussian kernel k-means with 3.41, k-means++ with 3.35, GMM with 3.26, k-medoids with 3.25, Jenks natural breaks with 2.91, KDE with 2.67, and largest gaps with 0.99.

Through the analysis of the different metrics, it seems that DDCAL performs well on normal and gumbel distributions. By considering the mechanics of DDCAL, outliers above or below a threshold are the first generated clusters. After every iteration, these outliers are removed continuously from the input list and assigned to clusters. Through this procedure, the input list becomes more evenly distributed. DDCAL builds only a cluster in an iteration if on both sides, above and below a threshold, outliers are identified as potential clusters, containing a minimum amount of data points. Hence, the algorithm performs better, if just one peak with two tails exists in the distribution of a given data set, and also because outliers are considered as local in DDCAL instead of global. DDCAL can be improved on data sets with two or more peaks by a pre-processing step which cuts the data set into two or more slices where every data set contains one distribution.

3.3 Evaluation of DDCAL Parameters on Different Distributions

In the previous sections, we proposed particular values for parameters on DDCAL which we also used for evaluation. These default parameters (boundary_min= 0.1, boundary_max= 0.49, num_simulation= 20, q_tolerance_increase_step= 0.5, q_tolerance= 0.45 resp. 0.1) are chosen because of the results from simulations on different distributions, as shown in Figs. 34, and 5 for the DDCAL algorithm (more details of the results can be found here.)Footnote 7 We left out the gumbel distribution because the impact on the results, by setting different input parameters, is nearly the same as on the normal distribution.

Fig. 3
figure 3

Visualization of metrics SED, SV, and MSC for DDCAL on different distributions with different q_tolerance parameters and M = 10

Fig. 4
figure 4

Visualization of metrics SED, SV, and MSC for DDCAL on different distributions with different boundary_max parameters and M = 10

Fig. 5
figure 5

Visualization of metrics SED, SV, and MSC or DDCAL Advanced on different distributions with different boundary_min parameters and M = 10

At first, we evaluate different settings for parameter q_tolerance, which defines the minimum quantity of elements for building a cluster. As shown in Fig. 3, we tested different values between 0.005 and 0.5 with the observation that DDCAL performs well on normal (also gumbel) and two peaks distribution, when the parameter is set to 0.45. On the uniform and exponential distribution, 0.1 performs well. On the two peaks and exponential distribution, there exist many jumps in the results of the metrics. Therefore, a high q_tolerance factor of, for example, 0.4 on the exponential distribution can perform well, resulting in a lower SED value and the best value for MSC.

Overall, we recommend for real-world data sets, which are similar to the tested distributions, to set q_tolerance to a higher value like 0.45 on normal and gumbel distribution and to a lower value, like 0.1 on a uniform distribution. On exponential and two peaks distribution, we recommend to test both high and low q_tolerance factors, such as 0.45 and 0.1.

The q_tolerance has huge impact on the results on different distributions and can be used to adjust the trade-off between the results of SED and the classical clustering metrics SV and MSC in small steps, e.g., by setting the parameter from 0.45 to 0.5 on the normal distribution, where a higher q_tolerance factor decreases SED (which is still high) but produces better results for SV (which gets lower) and MSC (which increases).

For further tests on different parameters, we use the best performing q_tolerance setting on each distribution as discovered before. Note, we also performed tests on the uniform and exponential distribution by setting q_tolerance to 0.45, even though our observation showed that a lower q_tolerance factor seems to perform better. In most cases, except for testing the parameter boundary_min on exponential distribution, similar curves in the diagram are produced and the results on all three metrics decrease.

For DDCAL, we consider boundaries up to 0.49 as meaningful with respect to achieving fair results and because clustering outliers above 49%, these results cannot be considered as “outliers,” as they make up more than half of the remaining data points on each iteration step. The results depicted in Fig. 4, for which different boundary_max parameters (ranging from 0.15 to 0.9) are tested, show that the overall performance is good with a maximum boundary of 0.49. Thus, we recommend to use the parameter boundary_max= 0.49 on all distributions. The exponential distribution with a higher value performs also good and even better on SED.

When investigating the parameter boundary_min, where the results are shown in Fig. 5, it turns out that to start with a 10% boundary (minimum boundary) for clustering outliers, seems to be a good balance between a high SED value and good results on SV, and MSC. Starting with 15%, leads to DDCAL performing even better in some cases. For example, on the exponential distribution boundary_min= 0.15 yields better results than = 0.1 on SV and MSC on slight costs of SED, but on normal distribution this observation is reversed, where SED is slightly higher on costs of SV and MSC. Therefore, we recommend to use boundary_min= 0.1 on all distributions and to test boundary_min= 0.15 for fine tuning.

With the parameter num_simulations, as described in Algorithm 1, we defined the number of simulation steps for testing different boundaries on each iteration in DDCAL. These boundaries are evenly spaced values from the defined input parameters boundary_min to boundary_max with the defined number of simulation steps, where we consider that 20 elements are sufficient to test all necessary boundaries.

This number is confirmed through testing different numbers of simulation steps (above and below) which show that less than 20 simulation steps decrease the results from the metrics and more than 20 simulation steps do not show significant improvements, but increase the runtime of the algorithm.

Also, the parameter q_tolerance_increase_step is tested with different values. A value of 0.5 seems to perform best, but slight changes have hardly any impact on the results. When setting q_tolerance_increase_step too low, more iteration steps are performed, which increases the runtime of the algorithm. A too high value such as 3 decreases the performance of the metrics, especially for SED, because clusters are built too early when the minimum quantity for building a cluster is reached immediately.

We tested DDCAL by changing more than one parameter at the same time, but the results from the metrics always decrease. Thus, we recommend to change only one parameter value and to use the recommendations as described before. In detail, parameter q_tolerance can be used as pivot for better results. For slight improvements, the parameter boundary_min can be changed, as well. An exception, where the boundary_min changes the results completely, is shown on the two peaks distribution by setting boundary_min to the value 0.35 instead of 0.1, which shows, when comparing DDCAL with 7 other algorithms (cf. Tables 12, and 3) the following results: the performance of SED declines from ranks 2 to 6. The rank of SV remains 6 and MSC improves from 6 to the best performing algorithm (rank 1) with an MSC value of 0.56.

Discussion: Throughout the paper, we assume M = 10, but other M values were tested on the data sets described in Section 3.2. On data set normal distribution, for example, we compared DDCAL with Jenks natural breaks on different values of M (more details of the results can be found here.)Footnote 8 We have chosen Jenks natural breaks for comparison, because it was the best performing algorithm in Section 3.2 (cf. Tables 12, and 3) with stable results on M = 10. Jenks natural breaks ranks third behind DDCAL and k-means++ when calculating the average rank for each distribution (cf. Section 3.2) and ranks sixth behind DDCAL and the stochastic algorithms Gaussian kernel k-means, k-means++, GMM, and k-medoids when calculating the normalized results.

For DDCAL, we used the default parameters (boundary_min= 0.1, boundary_max= 0.49, num_simulation= 20, q_tolerance_increase_step= 0.5, q_tolerance= 0.45) as stated above. First we observe that for M ∈ {3,4}, Jenks natural breaks shows better results for SED, SV, and MSC. For M∈ {5, 6, 8, 12, 15, 30}, SED and SV perform better for DDCAL, but MSC turns out slightly worse than Jenks natural breaks, which shows the same observation as with M = 10 (cf. Tables 12, and 3). As shown before, in some cases, MSC can be improved without costs of other parameter performance, when setting the parameter boundary_min to = 0.15 instead of 0.1. For M = 12, for example, SED improves from 7.78e + 22 to 9.01e + 22, SV remains the same value of 3.69e-01, and MSC improves from 0.48 to 0.5 where Jenks natural breaks has values of SED= 8.11e + 21, SV= 4.90e-01 and MSC= 0.54. The same observation of the improvement of DDCAL by changing this parameter is discovered for M = 6, where all metrics improve. By changing the parameter for other values of M, the performance of the metrics decreases.

When setting M to a high value of, e.g., 100 (with just 1000 data points in the data set), Jenks natural breaks produces 100 clusters where DDCAL produced only 44 clusters. The lower number of produced clusters in the output results from the termination of the algorithm, if no data is left for clustering (cf. Algorithm 1 step #36) and hence it makes no sense to produce more clusters. Because of the fewer clusters, the results of DDCAL rank behind Jenks natural breaks for metrics SED (1.51e + 52 vs. 3.50e + 90) and SV (3.97e-01 vs. 3.86e-01) and better for metric MSC (0.53 vs. 0.5). However, a fair comparison can only be made if both algorithms produce the same number of clusters. When setting the parameter boundary_min to 0.01 instead of 0.1, DDCAL produces 100 clusters, where all metrics perform better than Jenks natural breaks (SED= 1.17e + 97, SV= 3.40e-01, and MSC= 0.54). By lowering boundary_min, it is more likely that the envisaged number of data points per cluster is found, because even low deviations of the data points are counted as outliers (cf. Algorithm 1 step #7). In other words, if not a sufficient number of clusters (m<M) are built after executing DDCAL, boundary_min can be lowered, to fill the target number of clusters (M) with data points.

3.4 Evaluation Based on Real-world Data Sets

To compare DDCAL with existing algorithms as described in Sections 3.1 and 3.2, we use four real-world data sets from different domains in order to show the wide range of possible applications, i.e., Data Set 1 on search processes, Data Set 2 on US population, Data Set 3 on star distances, and Data Set 4 on weather (all data sets are described in Appendix 1).

The results of comparing selected algorithms (cf. Appendix B) are shown in Tables 567, and 8 (more details of the results can be found here).Footnote 9 Like shown on Tables 12, and 3 (cf. Section 3.2), each table contains variances from 10 trials for stochastic algorithms, where the yielded results depend on the trial because of random methods in the particular algorithm. The results of each of these algorithms represent one execution for the actual comparison among the other algorithms.

Table 5 Results for Data Set 1 (process mining) by using M = 10 where max SED = 1.40e + 05 (except for trimmed k-means O- (where O- stands for without outliers), which means that a defined percentage (= 10%) of elements to cluster were filtered because they were considered as outliers)
Table 6 Results for Data Set 2 (U.S. population 2018) by using M = 10 where max SED = 6.25e + 06 (except for trimmed k-means O- (where O- stands for without outliers), which means that a defined percentage (= 10%) of elements to cluster were filtered because they were considered as outliers)
Table 7 Results for Data Set 3 (distances to stars) by using M = 10 where max SED = 6.00e + 40 (except for trimmed k-means O- (where O- stands for without outliers), which means that a defined percentage (= 10%) of elements to cluster were filtered because they were considered as outliers)
Table 8 Results for Data Set 4 (weather) — min temperatures by using M = 10 where max SED = 4.31e + 15 (except for trimmed k-means O- (where O- stands for without outliers), which means that a defined percentage (= 10%) of elements to cluster were filtered because they were considered as outliers)

In the description of each table, the maximum SED value is listed for the data set to indicate how close each algorithm performs to this value. Bold values indicate the best results among the compared algorithms in a table if NUC= 1.0 and therefore M is reached (m=M). As described in Section 3.2, in the following, we discuss only those results with NUC= 1.0 and without trimmed k-means where bold values from this algorithm are highlighted in color blue if the result was best among the compared algorithms.

The default parameters for the comparison algorithms which are not changed regarding a particular data set are shown in Appendix 2. For the DDCAL algorithm, we used the default parameters (boundary_min= 0.1 resp. 0.15, boundary_max= 0.49, num_simulation= 20, q_tolerance_increase_step= 0.5, and q_tolerance= 0.45 resp. 0.1) as described in Sections 3.2 and 3.3. For Data Set 4, q_tolerance is set to 0.1 because of the uniform distribution and on all other data sets, q_tolerance is set to 0.45. For Data Set 2, boundary_min is set to 0.15 because the metrics SED, SV, and MSC perform better. For Data Set 4, boundary_min is set to 0.15, where SED performs a little weaker than 0.1, but the performance increases for metrics SV and MSC.

DDCAL achieves outstanding results given an equal distribution of the data points over the clusters (SED) for all real-world data sets, where DDCAL hit always rank 1 in comparison to all other algorithms. On SV and MSC, the performance of DDCAL is in the midfield or below.

On Data Set 1 (process mining), DCAL ranks 6 for SV 1 for MSC. k-means++ and DBSCAN yield the same output and thus the metrics have the same results. They both rank 1 for metric SV, 2 for MSC, and second last for metric SED which demonstrates a trade-off between SED and the classical clustering metrics SV and MSC. For the stochastic algorithms, k-means shows the same results for all trials and the variance for all metrics thus turns out as 0. Also, the trimmed k-means algorithms performs with no variance for SED. The Gaussian kernel k-means performs worst regarding variances from the 10 trials.

For Data Set 2 (U.S. population 2018), mean shift followed by KDE performed best on metric SV and DDCAL had the “9th” rank out of 10. Regarding metric MSC, DDCAL ranks on place number 7 and k-means++ ranks the first place, followed by GMM. Similar to the previous discussed data set, the variance of k-means++ is the lowest and thus the best and for Gaussian kernel k-means the variance is the highest and thus the worst on stochastic algorithms on this data set.

For Data Set 3 (Stars), with respect to metric SV, DDCAL ranks 4 out of 7 algorithms. k-means++ and GMM perform best. For MSC, DDCAL hits the last place where DBSCAN hits the first, and largest gaps the second place. When comparing the variances of the stochastic algorithms, the trimmed k-means algorithms perform best with low variances where GMM shows the highest variances after 10 trials.

Finally, on Data Set 4 (Weather), for metrics SV and MSC, DDCAL scores between the “6th” and the “7th” rank out of 8. GMM performs best for SV and MSC. k-medoids has the second best results for SV and the best ones for SV. Gaussian kernel k-means performs second best for MSC. For the stochastic algorithms, the variances of trimmed k-means O+ are the highest and thus worst among the compared algorithms. The variances for Gaussian kernel k-means are the lowest and thus the algorithm performs best when comparing the deviations on the metrics from 10 trials.

As for the synthetic data sets (cf. Section 3.2), we sum up all results for all metrics for each distribution and algorithm by using the feature scaling method min-max normalization using the equation provided in Section 2. The scores range from 0 to 1 where 1 indicates the best result. For Data Set 1 (process mining), DDCAL hit the first place, followed by k-means++ and DBSCAN. For Data Set 2 (U.S. population 2018), k-means++ performs best, followed by DDCAL and k-medoids. For Data Set 3 (Stars), DDCAL shows the best overall performance, followed by DBSCAN and GMM. Finally, for Data Set 4 (Weather), DDCAL performs best and is followed by k-means++ and Jenks natural breaks.

In summary, for metrics SV and MSC, none of the algorithms achieves a good ranking on all real-world data sets. In contrast, the SED results on DDCAL come close to the maximum SED value for each of the real-world data sets. Furthermore, the results show that there exists a trade-off between metric SED and the classical clustering metrics SV and MSC, as discussed earlier. Section 4 will illustrate why a high SED value from clustered results as pre-processing of 1d data for visualization is favorable based on different use cases, as also already mentioned in Section 1.

The comparison of the characteristics of the algorithms leads to the following observations: k-means++ shows a good overall performance on the real-world data sets. One reason is that the real-world data sets are similar to an exponential and uniform distribution, where the algorithm also achieves good results (cf. Section 3.2).

k-means++, GMM, Gaussian kernel k-means, k-medoids, and trimmed k-means (which was left out in the comparison) are struggling to be reproducible, because different results are yielded depending on the trial (Thrun, 2021). Thus, after each run, the clustered results differ and thus, do not produce stable results like DDCAL. This behavior produces a weak violation of one requirement set out in Section 1. We speak of a weak violation, because trial sensitive random methods can be set to a fixed number (e.g., the random_state parameter for k-means++ on the Python framework sklearn)Footnote 10, with the effect that the random method will be deterministic and returns always the same results. However, this approach may not lead to the best results of an algorithm.

The variance of the before mentioned algorithms is shown for the metrics in Tables 123567, and 8 because we use random methods without restrictions. Head/tail breaks, DBSCAN, KDE, and mean shift do not entail any parameter to define the number of aimed clusters, M. Except for head/tail breaks, the aforementioned algorithms have input parameters for which we use simulation methods to maximize metric SED which implies to optimize m. We use the simulation method also in order to not exceed M for the particular algorithms. In terms of runtime, mean shift, trimmed k-means, and Gaussian kernel k-means are the slowest algorithms on the biggest Data Set 3. For 119,614 data points, the Gaussian kernel k-means is not able to terminate within 48 h, the trimmed k-means algorithms takes more than 15 h, and the mean shift algorithm was quite slow on this data set with an execution time of about 4 min without the simulation method. DDCAL has a runtime of about 3 s with its built-in simulation method (the algorithm has a complexity of O(n log n), by including the sorting step at the beginning. Regarding the observed runtime of the algorithms, it should be noted that their implementation in Python (cf. Section 3.1) may use different frameworks and data structures under the hood. This might cause different results in terms of performance. Thus, the observations regarding runtime should be considered with caution and interpreted as experimental run and are not further discussed. Also, k-medoids is not able to to produce results on the biggest Data Set 3, because the computer, where the algorithm was executed, ran out of memory.

4 Application: Visualization of Data on Maps and Process Models

Getting the most insights out of the data can be supported by the “[t]ight integration of visual and automatic data analysis methods” (Keim et al., 2008); particularly, the model building can be seen as the iterative application of data mining methods and visual exploration.

Hence, we can identify pre-processing of 1d data for visualization on business process models and (cloropleth) maps as applications for DDCAL (Algorithm 1).

The visualization of business process models represented by directed graphs, as introduced in Section 1, can be enriched by data connected with the nodes of the graph. Data Set 1, for example, stores the number of search results for nodes in a process model that reflects the search behavior of customers. Figure 6 depicts the process model that is used as underlying structure for visualizing Data Set 1 using different clustering algorithms, i.e., k-means++ and DBSCAN in Fig. 6(a) and DDCAL in Fig. 6(b), for coloring each node based on their frequency, labeled as freq in each node. The results for all clustering algorithms are shown in Table 5, where k-means++ and DBSCAN had the same output. We have chosen k-means++ and DBSCAN to compare with the DDCAL because its overall normalized performance among algorithms from the metrics SED, SV, and MSC together was the second best performing with NUC= 1, where DDCAL performed best (cf. Section 3.4).

Fig. 6
figure 6

Process mining data set pre-processed by (a) k-means++/DBSCAN and (b) DDCAL

At first sight, both visualization results differ. In Fig. 6(a) which used k-means++ and DBSCAN for coloring the nodes, there is no difference between the most frequently occurred frequencies in the process model, namely between frequencies 3 and 4 which were assigned to the the same “green” color gradient. By contrast, in Fig. 6(b), there is a distinction between frequency 3 and 4 which is highlighted as a blue frame in the process model and for readability, and these nodes are magnified as well in a blue box on the left side. Such a distinction is important as the whole process model shows 33 nodes in total and nodes with frequency 3 appear 7 times (i.e., 21%) and nodes with frequency 4 appear 6 times (i.e., 18%). Furthermore, if a visualization tool for process models supports a filter option to filter out nodes with low frequencies, for example to show just frequencies > 4, the distinction between often occurred nodes with colors is important because an observer can see visually, which elements are affected from such a filtering step. Generally, a higher variety of colors apart from just “green” tones is shown in the process model which helps to recognize heat maps, for highlighting regions with a high frequency. Such an example is demonstrated through the violet highlighted region in the process model. This also enables a stronger distinction of execution frequencies and in the sequel execution behavior of users between these activities. Furthermore, a happy flow of process paths can be easily discovered at first glance through the coloring of nodes based on their frequencies in process models. A happy flow from process paths which contain more than 4 nodes is highlighted with a red frame.

Note that the clusters do not have to be necessarily mapped onto colors for visualization. For the process mining use case, a mapping onto edges in terms of stroke width is also conceivable.

A different visualization is shown on Data Set 2. The goal on this data set is to visualize the data on a US map in order to distinguish the population sizes of different states. For visualization of a choropleth map, we downloaded the shape files of the US states from the year 2018 from census.gov.Footnote 11 Then, the data is pre-processed by clustering using (a) Jenks natural breaks and (b) DDCAL. The results of all clustering algorithms were already introduced in Table 6 and discussed as well in Section 3.4. We used the algorithm Jenks natural breaks to compare with DDCAL because it was designed to visualize choropleth maps (Jenks, 1967). Figure 7 depicts the results for both algorithms. In Fig. 7(a), it is shown that the clustering by using Jenks natural breaks does not result in well distinguishable colors for states with low population sizes due to the “dominance” of a few states with high population size in the south and east. Thus, the majority of the states with lower populations (in green colors) can hardly be distinguished. Compared to Jenks natural breaks, the visualization in Fig. 7(b), by using the clustering algorithm DDCAL, the assigned colors to the states lead to a more fine-grained exploitation, enabling the distinction of the population sizes of more states, particularly of those with lower population sizes.

Fig. 7
figure 7

Process mining data set pre-processed by (a) Jenks natural breaks and (b) DDCAL

Additionally, Data Set 5 features data on the Corona pandemic that has posed highest priority to monitor closely the infection development in order to enable quick reactions to (local) outbreaks and subsequent mitigation actions (Thomas et al., 2020). Several visualizations have been provided, for example, the dashboard of the World Health Organization (WHO).Footnote 12 For the visualization with cluster pre-processing, we take the subset of absolute confirmed cases from 2020-06-05. That day, the US accounted for around 28% of all confirmed cases world-wide while the 52 African states captured by the data together account for around 2.6%. Hence, we can speak of outliers in the data. Figure 8 displays two visualizations of the data set that have been pre-processed by clustering and are displayed using the app.Footnote 13 In (a), k-means++ is used for clustering, in (b) the DDCAL (cf. Section 2) algorithm to be proposed in this work. At first glance, it can be seen that in (a) there is almost no visual differentiation for African countries, whereas in (b) different clusters can be differentiated. For example, the cluster containing Egypt (0.5%) and South Africa (0.6%) can be distinguished from a cluster containing — among other countries — Nigeria (0.18%) and Algeria (0.15%) and furthermore, Mali (0.002%) can be distinguished from Niger (0.001%). The numbers in brackets denote the percentages of confirmed cases in the respective countries in relation to the number of confirmed cases world-wide.

Fig. 8
figure 8

Corona visualization of absolute confirmed cases for the date 2020-06-05 from the Johns Hopkins CSSE data set pre-processed by (a) k-means++ and (b) DDCAL

5 Related Work

The definition of clustering is still an open discussion (Estivill-Castro, 2002). In some ways, it is described to group similar data into a cluster for permitting a significant generalization (Bonner, 1964) and it helps in data mining to identify interesting structures in data (Thrun et al., 2020). A different clustering concept, which is not treated in this work, is conceptual clustering, which accepts a set of object descriptions to produce a classification scheme over observations (Fisher, 1987). Section 3 provides a detailed comparison of DDCAL with a set of related clustering algorithms, i.e., k-means++ (Arthur & Vassilvitskii, 2007) (Arthur & Vassilvitskii, 2006), Jenks natural breaks (Jenks, 1967), head/tail breaks (Jiang, 2013), DBSCAN (Ester et al., 1996), KDE (Scott, 2015), GMM (Reynolds, 2009), mean shift (Comaniciu & Meer, 2002), largest gaps, Gaussian kernel k-means (Dhillon et al., 2004), k-medoids (Park & Jun, 2009), and trimmed k-means (Cuesta-Albertos et al., 1997). Further algorithms which can be used for clustering of (1d) data points are, for example, hierarchical cluster methods like pairwise agglomerative clustering as described in Faber (1994). This method has a major drawback in terms of performance on large data sets because each iterative step requires multiple distance calculations. Many popular algorithms such as k-means have various modifications of their basic concept, e.g., the k-medoids algorithms (Thrun, 2018) which are designed to be less sensitive to outliers than the original algorithm. One implementation here is the PAM algorithm (Kaufman and Rousseeuw, 1990) which uses medoids instead of centroids, i.e., data points in the data set itself and the Minkowski distance. One-dimensional clustering can be used to generate choropleth maps (Wright, 1938) for visualizing how a measurement varies across a geographic area. Another use case for one-dimensional clustering is to generate heat maps, e.g., for visualizing the frequency of nodes or edges in directed graphs from mined processes by using process mining algorithms (van der Aalst, 2016), which are similar to choropleth maps, but with the difference that they do not use geographic areas. Fair clustering techniques such as Chierichetti et al. (2017) aim at minimizing distances between the data points within the clusters and maximizing the distances of the points between different clusters. On top of that, they try to respect fairness constraints with respect to possibly sensitive features such as age. The DDCAL algorithm can be interpreted as fair clustering technique with respect to the even distribution of the data points over the clusters, but do not assign any sensitivity to features.

6 Summary and Outlook

Summary Regarding Research Questions:

This work addresses research questions RQ1–RQ3 as set out in Section 1 as follows: DDCAL constitutes a heuristic clustering algorithm for evenly distributing data into clusters over a maximum number of low variance clusters based on the iterative feature scaling method min-max normalization which is also known as rescaling (RQ1). Regarding RQ2, we studied DDCAL on several synthetic and real-world data sets and compared the results to 11 existing clustering algorithms. From the synthetic data sets it can be observed that DDCAL performs well for data with outliers and data following (tailed) distributions with one peak, which show a bell curve such as normal and gumbel distribution. Additionally, DDCAL has a good performance on uniformly distributed data, or the number of peaks in the data set is equal or higher than the number of targeted clusters (M). If the number of peaks is lower than the number of targeted clusters or the data set to cluster is exponentially distributed, the DDCAL shows weaknesses. For DDCAL, outliers are particularly treated in a way such that they do not “dominate” the resulting clusters. Three use cases from process model/mining to map visualization indicate that DDCAL results in a more differentiated color grading and hence might lead to a more effective visualization of the data (RQ3).


The assessment of the algorithms is based on four quality metrics applied to synthetic as well as real-world data sets. We do not employ supervised quality measures because no gold standard for the clusters on the used data sets is available which perfectly suits the trade-off between the metrics SED, an even distribution of data points into clusters and the classical clustering metrics SV and MSC. As shown in Section 5, even the definition of clustering is still an open discussion. The usage of unsupervised quality measures is always biased (Handl et al., 2005; Thrun, 2021) and therefore we evaluate the results based on use cases where we show what additional information could be observed by using DDCAL in comparison to other algorithms which had a good performance on the employed quality metrics.

We observed that DDCAL basically outperforms the other algorithms when evenly distributing data over all clusters and shows average results on building low variance clusters. Thus, DDCAL can be seen as good “all-rounder” for use cases demanding for evenly distributing data elements into a given number of clusters. k-means++ yields good results for all quality metrics. Gaussian kernel k-means performs well regarding an even distribution of data over all clusters. Both algorithms have one drawback, which is the missing “reproducibility” of the results because the initial cluster centers are set differently after each initiation run, which leads to different cluster results after every execution. Gaussian kernel k-means has further problems: The algorithm produces overlapping clusters in some cases and has a high time and space complexity. Moreover, for Gaussian kernel k-means and other existing algorithms analyzed in this work requiring input parameters, there is no research for setting these parameters for 1d data to produce results with an even distribution of data over all clusters and at the same time, having a low variance in clusters. We tackled this problem by simulating different input parameters, but possibly better approaches may be used with further research.

Higher-Dimensional Data:

Currently, we are working on extending DDCAL for clustering multidimensional data sets. Starting with 2d data, we follow two directions using (a) several distance measures and (b) pre-processing the data. For (a), we can use, for example, the Euclidean distance, Manhattan distance, cosine coefficient, Jaccard coefficient, dice coefficient, Minkowski distance, root mean squared error coefficientFootnote 14, and TS-SS (Heidarian and Dinneen, 2016) by comparing them with extreme data points, like (max-x-value/min-y-value), (min-x-value/max-y-value), (max-x-value/max-y-value), and (min-x-value/min-y-value). Option (b) includes an additional pre-processing step which is responsible for converting the multidimensional data points into one dimension and then using the core concepts of DDCAL. Subsequently, we aim at developing an algorithm for merging the produced clusters from each dimension for building multidimensional clusters.

Future Work:

Beyond data visualization, we will evaluate how DDCAL can be used for problems like clustering test scores of students for grading (Faber, 1994) or to build evenly distributed learning groups based on previous test scores. Additionally, we will investigate whether and how DDCAL can be used for indexing of data to achieve, for example, a faster information retrieval.