Cluster representation
Before assessing how various clustering algorithms, distance measures, classifiers, etc., influence the discussed process, we first need to decide on the one component which has not yet been completely defined, namely cluster representation, in other words, how to encode the discovered clusters as new features. So far, we have only stated that this encoding is based on the distances between the examples and cluster representatives. This, however, can be done in many different ways. In this experiment, we are going to test 5 methods of encoding clusters as features, which we refer to as: binary, binary distance, distance, inverse distance squared, and probability.
Binary representation
According to our framework, each feature corresponds to a certain cluster, so the most straightforward encoding is one in which each feature encodes whether a given example \(\pmb {x_i}\) belongs to the cluster with a given representative \(\pmb {c_l}\) or not:
$$\begin{aligned} \pmb {x_{im+l}} = {\left\{ \begin{array}{ll} 1 & \pmb {x_i} \text { belongs to the cluster represented by } \pmb {c_l} \\ 0 & \text {otherwise}. \end{array}\right. } \end{aligned}$$
Binary distance representation
Since not all examples are at the same distance from the centers of their clusters, a natural extension of the binary representation is to encode the distance from a given example \(\pmb {x_i}\) to a given cluster center \(\pmb {c_l}\) if the example belongs to this cluster, and 0 otherwise:
$$\begin{aligned} \pmb {x_{im+l}} = {\left\{ \begin{array}{ll} \delta (\pmb {x_i},\pmb {c_l}) & \pmb {x_i} \text { belongs to the cluster represented by } \pmb {c_l} \\ 0 & \text {otherwise}. \end{array}\right. } \end{aligned}$$
Distance representation
Since for each example we can evaluate its distance from all cluster representatives, a further extension of the binary distance representation is to let go of the cluster membership altogether and simply encode each cluster as a distance from its representative \(\pmb {c_l}\) to each example \(\pmb {x_i}\): \(\pmb {x_{im+l}} = \delta (\pmb {x_i},\pmb {c_l})\).
Inverse distance squared representation
When assessing the importance of objects, a common approach is to use the inverse square of the distance (e.g., in the k-NN classifier). Although measuring the importance of examples is not our direct goal, it is tempting to check what the effect of using this method in our problem would be. Therefore, the fourth examined way of encoding a cluster is to record the inverse squared distance between its representative \(\pmb {c_l}\) and each example \(\pmb {x_i}\): \(\pmb {x_{im+l}} = 1/\delta (\pmb {x_i},\pmb {c_l})^2\).
Probability representation
Some clustering algorithms, e.g., expectation-maximization or fuzzy clustering methods, estimate a cluster membership probability for each example. Although based on distance, these estimates represent something different and are therefore interesting to include in the analysis, especially since they can be naturally used as new features without any additional processing. For any given example \(\pmb {x_i}\) and cluster representative \(\pmb {c_l}\), each new feature is encoded as: \(\pmb {x_{im+l}} = 1/\sum _{j=1}^{k}{\left( \frac{\delta (\pmb {x_i},\pmb {c_l})}{\delta (\pmb {x_i},\pmb {c_j})}\right) ^2}\), where k is the number of clusters.
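To make the five encodings concrete, the sketch below (not the authors' code) builds all of them with numpy and scikit-learn, assuming Euclidean distance and k-means representatives; the small epsilon that guards against division by zero is our own addition.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def cluster_feature_encodings(X, n_clusters=5, random_state=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    centers, labels = km.cluster_centers_, km.labels_
    D = pairwise_distances(X, centers)        # delta(x_i, c_l), shape (n, k)
    member = np.eye(n_clusters)[labels]       # one-hot cluster membership

    binary = member                           # 1 if x_i belongs to cluster l, else 0
    binary_distance = D * member              # distance to own representative only
    distance = D                              # distance to every representative
    inv_dist_sq = 1.0 / (D ** 2 + 1e-12)      # inverse squared distance
    Dp = D + 1e-12                            # guard against zero distances
    # fuzzy-style membership: 1 / sum_j (d_il / d_ij)^2
    probability = 1.0 / np.sum((Dp[:, :, None] / Dp[:, None, :]) ** 2, axis=2)
    return binary, binary_distance, distance, inv_dist_sq, probability
```

Any of the returned matrices can then serve as the block of new features \(\pmb {x_{im+1}}, \ldots , \pmb {x_{im+k}}\).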
The results of this experiment are presented in Fig. 4. The distance representation stands out as a clear winner, although the probability representation is not far behind, in one case (ionosphere) even prevailing over distance. However, at the individual level, not all results are significant. The ANOVA test was unable to distinguish between the results of the approaches for the breast-cancer-wisconsin dataset. Moreover, the further pairwise t-test was also inconclusive for the ecoli, iris, and image-segmentation datasets. For glass, pima-indians-diabetes, pendigits, spectrometer, and statlog-vehicle, the distance representation was significantly better than all other representations, while on the remaining datasets, it was only rivaled by the probability representation (with the one exception of statlog-satimage, where it was also indistinguishable from the inverse distance squared representation).
Analyzing the results collectively, the superiority of the distance and probability representations is confirmed. The Friedman test and the post hoc Nemenyi test, results of which are illustrated in Fig. 5, reveal that the measured accuracies were not produced by chance and that, indeed, the distance and probability representations are both significantly better than the other representations. Although distance was also on average better than probability, the test revealed that this difference is not significant.
All of the above test results allow us to draw a clear conclusion that the distance representation is a better choice than the alternatives. The reason for its superiority is in most cases also rather straightforward. The binary, binary distance, and distance representations convey different amounts of information, with the last one conveying the most and the first one the least. It is very easy to convert the distance representation into binary distance and the binary distance into binary, but the reverse is impossible. This asymmetry of conversion can be viewed as a lossy data compression, and the experiments simply show that the information lost in this process is valuable. Regarding the inverse distance squared representation, the aim was to check whether looking at the features differently, namely as the importance of a given example w.r.t. a given cluster rather than its simple spatial orientation, would benefit the process. This hypothesis turned out to be wrong; however, we suspect that the importance of examples could still be useful as a way of weighting training examples. The probability representation, although collectively statistically indistinguishable from the distance representation, was ultimately significantly outperformed by it on several datasets, while the reverse was never true. Although we do not see a straightforward reason for this outcome, our educated guess is that, since probability by definition puts an emphasis on cluster membership, it focuses on single clusters, so the information is not as well distributed across all features as in the case of the distance representation. Given the above, we recommend relying on the distance representation as the default way of encoding clusters as new features, and we use this representation in the remainder of the experiments.
Comparison of clustering algorithms
The first major component of the analyzed framework is the clustering algorithm used to form groups of training examples. Although there are many options to choose from, we wanted to focus on two main characteristics: a) how the algorithm deals with the number of clusters and b) what shape of clusters it produces. As a result, we decided to compare three approaches along those axes. The first two are k-means and affinity propagation, which allow us to deal with the first characteristic (as already described in Sect. 4.1). Since these approaches tend to produce spherically shaped clusters, we selected spectral clustering as the third candidate, as it allows for discovering clusters of irregular shapes. Conveniently, all of these approaches produce cluster representatives, so there is no need to determine them separately.
Since the selected algorithms deal differently with the number of clusters, we had to make a small adjustment in the experimental procedure. Affinity propagation determines the number of clusters automatically, and our experiments have shown that it produces significantly more clusters than we would normally select for k-means or spectral clustering using other methods such as the gap statistic [31]. Since each cluster is encoded as a new feature, this poses a problem, namely how to isolate the influence of the clustering algorithm from the influence of the number of features on the classification quality. Here, we only want to measure the impact of the former, while the impact of the latter will be analyzed in Sect. 5.4. Therefore, for each dataset we first perform the experiment using affinity propagation and then use the detected number of clusters as an input for k-means and spectral clustering. The results of this experiment are presented in Fig. 6.
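A hedged sketch of this protocol in scikit-learn is given below; since spectral clustering does not expose explicit representatives in this library, per-cluster means are used here as a stand-in, which is our own simplification rather than the paper's exact choice.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans, SpectralClustering

def representatives_per_algorithm(X_train, random_state=0):
    # Affinity propagation determines the number of clusters automatically.
    ap = AffinityPropagation(random_state=random_state).fit(X_train)
    k = len(ap.cluster_centers_)

    # The detected number of clusters is reused for the other two algorithms.
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_train)
    sc = SpectralClustering(n_clusters=k, assign_labels="kmeans",
                            random_state=random_state).fit(X_train)
    # Spectral clustering yields no centers here, so per-cluster means stand in.
    sc_centers = np.vstack([X_train[sc.labels_ == c].mean(axis=0) for c in range(k)])

    return {"affinity": ap.cluster_centers_,
            "kmeans": km.cluster_centers_,
            "spectral": sc_centers}
```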
Looking at the results, there is not a single case in which one algorithm would produce noticeably better results than the others. This is only further confirmed through the ANOVA tests which were unable to find significant differences between the algorithms on any dataset. Analyzing the results collectively, we were still unable to find any regularities. The Friedman statistical test confirms this observation as it was unable to reject the null hypothesis which states that the differences between the results are the product of random chance.
The above analysis allows us to draw a firm conclusion that the clustering algorithm does not impact the outcome of the discussed process in a significant way, if at all. On the one hand, this is a positive outcome, as it allows us to select any clustering algorithm (e.g., the most efficient one) rather than the one which produces the best clusters. On the other hand, it raises the question of whether the composition of clusters has any impact on the process or does not matter at all. We will address this question with a dedicated experiment in Sect. 5.5.
Clustering per class or global clustering?
The intuition behind the hypothesis stated in this study is that clustering training examples regardless of their class could help generalization through the use of global information. We refer to this approach as global, as it requires a “global look” at the dataset, without artificially dividing it into predefined categories. However, one could argue for an alternative approach in which clustering is performed per class. This way, we still add some global information about the similarity of distant objects, with the additional potential benefit of modeling the space occupied by each class. We refer to this approach as local, as it analyzes the dataset “locally” within each class. To verify which of these approaches is better, we compared them empirically.
To make the comparison meaningful, we have to ensure an equal number of clusters in both approaches, so that the results depend solely on the generated clusters and not on their quantity. This is the same issue we faced when comparing clustering algorithms in Sect. 5.2. To achieve this goal, the experiment is performed as follows. First, we cluster the dataset separately in each class using affinity propagation to automatically determine the number of clusters. Next, we perform the same experiment with the global approach, using k-means with the number of clusters equal to the total number of clusters found in all classes by affinity propagation. This way, both approaches generate the same number of clusters for each dataset, but using a different algorithm. Since in the previous experiment we showed that the selected clustering algorithm does not influence the outcome of the process, we can be confident in using different algorithms (affinity propagation and k-means) for each approach. The results of this experiment are presented in Fig. 7.
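The following sketch illustrates our reading of this setup (illustrative only): affinity propagation run per class yields the local representatives, and the total number of clusters it finds parameterizes the global k-means run.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

def local_and_global_representatives(X_train, y_train, random_state=0):
    # Local: cluster each class separately with affinity propagation.
    local_centers = []
    for cls in np.unique(y_train):
        ap = AffinityPropagation(random_state=random_state).fit(X_train[y_train == cls])
        local_centers.append(ap.cluster_centers_)
    local_centers = np.vstack(local_centers)

    # Global: k-means on the whole training set with the same total number of clusters.
    k_total = local_centers.shape[0]
    km = KMeans(n_clusters=k_total, n_init=10, random_state=random_state).fit(X_train)
    return local_centers, km.cluster_centers_
```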
At first glance, the results do not reveal a clear winner, although the global approach seems to work better in more cases than the local one. Looking at individual cases, the differences become more apparent, with the global approach dominating significantly in several cases. A t-test confirms this observation, indicating a significant difference between the means of the approaches for the wine, glass, vowel-context, iris, sonar.all, image-segmentation, ionosphere, pendigits, and statlog-satimage datasets, among which only one (pendigits) is in favor of the local approach.
Analyzing the results collectively, the Wilcoxon signed ranks test was unable to find a significant difference between these two approaches. We therefore cannot state with full certainty that clustering the whole dataset produces better results than clustering each class separately. That being said, global clustering did significantly outperform local clustering on eight datasets, while the opposite was true in only one case. Moreover, global clustering is also the more reasonable choice, as it ensures a more even spread of the cluster representatives, while local clustering can produce highly overlapping clusters and, as a result, redundant features. Finally, the global approach is also the more intuitive one, as it allows the algorithm to access the hidden structure of the dataset which may be invisible through the lens of the classes. Given the above, even though we cannot statistically state that global clustering works on average better than local clustering, we still deem it a superior choice.
Sensitivity test
So far, in each experiment the number of clusters was determined by affinity propagation and remained unchanged for all tested approaches to isolate the influence of a given component. In this experiment, in turn, we isolate the number of clusters parameter to measure what impact it has on the quality of classification. We do this by executing the experimental procedure with the k-means clustering algorithm for increasing numbers of clusters from 1 to 200. Notice that 200 clusters is an unreasonably large number for certain datasets as some of them do not even have this many examples. This number was picked on purpose to pinpoint a moment at which the added features start leading to model overfitting. To fully achieve this goal, in addition to measuring the test set accuracy we also report training set accuracy to check when exactly the model starts overfitting the data. The results of the experiment are presented in Fig. 8.
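An illustrative version of this sweep is sketched below; it feeds only the distance features to the classifier, and the exact feature set, SVM settings, and evaluation protocol of Fig. 3 may differ from this simplification.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, accuracy_score
from sklearn.svm import SVC

def sensitivity_curve(X_tr, y_tr, X_te, y_te, k_max=200, random_state=0):
    train_acc, test_acc = [], []
    for k in range(1, k_max + 1):
        centers = KMeans(n_clusters=k, n_init=10,
                         random_state=random_state).fit(X_tr).cluster_centers_
        # Distance encoding from Sect. 5.1: one feature per cluster representative.
        F_tr = pairwise_distances(X_tr, centers)
        F_te = pairwise_distances(X_te, centers)
        clf = SVC(kernel="linear").fit(F_tr, y_tr)
        train_acc.append(accuracy_score(y_tr, clf.predict(F_tr)))
        test_acc.append(accuracy_score(y_te, clf.predict(F_te)))
    return train_acc, test_acc
```

Plotting the two returned curves against k reproduces the kind of training/test divergence discussed next.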
It is easy to observe a point of saturation for most datasets, at which training accuracy starts to diverge from test accuracy while the latter reaches a plateau. This is particularly well exemplified by the plots for the glass, pima-indians-diabetes, sonar.all, and spectrometer datasets. On the other hand, for the image-segmentation, optdigits, pendigits, and statlog-satimage datasets this effect is fairly insignificant, although still observable. Since SVM is a reasonably robust algorithm, this effect is not as dramatic as one could expect; however, the interpretation of these results is still very clear: there is a certain number of clusters after which the model starts to get needlessly complex. Interestingly, judging by the vertical lines in each plot, affinity propagation seems to produce a reasonable number of clusters, usually around the point at which the model starts overfitting. The only significant exceptions are the optdigits and pendigits datasets; however, coincidentally, they also exhibit the smallest degree of overfitting, so, apart from the models being needlessly complicated, training accuracy still reasonably approximates test accuracy in these two cases.
The main conclusion from this experiment is that the number of clusters has a significant impact on the quality of predictions. Furthermore, most of the datasets show a predictable link between the number of clusters and classification quality. Up until a certain point, the more clusters we create, the better. Afterward, the model starts to overfit the data and gets needlessly complicated without further improving the quality of predictions. The experiments also demonstrate that the number of clusters produced by affinity propagation is usually a good compromise between how complicated the model is and how well it is able to predict. From our point of view, this is a very important finding as it confirms that our choice of affinity propagation as the source of the number of clusters for all clustering algorithms was appropriate.
Taking into account cluster purity
Since we have already established in our sensitivity experiment that the number of clusters has a clear influence on the quality of classification, let us now check whether the class purity of the clusters themselves makes any difference. In order to do so, we will cluster the datasets into the number of clusters indicated by affinity propagation, encode the clusters as new features, and evaluate the quality of each new feature using the Fisher score, defined as:
$$\begin{aligned} FS(x_{ij}) = \frac{\sum _{k=1}^{c}{n_k(\mu _k^j-\mu ^j)^2}}{(\sigma ^j)^2}, \end{aligned}$$
where c is the number of classes, \(n_k\) is the number of examples in the k-th class, \(\mu ^j_k\) and \(\sigma ^j_k\) are the mean and standard deviation of the k-th class for the j-th feature, while \(\mu ^j\) and \(\sigma ^j\) are the mean and standard deviation of the whole dataset for the j-th feature. Next, we will add the new features one by one in order of their increasing and decreasing quality to observe the effect they have on classification accuracy. The results of this experiment are presented in Fig. 9.
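A minimal sketch of this score, assuming numpy arrays X (examples by features) and y (class labels); the small constant in the denominator is our own safeguard against zero-variance features.

```python
import numpy as np

def fisher_scores(X, y):
    classes, counts = np.unique(y, return_counts=True)
    mu = X.mean(axis=0)                      # overall mean of each feature
    sigma2 = X.var(axis=0) + 1e-12           # overall variance of each feature
    numer = np.zeros(X.shape[1])
    for cls, n_k in zip(classes, counts):
        mu_k = X[y == cls].mean(axis=0)      # per-class mean of each feature
        numer += n_k * (mu_k - mu) ** 2
    return numer / sigma2                    # one score per feature
```

Sorting the new features by descending (or ascending) score then reproduces the two orders in which they are added in this experiment.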
Ultimately, the approach discussed in this study is based on selecting points in m-dimensional space (where m is the number of features), calculating the distances between all data points and the selected points, and encoding these distances as new features. The analyzed plots illustrate that the choice of these points matters and has a clear impact on the quality of classification. In each diagram, the yellow (lighter) line represents classification quality when adding new features according to their descending Fisher score, while the blue (darker) line represents the same in ascending order. The lines necessarily meet at the end, since in both cases all features are eventually used for classification. A direct observation from these plots is that adding new features in order of descending Fisher score usually produces better results than adding them in the reverse order. This leads to a more general and more important observation: some points (clusters) hold more information than others from the classification perspective, which, in turn, seems to indicate that the choice of these points should matter. This effect is displayed most prominently in the plots for the breast-cancer-wisconsin, ionosphere, iris, pima-indians-diabetes, sonar.all, and wine datasets. An odd case is the statlog-satimage dataset, where adding features in ascending order of their discriminative power produces better results most of the time. However, this can be easily explained by, e.g., a single feature which achieved a low Fisher score but was actually crucial from the classification point of view.
Another obvious observation from the plots is that the maximum accuracy is almost always at the end of each plot, i.e., when all features are being used. This suggests that all of the generated features are important (albeit to a varied extent) and contribute valuable information from the classification perspective. This observation also confirms the results from the sensitivity experiment, hinting that the number of clusters selected by affinity propagation indeed strikes a good balance from the under-/overfitting perspective.
In conclusion, the experiment suggests that not all clusters are equally valuable. The results clearly indicate that some clusters yield features of higher purity, and this difference has a noticeable impact on the quality of predictions.
The impact of distance measure
In Sect. 5.1, we have established that encoding clusters as distances from training examples to cluster representatives is the best among the proposed representations. Still, there are various distance measures to choose from. So far, in all of the tested approaches we have been relying on the Euclidean distance, which by its nature produces spherically shaped clusters. However, one often encounters correlations between features which produce non-spherical clusters, to which the Euclidean distance is completely insensitive. Therefore, in this experiment we would like to test just how big an impact this particular characteristic has on classification quality by comparing the Euclidean distance against the Mahalanobis distance. The Mahalanobis distance effectively measures the distance from a given point to a group of points and is capable of capturing correlations between features within that group. Therefore, in our case, its values will be adjusted according to the data distribution in each cluster, allowing us to form non-spherically shaped clusters.
Both measures will be tested using a linear SVM classifier and the k-means clustering algorithm with an equal number of clusters for each dataset, according to the experimental procedure from Fig. 3. However, since k-means inherently relies on the Euclidean distance, we have to alter it slightly for the purpose of this experiment. The modification for the Mahalanobis distance is as follows. The first iteration of the algorithm is carried out in the regular fashion using the Euclidean measure. This allows us to form the first, candidate groups of objects. Each subsequent iteration is carried out using the Mahalanobis distance, evaluated for each object based on its distance from each cluster representative, taking into account the covariance matrix calculated from the objects within a given cluster. Just as in regular k-means, cluster centers are recalculated after each iteration, along with the covariance matrix of the objects in each cluster. The results of this experiment are presented in Fig. 10.
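Below is a simplified sketch of this modified k-means as we read it, not the authors' implementation; the single-pass Euclidean initialization, the fixed number of iterations, and the covariance regularization term reg are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def mahalanobis_kmeans(X, k, n_iter=20, random_state=0, reg=1e-6):
    d = X.shape[1]
    # First iteration: a regular (Euclidean) pass provides candidate groups.
    km = KMeans(n_clusters=k, n_init=10, max_iter=1,
                random_state=random_state).fit(X)
    labels, centers = km.labels_, km.cluster_centers_

    for _ in range(n_iter):
        # Per-cluster inverse covariance, regularized to remain invertible.
        inv_covs = []
        for c in range(k):
            pts = X[labels == c]
            cov = np.cov(pts, rowvar=False) if len(pts) > d else np.eye(d)
            inv_covs.append(np.linalg.inv(cov + reg * np.eye(d)))
        # Reassign each object to the closest representative in Mahalanobis terms.
        dist2 = np.stack([np.einsum('ij,jk,ik->i', X - centers[c], inv_covs[c],
                                    X - centers[c]) for c in range(k)], axis=1)
        labels = dist2.argmin(axis=1)
        # Recompute centers (and, on the next pass, covariances) as in regular k-means.
        centers = np.vstack([X[labels == c].mean(axis=0) if np.any(labels == c)
                             else centers[c] for c in range(k)])
    return labels, centers
```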
The results of the experiment do not reveal a superiority of one measure over the other; however, noticeable differences are observable in many cases. For datasets yeast, vowel-context, glass, iris, sonar.all, optdigits, pendigits, spectrometer, statlog-satimage, and statlog-vehicle, the plots clearly indicate significant differences between the applied measures. The Euclidean distance came out on top in 7 out of 10 of these cases, while the Mahalanobis distance prevailed in the remaining 3. This observation is statistically confirmed for each of these datasets using a t-test. Analyzing the results collectively, the Wilcoxon test was unable to differentiate between the two measures (by a large margin with \(p\text {-value}=0.86\)).
Although from a global perspective the results of this test seem inconclusive, the individual results clearly indicate that the applied measure can have a significant impact on the outcome of the process. The experiment therefore hints that the effectiveness of the measure depends on the dataset. This observation could be somewhat expected, as each dataset can have a different data distribution and different correlations between features. We can therefore speculate that in some cases the Euclidean distance sufficiently maps the analyzed space, while the Mahalanobis distance makes it excessively complicated, and in other cases, the Euclidean distance is unable to track irregularly shaped data patterns which are detectable by the Mahalanobis measure. In any case, obtaining information about the shapes appearing within a dataset is the exact task of clustering, so we naturally cannot know this information in advance. Consequently, since the Euclidean distance proved to be sufficient more often than not, we use it in all of our remaining experiments.
Semi-supervised learning
Since the methodology discussed in this study creates new features regardless of the decision attribute, it is tempting to try it in a semi-supervised setting and compare it against a fully supervised one, as the transition from one to the other is very straightforward. We achieve this by checking whether clustering on both training and testing data produces better features than clustering only on the former. Apart from using both training and testing data for clustering, the methodology and experimental procedure remain unchanged. The only extra modification concerns the number of clusters for each approach, as the additional data may lead to more clusters being detected. Since in Sect. 5.4 we established that the number of clusters has a significant impact on the outcome of the process, we wanted to eliminate this factor from the experiment. Analogously to Sect. 5.3, we achieve this by first clustering each dataset on both the training and testing sets for the semi-supervised approach using affinity propagation and then using the detected number of clusters as an input for the k-means algorithm in the supervised approach. Using two different clustering algorithms is not a problem, since in Sect. 5.2 we established that the effect of this choice is insignificant. The results of the experiment are presented in Fig. 11.
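A compact sketch of the two regimes (illustrative only): affinity propagation on the union of training and test examples fixes the number of clusters, which is then reused by k-means on the training set alone.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

def supervised_and_semisupervised_centers(X_train, X_test, random_state=0):
    # Semi-supervised: cluster training and test examples together (no labels used).
    ap = AffinityPropagation(random_state=random_state).fit(np.vstack([X_train, X_test]))
    k = len(ap.cluster_centers_)
    # Supervised: cluster the training set only, with the same number of clusters.
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_train)
    return km.cluster_centers_, ap.cluster_centers_
```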
The boxplots clearly show that there is virtually no difference in the quality of predictions between the supervised and semi-supervised scenarios. A t-test for every dataset detected significant differences in only two cases (vowel-context and iris), one in favor of the supervised approach and the other in favor of the semi-supervised one. A further Wilcoxon test only confirms this lack of difference on a global scale, as it was unable to differentiate between the approaches by a large margin (\(p\text {-value}=0.9\)).
One would expect the additional information stemming from clustering both the training and testing data to benefit the quality of predictions; however, the results clearly indicate that this is simply not the case. On the one hand, this result seems surprising, as more information should allow us to model the structure of the data more accurately and form better clusters, which are later used to produce new features. On the other hand, the comparison of clustering algorithms from Sect. 5.2 has already shown us that the clusters themselves are not that important. Why is it, then, that more information does not benefit our method? The answer to this paradox could lie in the way we encode clusters as new features, namely as the distance from each data point to each cluster representative. What this result seems to suggest is that it is not about how accurately the cluster representatives are selected but how well they cover the dataset space. If true, this observation would be consistent with the observations from the feature quality and sensitivity experiments, as they clearly suggested that the right number of clusters is more important than their exact positions.
Can clustering-generated features improve classification quality?
In the next two experiments, the main premise of this study is put to the test, namely whether clustering-generated features improve classification quality or not. In this first experiment, we compare the original features against two alternative approaches: relying only on the new features or combining them with the original ones. This comparison is carried out using the linear SVM classifier. The better of the two tested alternatives will be further evaluated in Sect. 5.9 on 7 other classifiers, allowing us to form conclusions regarding the utility of clustering-generated features in classification in general. The results of the first comparison are presented in Fig. 12.
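For clarity, the three compared feature sets can be sketched as follows, assuming the distance encoding of Sect. 5.1 and cluster representatives obtained in the earlier steps; the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def feature_sets(X, cluster_centers):
    original = X                                      # original attributes only
    new = pairwise_distances(X, cluster_centers)      # clustering-generated features only
    combined = np.hstack([original, new])             # both feature sets together
    return original, new, combined
```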
The boxplots reveal many notable differences in performance between the compared approaches. Only in 5 cases did all three feature sets perform comparably, namely for the breast-cancer-wisconsin, ecoli, sonar.all, wine, and yeast datasets. Comparing the original and new features, one can notice substantial variation in the results, and neither of the two seems to work better than the other. However, when combined, they produce an outcome which is more often than not equal to or better than either feature set used separately.
The ANOVA statistical tests reveal significant differences for the glass, vowel-context, iris, image-segmentation, ionosphere, optdigits, pendigits, spectrometer, statlog-satimage, and statlog-vehicle datasets. The pairwise t-tests show that in 6 out of these 10 cases using both feature sets was significantly better than relying only on the original ones, while the opposite was not true in any case. In the same test, using only the new features significantly improved the result exactly as many times as it diminished it, i.e., in 5 cases.
Analyzing the results collectively, the Friedman test was unable to find significant differences between the results. We are therefore forced to conclude that clustering-generated features do not improve classification quality in a general sense. However, given the above discussion, we can clearly see that they can significantly improve classification accuracy in some cases while not diminishing the quality in others. We can also infer that clustering-generated features can be used as an alternative to the original features, albeit with some reservation, as the results vary significantly. Consequently, we recommend using clustering-generated features in conjunction with the original ones, as this way they can improve classification quality without adding unnecessary risk. The only obvious downside is the increased dimensionality of the dataset, which can result in longer training.
Classifiers
So far, we have only been testing our hypotheses on a single, linear SVM classifier. In this final experiment, we want to observe what impact the analyzed process has on other classifiers. Since in the previous experiment we established that augmenting the original features with the new ones compares favorably to relying solely on the new ones, in this experiment we only compare the original features against the augmented features. We evaluate the process on several different linear and nonlinear classifiers, namely: penalized multinomial regression (multinom), penalized discriminant analysis (pda), Bayesian generalized linear model (bayesglm), decision trees (rpart), k-nearest neighbors (knn), SVM with RBF kernel (svmRadial), and random forest (rf). The results are presented in Figs. 13, 14, 15, 16, 17, 18, and 19.
The first thing one notices when analyzing the results is that the differences are very subtle, with the exception of only a few cases. However, some of these differences appear consistently across all repetitions and are therefore significant. In the case of penalized multinomial regression, one can easily identify several datasets for which adding clustering-generated features was significantly beneficial to the quality of predictions, without degrading the quality on other datasets. The same observations hold true for the penalized discriminant analysis classifier. The boxplots for the Bayesian generalized linear model classifier paint a similar picture, although in the case of the image-segmentation dataset the accuracy is noticeably degraded after adding the new features. Decision trees produced similar results, although noticeable quality degradation can now be observed on two datasets: glass and image-segmentation. Adding the clustering-generated features proved detrimental to accuracy in the case of the k-nearest neighbors classifier, where it seems to have helped in only two cases (ionosphere and vowel-context) but harmed the quality on at least 7 datasets. Interestingly, swapping the SVM kernel from linear to RBF dramatically changed the effect of the analyzed process on classification quality. While adding clustering-generated features generally helped with the linear kernel, it seems to have no particular effect in the case of the RBF kernel, except for two datasets (optdigits and pendigits), where it significantly degraded the quality of predictions. The worst result was recorded for the random forest classifier, where the new features did not improve the quality of predictions on any dataset but degraded the accuracy in at least 7 cases.
The t-tests for each dataset on each classifier only confirm the above observations but do not provide any additional insight. The clustering-generated features significantly improved the quality on: 6 datasets for the multinom classifier, 7 datasets for the pda classifier, 5 datasets for the bayesglm classifier, 6 datasets for the rpart classifier, and 2 datasets for the knn classifier. However, they also significantly degraded the quality on: 1 dataset for the rpart classifier, 7 datasets for the knn classifier, 2 datasets for the svmRadial classifier, and 4 datasets for the rf classifier. The additional Wilcoxon signed ranks test was unable to distinguish between the two sets of features for any of the discussed classifiers, which forces us to conclude that, from a global perspective, the discussed process on average does not influence the quality of predictions. However, given the above discussion, we can see that it can have a huge and consistent impact on the quality of predictions, albeit one that depends on the classifier. For instance, it is uncontroversial to say that clustering-generated features seem to be a safe option when dealing with linear models, as adding them did not degrade classification quality on any dataset and was significantly beneficial in many cases. It is also safe to say that the discussed methodology should not be used in conjunction with the knn classifier, as it can significantly harm the quality of predictions.
The results also lead to an interesting observation in terms of datasets as they seem to indicate that some datasets are more easily impacted by adding new features than others. In particular, the vowel-context, optdigits, pendigits, and statlog-satimage datasets produce significantly different results in nearly every case. However, and this is the most intriguing part, this effect swings between significant improvement and significant degradation depending on the classifier. The general trend remains unchanged, i.e., for linear models the datasets benefit when adding new features, while the nonlinear models are more often than not harmed by the addition of new features, but the reason for this pattern remains unclear.
In conclusion, we can state that clustering-generated features can improve classification quality, although not with every classifier. In general, the discussed methodology is beneficial for linear models; however, it should be used with caution with nonlinear learners. Moreover, clustering-generated features are particularly harmful when used with the k-nearest neighbors and random forest classifiers.
Dataset characteristics
Many of the experiments conducted so far indicate that the results are highly dependent on the characteristics of the analyzed data. That is why, in this section, we aim to explore how each dataset characteristic influences the performance of clustering-generated features compared to the original features. Although in Sect. 5.8 we established that augmenting the original features produces, in general, better results than relying solely on the new ones, in this experiment we compare the original features only against the new features, to be able to focus on their properties in isolation. To facilitate a meaningful comparison, we use a dataset generator (from the scikit-learn package [24]) which allows us to compare both approaches in various scenarios by changing several distinct data characteristics in a precisely controlled manner and in isolation from each other. We chose 5 common characteristics to experiment with: the number of features, the number of classes, the number of clusters per class (i.e., the number of distributions the data points are drawn from for each class), the class distribution, and the dataset difficulty (measured by class separation; the higher the separation, the easier the dataset). To better capture the influence of each characteristic, we generate a range of datasets for each parameter by gradually changing its value. We rely on the experimental procedure presented in Fig. 3, but the whole process is executed for each parameter value in a specified range, and each repeat generates a new dataset. Also, due to the additional randomness, we increased the number of repeats to 20. For each characteristic, we selected a wide range of parameter values to better illustrate the whole spectrum of possible outcomes. The parameters in each experiment change as follows, with the remaining parameters set to their defaults:
- number of features: 2, 3, ..., 100 [default: 10]
- number of classes: 2, 3, ..., 50 [default: 2]
- number of clusters per class: 1, 2, ..., 50 [default: 1]
- class distribution: 1:39, 2:38, ..., 19:21, 1:1 [default: 1:1]
- difficulty (class separation): 1.5, 1.4, ..., 0.1, 0 [default: 0.7].
The maximum value of class separation (1.5) was picked experimentally, as higher values produced trivial datasets which did not add any further insight into the analysis. In most of the experiments, we used affinity propagation with an automatically determined number of clusters, with one exception discussed below. The results of this experiment are presented in Fig. 20.
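As a hedged illustration, the difficulty sweep could be generated with scikit-learn's make_classification as sketched below; the sample size and the informative/redundant feature split are our assumptions, while the remaining settings follow the defaults listed above, and the other sweeps vary the corresponding parameter analogously.

```python
from sklearn.datasets import make_classification

for class_sep in [1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7,
                  0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]:
    X, y = make_classification(n_samples=500,             # assumed sample size
                               n_features=10, n_informative=10, n_redundant=0,
                               n_classes=2, n_clusters_per_class=1,
                               weights=None,               # balanced 1:1 classes
                               class_sep=class_sep, random_state=0)
    # ...run the experimental procedure from Fig. 3 on (X, y) and record accuracy
```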
The results reveal some very interesting properties of the analyzed method. Figure 20a clearly shows that the dimensionality of the dataset has a major impact on the quality of clustering-generated features. It also influences the quality of predictions on the original features, but to a much lesser extent. However, this result does not paint the whole picture. Since we chose affinity propagation as the clustering method, the number of clusters was selected automatically. Interestingly, regardless of the number of features, the number of clusters remained relatively stable, between 15 and 25 for each dataset. This gave the clustering-generated features an advantage up to 20 features, while in higher dimensions the advantage was reversed. To address this issue, we re-ran the experiment with k-means and the number of clusters equal to the dimensionality of the dataset. This way, the dimensionality of the clustering-generated features is equal to the dimensionality of the original data. The result presented in Fig. 20b paints a very different picture: the outcome is to some extent reversed. These two results further amplify our main finding that the number of clusters is key to the performance of the clustering-generated features.
Figure 20c, d shows that varying the number of classes and the number of distributions within each class has a major impact on the quality of predictions, albeit not much different from the impact on the original features. These results indicate that the clustering-generated features neither introduce any additional robustness against these factors nor hinder performance with regard to these two class characteristics. A similar conclusion can be drawn with respect to the class distribution, as illustrated in Fig. 20e, with one additional observation: on average, clustering-generated features seem to produce slightly more stable and better results as the class distribution becomes more balanced.
The true highlight of the dataset characteristics analysis is the result of the dataset difficulty experiment presented in Fig. 20f. The plot clearly illustrates that clustering-generated features are much better at predicting classes that are difficult to distinguish than the original features, without hindering performance on easier datasets.
In conclusion, the dataset characteristics analysis reveals two main findings. Firstly, it amplifies the importance of the number of clusters as the key to the performance of clustering-generated features. Secondly, it reveals that clustering-generated features are well suited for dealing with difficult problems, i.e., datasets with difficult to distinguish classes. However, it is very important to note that the second point holds only given that the first point is fulfilled, i.e., there is a sufficient number of clusters.