In this section, we report two experimental studies. The first study compares our proposed semi-random data partitioning approach with the stratified sampling approach. The second study validates the effectiveness of the semi-random data partitioning strategy involved in Level 2 of the multi-granularity framework proposed in Sect. 3. In particular, we compare the semi-random data partitioning strategy with traditional random data partitioning, in terms of the class frequency distribution within the training and test sets, as well as the influence of this distribution on classification performance.
The experimental studies are conducted using 12 UCI data sets (Lichman 2013), whose characteristics are shown in Table 3. In terms of class frequency distribution, all the chosen data sets are either balanced or slightly imbalanced, except for the ‘anneal’ and ‘autos’ data sets. By using both balanced and slightly imbalanced data sets, we aim to show that it is necessary to keep the balance level of both the training and test sets as close as possible to that of the original data set, in order to avoid any impact on the learning performance of the algorithms and on the classification performance of the learned classifiers. The imbalanced data sets, i.e., ‘anneal’ and ‘autos’, as well as the balanced ‘segment’ data set, have a larger number of classes, while the other nine data sets have two or three classes. This also allows us to analyze the results in terms of the number of classes.
Three popular machine learning algorithms, i.e., the C4.5 decision tree learning algorithm (Quinlan 1993), Naive Bayes (Rish 2001), and K-nearest neighbour (Liu et al. 2016), are used for validation, since all three are sensitive to class imbalance (Longadge et al. 2013).
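To make the setup concrete, the sketch below shows how such an evaluation could be run in Python with scikit-learn. It is illustrative only: the actual Level-2 semi-random procedure is the one defined in Sect. 3, and here we simply assume it amounts to drawing instances at random within each class so that every class is split at the overall train/test ratio. DecisionTreeClassifier (a CART implementation), GaussianNB, and KNeighborsClassifier serve as stand-ins for C4.5, Naive Bayes, and K-NN, and the 70/30 ratio and k = 3 are arbitrary choices.

```python
import random
from collections import defaultdict

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def semi_random_split(labels, train_ratio=0.7, seed=42):
    """Hypothetical sketch of the semi-random split: instances are drawn
    at random *within* each class, so every class is split at train_ratio
    and the original class proportions are preserved in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                     # randomness stays within the class
        cut = round(len(idxs) * train_ratio)  # per-class quota is fixed
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx


X, y = load_iris(return_X_y=True)  # 'iris' is one of the 12 data sets
train_idx, test_idx = semi_random_split(y)

for model in (DecisionTreeClassifier(random_state=0),
              GaussianNB(),
              KNeighborsClassifier(n_neighbors=3)):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    prec, rec, _, _ = precision_recall_fscore_support(y[test_idx], pred,
                                                      zero_division=0)
    print(type(model).__name__,
          "accuracy:", round(accuracy_score(y[test_idx], pred), 3),
          "precision:", prec.round(3), "recall:", rec.round(3))
```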
Regarding the first experimental study, the results are shown in Tables 4, 5, and 6. In these three tables, SS stands for stratified sampling and SR stands for semi-random partitioning.
Table 4 shows that the proposed semi-random partitioning outperforms stratified sampling in 9 out of 12 cases, and the two approaches perform the same in the other 3 cases, in terms of overall accuracy of classification. In addition, the proposed semi-random partitioning outperforms stratified sampling in terms of precision and recall with respect to each single class in most cases.
Table 5 shows that the proposed semi-random partitioning outperforms stratified sampling in 9 out of 12 cases, and the two approaches perform the same in 2 out of the other 3 cases, in terms of overall accuracy of classification. In addition, the proposed semi-random partitioning outperforms stratified sampling in terms of precision and recall with respect to each single class in most cases.
Table 6 shows that the proposed semi-random partitioning outperforms stratified sampling in 7 out of 12 cases, and the two approaches perform the same in 3 out of the other 5 cases, in terms of overall accuracy of classification. In addition, the proposed semi-random partitioning outperforms stratified sampling in terms of precision and recall with respect to each single class in most cases.
Regarding the second experimental study, Table 7 displays the original distribution of instances across classes for each data set, in terms of frequencies (designated by #) and percentages (designated by %). For example, the anneal data set (first row in Table 7) has 6 classes, and in the original distribution, class 1 has 8 instances (representing 1% of all instances), class 2 has 99 instances (representing 11% of the data), and so on. The same information is also displayed for the training and test sets used with the semi-random partitioning approach. The percentages have been rounded to integers for ease of comparison. Due to the loss of precision caused by this rounding, the sum across all classes may not be precisely 100%. In addition, when the number of instances is low, a small difference in the number of instances may lead to a much bigger difference in the percentage values.
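The per-class counts and rounded percentages reported in Table 7 are straightforward to compute; the sketch below (with hypothetical labels) also reproduces the rounding caveat, i.e., integer percentages that sum to less than 100%.

```python
from collections import Counter


def class_distribution(labels):
    """Per-class frequency (#) and percentage (%), with percentages
    rounded to integers as in Table 7."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: (n, round(100 * n / total)) for c, n in sorted(counts.items())}


# Hypothetical balanced three-class data set: each class is 33.3%,
# so the rounded percentages sum to 99% rather than 100%.
dist = class_distribution([1] * 10 + [2] * 10 + [3] * 10)
print(dist)                                           # {1: (10, 33), 2: (10, 33), 3: (10, 33)}
print("sum of %:", sum(p for _, p in dist.values()))  # 99
```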
Tables 8, 9, and 10 show the original distribution, as well as the distribution within the training and test sets, for C4.5, NB, and K-NN, respectively. The original distribution is included in all tables for ease of comparison.
Table 11 C4.5 performance on accuracy, precision, and recall
The random selection of data for the training and test sets has different effects on the distribution of instances across classes within these sets, which are outlined below (a small simulation illustrating this distribution drift is sketched after the list):
- For initially balanced data sets such as ‘iris’, ‘segment’, and ‘tae’, the random partitioning may lead to a loss of balance within the training and test sets. This loss can be observed for C4.5 on the ‘iris’ and ‘tae’ data sets, while for the ‘segment’ data set, the variation is smaller. Similarly, for NB, the loss of balance can be noticed for the ‘iris’ and ‘tae’ data sets, while for the ‘segment’ data set, the variation is smaller, but more noticeable than for C4.5. For K-NN, a loss of balance can be observed for the ‘tae’ data set, while for the ‘iris’ data set, the imbalance is very small, and for the ‘segment’ data set, the variation is small and similar to that for C4.5.
- For slightly imbalanced data sets, the random partitioning may lead to a more balanced distribution in the training set, but a more imbalanced one in the test set: for C4.5, ‘heart-statlog’; for NB, ‘labor’ and ‘vote’; for K-NN, ‘credit-a’, ‘labor’, and ‘sonar’. Sometimes, the imbalance in the test set may mean that the majority class from the training set becomes a minority class in the test set. This occurs only for one data set, i.e., ‘sonar’ with K-NN, probably because the distribution in this data set is very close to perfect balance (47:53).
- For slightly imbalanced data sets, the random partitioning may lead to a more balanced distribution in the test set, but a more imbalanced distribution in the training set: for C4.5, ‘kr-vs-kp’ and ‘labor’; for NB, ‘heart-statlog’. For two of these, C4.5 with ‘kr-vs-kp’ and NB with ‘heart-statlog’, the majority class in the test set is reversed in comparison with the training set.
- For slightly imbalanced data sets, the random partitioning may cause both the training and test sets to become more imbalanced, with a different class being the majority class in the training and test sets. For example, in the ‘sonar’ data set with C4.5, class 2 is the majority class in the training set, while class 1 is the majority class in the test set. This situation occurs on the ‘sonar’ data set for C4.5 and NB, and on the ‘wine’ data set for all three algorithms (C4.5, NB, and K-NN).
- For the data sets with a high number of classes and an imbalanced distribution, i.e., ‘anneal’ and ‘autos’, the random partitioning may preserve the original distribution for some classes, while for others it introduces an imbalance in the training set, the test set, or both; this happens for the ‘autos’ data set with all three algorithms (C4.5, NB, and K-NN). Sometimes, the majority class in the training set is no longer the majority class in the test set; e.g., for C4.5 on ‘autos’, class 5 is the majority class in the training set, while class 4 is the majority class in the test set (as well as in the original data set). For the ‘anneal’ data set, the distribution changes slightly, but most of the changes are less than 2%; for this reason, we consider the distribution for this data set, with all algorithms, to be very similar to the original distribution.
- For all kinds of data sets, the random partitioning may also lead to a distribution in the training and test sets that is very similar to the one in the original data set: for C4.5, ‘anneal’, ‘credit-a’, and ‘vote’; for NB, ‘anneal’, ‘credit-a’, and ‘kr-vs-kp’; for K-NN, ‘anneal’, ‘heart-statlog’, ‘kr-vs-kp’, and ‘vote’.
Table 11 shows the experimental results for the C4.5 algorithm with random (R) and semi-random (SR) partitioning, which include the accuracy (last column), as well as precision and recall per class.
In terms of accuracy, the results show three situations: (a) the semi-random partitioning leads to the same accuracy as random partitioning, i.e., ‘anneal’, ‘kr-vs-kp’, and ‘segment’; (b) the semi-random partitioning displays small improvements in accuracy (up to 3%), i.e., ‘autos’, ‘credit-a’, ‘heart-statlog’, ‘sonar’, ‘tae’, and ‘vote’; (c) the semi-random partitioning displays large improvements in accuracy (5% or more), i.e., ‘iris’ (7%), ‘labor’ (23%), and ‘wine’ (5%).
Figures 3, 4, 5 display the class distribution, as well as the precision and recall for the experiments with C4.5 on all data sets (4 per graph). The class distribution for the whole data set is represented by the middle bar for every class; the distribution into the training and test sets for random partitioning is represented by the bars on the left, while the ones for semi-random partitioning are represented by the bars on the right. The lines with the square points represent the values for precision—yellow for random partitioning and brown for semi-random partitioning; the lines with the triangle points represent the values for recall—blue for random partitioning and green for semi-random partitioning. The left axis on the graphs represents the number of instances (or class frequency), while the right axis represents the values for precision and recall, with a range from 0 to 1.
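For readers who wish to reproduce this layout, the sketch below builds a simplified version of it in matplotlib, with entirely hypothetical frequencies and precision/recall values: grouped bars for class frequencies against the left axis, and square/triangle marker lines for precision and recall against the right axis.

```python
import numpy as np
import matplotlib.pyplot as plt

classes = ["class 1", "class 2"]
x = np.arange(len(classes))
fig, ax_freq = plt.subplots()
# Grouped bars: random split (left), original distribution (middle),
# semi-random split (right); all frequencies are made up.
ax_freq.bar(x - 0.25, [40, 28], width=0.25, label="random (train)")
ax_freq.bar(x,        [50, 40], width=0.25, label="original")
ax_freq.bar(x + 0.25, [35, 28], width=0.25, label="semi-random (train)")
ax_freq.set_ylabel("class frequency")
ax_freq.set_xticks(x)
ax_freq.set_xticklabels(classes)
# Right axis: precision (squares) and recall (triangles), range 0 to 1.
ax_perf = ax_freq.twinx()
ax_perf.plot(x, [0.75, 0.65], "s-", color="gold",  label="precision (random)")
ax_perf.plot(x, [0.80, 0.70], "s-", color="brown", label="precision (semi-random)")
ax_perf.plot(x, [0.78, 0.68], "^-", color="blue",  label="recall (random)")
ax_perf.plot(x, [0.85, 0.75], "^-", color="green", label="recall (semi-random)")
ax_perf.set_ylim(0, 1)
ax_perf.set_ylabel("precision / recall")
fig.legend(loc="upper center", ncol=2, fontsize="small")
plt.show()
```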
For the data sets where the accuracy is the same for both random and semi-random partitioning, i.e., ‘anneal’, ‘kr-vs-kp’, and ‘segment’, the class distribution (see Table 8; Figs. 3, 4) is very similar for both approaches. For the ‘kr-vs-kp’ data set, although the test set is more balanced (and the training set more imbalanced) compared with the original distribution, the change is very small, especially for the training set, where it is only 1%. For this data set, we also observed that the majority class in the training set becomes the minority class in the test set; the difference, however, is very small, i.e., 2%. Given the large size of this data set and the only slight imbalance in its class distribution, it is not surprising that such a small change in distribution does not impact the results.
For the data sets where the accuracy is slightly higher when semi-random partitioning is used, i.e., ‘autos’, ‘credit-a’, ‘heart-statlog’, ‘sonar’, ‘tae’, and ‘vote’, the random partitioning has different effects on the class distribution within the training and test sets.
For the ‘autos’ data set, we notice several situations for different classes:
(a) for class 2, all instances are assigned to the training set; thus, while the model learned something about this class, nothing is tested and, consequently, the performance for this class is 0;
(b) for class 3 and class 6, random partitioning leads to proportionately more instances in the training set than semi-random partitioning; for these, the performance is higher with the random partitioning, which could be explained by more opportunities for learning with the random partitioning and/or by a lack of sample representativeness with the semi-random partitioning; this will be discussed in more detail further on;
(c) for class 4 and class 7, the opposite situation occurs, i.e., random partitioning leads to proportionally fewer instances in the training set than semi-random partitioning; for these, the performance is higher with the semi-random partitioning; similarly, this could be due to a lack of learning opportunities with the random partitioning and/or to sample representativeness with the semi-random partitioning;
(d) finally, for class 5, there are proportionally more instances in the training set with random partitioning than with the semi-random one; in addition, this is the majority class in the test set (while class 4 is the majority one in the training set). For this class, the precision value is higher with semi-random partitioning, while recall is higher with the random partitioning. Precision reflects how many of the instances labeled as class 5 by the model are truly class 5 (as opposed to other classes), while recall reflects how many of all the class 5 instances are correctly identified as class 5. Thus, a low precision indicates that instances of other classes are wrongly labeled as class 5, while a low recall indicates that the model has not learned sufficiently how to identify class 5 (due to either not enough opportunities for learning or overfitting). A possible explanation for the higher recall with random partitioning is that the higher number of class 5 instances in the training set leads to a model that has learned "better" how to recognize a class 5 instance, while the opposite effect occurs for the semi-random partitioning. The better precision for semi-random partitioning could be explained by the better balance of the distribution between classes, which leads to a model that can better distinguish a class 5 instance from instances of other classes.
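To make the precision/recall distinction in point (d) concrete, the following toy example (with hypothetical labels, not taken from the ‘autos’ results) computes both metrics for a single class using scikit-learn.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical predictions for an 8-instance test set. Two class-5 instances
# are missed (labeled 4), which lowers recall; one class-4 instance is
# labeled 5, which lowers precision.
y_true = [5, 5, 5, 5, 4, 4, 4, 3]
y_pred = [5, 5, 4, 4, 5, 4, 4, 3]
p5 = precision_score(y_true, y_pred, labels=[5], average=None)[0]  # 2/3
r5 = recall_score(y_true, y_pred, labels=[5], average=None)[0]     # 2/4
print(f"class 5: precision={p5:.2f}, recall={r5:.2f}")
```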
For the ‘credit-a’ and ‘vote’ data sets, the class distribution is very similar for random and semi-random partitioning; in this case, the difference is likely to be due to sample representativeness. For the ‘heart-statlog’, ‘sonar’, and ‘tae’ data sets, the class distribution changes for the majority of classes when using random partitioning, which has a mixed effect on the results for different classes, i.e., precision and/or recall are sometimes higher for semi-random partitioning and sometimes higher for random partitioning. In addition, even when the distribution is similar for random and semi-random partitioning, e.g., class 1 of the ‘tae’ data set, the results are different, which may be due to sample representativeness.
For the data sets where the accuracy is considerably higher when using semi-random partitioning, i.e., ‘iris’ (7%), ‘labor’ (23%), and ‘wine’ (5%), we notice that the random partitioning leads to class distribution imbalance in the training sets, and sometimes in the test sets as well (i.e., ‘iris’ and ‘wine’). The difference in results is likely to be due to the class imbalance issue, as well as to sample representativeness (e.g., class 1 of the ‘wine’ data set has a similar distribution for both random and semi-random partitioning, but different precision results).
Table 12 displays the experimental results for Naive Bayes (NB), including recall and precision per class, as well as accuracy, for random (R) and semi-random (SR) partitioning. Figures 6, 7, 8 display the precision and recall results, as well as the class distribution, with the same structure as the previous graphs (with the C4.5 results).
When looking at accuracy, the results for NB show four situations: (a) the semi-random partitioning has lower accuracy than the random one, i.e., ‘segment’ and ‘wine’; (b) the accuracy is the same for both types of partitioning, i.e., ‘sonar’; (c) the accuracy for semi-random partitioning is slightly higher than for the random one (up to 4%), i.e., ‘anneal’, ‘autos’, ‘credit-a’, ‘iris’, ‘kr-vs-kp’, and ‘vote’; (d) the accuracy for semi-random partitioning is considerably higher (5% or more), i.e., ‘heart-statlog’ (5%), ‘labor’ (12%), and ‘tae’ (18%).
Table 12 NB performance on accuracy, precision, and recall
For the data sets displaying lower accuracy for the semi-random partitioning, i.e., ‘segment’ and ‘wine’, the difference in accuracy compared with random partitioning is very small, i.e., 1% for ‘segment’ and 2% for ‘wine’. For the ‘segment’ data set, there is a small change in the class distribution with random partitioning; for the classes where the change results in more instances in the training set, the recall values are higher, while for the classes where the change results in more instances in the test set, the precision is higher; for the classes where there is little or no change, the difference in results may be due to sample representativeness. For the ‘wine’ data set, the random partitioning results in a more balanced distribution across classes in the training set, which may explain the better performance.
The accuracy for random and semi-random partitioning is the same on the ‘sonar’ data set, for which the random partitioning leads to a more imbalanced training set, with the same effect as above, i.e., when there are more instances in the training set, the recall is higher, while when there are more instances in the test set, the precision is higher.
For six data sets, i.e., ‘anneal’, ‘autos’, ‘credit-a’, ‘iris’, ‘kr-vs-kp’, and ‘vote’, the semi-random partitioning has up to 4% better accuracy than random partitioning. For the ‘anneal’, ‘credit-a’, and ‘kr-vs-kp’ data sets, the class distribution is very similar for random and semi-random partitioning; thus, the small difference is likely to be due to sample representativeness. For the ‘autos’, ‘iris’, and ‘vote’ data sets, the random partitioning leads to more class imbalance, which may affect the results.
The accuracy for the semi-random partitioning is considerably higher than for the random one on three data sets, i.e., ‘heart-statlog’ (5%), ‘labor’ (12%), and ‘tae’ (18%). For the ‘heart-statlog’ and ‘tae’ data sets, the random partitioning leads to higher class imbalance in the training set, which may explain the results. For the ‘labor’ data set, the random partitioning leads to a better balance within the training set, but lower results than the semi-random partitioning, which matches the original distribution; we believe that sample representativeness plays a big role in this situation and will investigate it in future work.
Table 13 K-NN performance on accuracy, precision, and recall
Table 13 and Figs. 9, 10, and 11 display the results for the experiments with the K-nearest neighbour (K-NN) algorithm.
Similar to the results for Naive Bayes, we have four situations: (a) the accuracy for semi-random partitioning is slightly lower than for random partitioning, i.e., ‘wine’ (2%); (b) the two ways of partitioning have the same accuracy, i.e., the ‘anneal’ and ‘labor’ data sets; (c) the semi-random partitioning has slightly better accuracy (up to 3%), i.e., ‘credit-a’, ‘iris’, ‘kr-vs-kp’, ‘segment’, ‘sonar’, and ‘vote’; (d) the accuracy is considerably higher (5% or more) for the semi-random partitioning, i.e., ‘autos’ (6%), ‘heart-statlog’, and ‘tae’.
For the ‘wine’ data set, on which the random partitioning leads to 2% better accuracy, the random partitioning results in a higher number of training instances for classes 2 and 3, which have the same or higher recall compared with semi-random partitioning. For class 1, there are more instances in the test set for the random partitioning, which yields a higher precision than semi-random partitioning.
For the ‘anneal’ data set, the class distribution is similar for random and semi-random partitioning, thus justifying the similar performance. For the ‘labor’ data set, the random partitioning leads to a better balance in the training set and a performance similar to that of the semi-random partitioning. This better class balance also occurred for the NB algorithm; however, the results were worse. The different results for the K-NN algorithm support our hypothesis that sample representativeness plays an important role in explaining these results.
When the semi-random partitioning leads to slight improvements in accuracy, i.e., ‘credit-a’, ‘iris’, ‘kr-vs-kp’, ‘segment’, ‘sonar’, and ‘vote’, we notice similar patterns: (1) for similar distributions, i.e., ‘kr-vs-kp’ and ‘vote’, the difference is likely to be due to sample representativeness; (2) when the random partitioning leads to changes in the class distribution, an increase in the number of instances in the training set is associated with an increase in recall, while an increase in the number of instances in the test set is associated with an increase in precision.
For the data sets with considerably higher accuracy for semi-random partitioning, there are two situations: (a) the class distribution is the same, i.e., ‘heart-statlog’; consequently, the difference in results is probably due to sample representativeness; (b) the random partitioning leads to higher imbalance for some classes, which, together with sample representativeness, explains the results, i.e., ‘autos’ and ‘tae’.
To summarise, we noticed that the distribution of classes within the training and test sets has an effect on the performance results. In particular, there is an association between a larger number of instances in the training set and a higher recall, and between a larger number of instances in the test set and a higher precision. A higher number of instances in the training set can mean more opportunities for learning and, thus, better knowledge of a particular class, which explains the higher recall. For good performance, however, recall needs to be balanced with precision, i.e., the model needs to be able to distinguish a particular class from the other classes; in other words, a low precision for a class means that instances of other classes are wrongly labeled as that class. This is more likely to be influenced by the distribution among classes than by the distribution of a class between the training and the test set, as the balance between classes in the training set influences the capacity to learn to distinguish between classes (which is why class imbalance is known to lead to poor performance). This is supported by the fact that the semi-random partitioning results are more balanced in terms of precision and recall, while random partitioning with an imbalanced class distribution in the training set, as well as an imbalance across the training and test sets, tends to produce one of two combinations: (a) high precision and low recall, or (b) low precision and high recall.
The results also indicate that the class distribution within the training set has more influence on the performance than the class distribution within the test set. On the other hand, the distribution within the test set still requires consideration to accurately assess the performance of a model. For example, a small test sample may not sufficiently test the knowledge learned for a particular class—in an extreme situation, it may mean that knowledge is not tested at all. These aspects can be easily controlled with our proposed partitioning method.
Overall, the experimental results indicate that adopting the semi-random data partitioning strategy involved in Level 2 of the multi-granularity framework proposed in Sect. 3 achieves effective control over the selection of training/test instances, avoiding class imbalance in both the training and test sets, especially when data sets are originally balanced or only slightly imbalanced.
Our results also showed situations where the random and semi-random partitioning led to the same distribution, but different results. We believe that these are likely to be explained by sample representativeness issues, which we will address in future work with experiments on Level 3 of the proposed multi-granularity framework.