1 Introduction

Machine learning is a branch of artificial intelligence that is increasingly used in the big data era for the purposes of knowledge discovery and predictive modelling. The former purpose generally means that a model is learned from data and some previously unknown patterns can be extracted from the model (Liu et al. 2016). The latter purpose means that a model is learned from data and the model is then used to make predictions on new data instances. For both knowledge discovery and predictive modelling, it is essential to partition a data set into a training set and a test set (Liu et al. 2016). In particular, for the purpose of knowledge discovery, the training set is used for a machine learning algorithm to discover any new patterns, and the test set is then used to validate the degree to which the patterns truly exist and are trustworthy. In contrast, for the purpose of predictive modelling, the training set is used for a machine learning algorithm to build a model, and the test set is then used to evaluate the predictive accuracy of the model.

In the context of partitioning a data set into a training set and a test set, it is critical to decide effectively which part of the data set is selected as the training set and which part as the test set (Liu et al. 2017). In traditional machine learning, it is normal practice for researchers and practitioners to partition the data in a fully random way. This way of partitioning, however, leads to two major issues: (a) class imbalance and (b) sample representativeness.

The first issue of class imbalance (Longadge et al. 2013; Ali et al. 2015) is known to affect the performance of many classifiers (Sotiropoulos and Tsihrintzis 2017). Randomly partitioning the data, however, can lead to class imbalance in the training and test sets, even when there is no imbalance in the overall data set. For example, let us consider a 2-class (e.g., positive class and negative class) data set with a balanced distribution of instances across classes, i.e., 50% of the instances belong to the positive class and 50% to the negative class. When the data set is partitioned by selecting training/test instances randomly, the class balance of the data set is likely to be broken, leading, for example, to more than 50% of the training instances belonging to the positive class and more than 50% of the test instances belonging to the negative class, i.e., the training set has more positive instances than negative ones, while the test set has the opposite situation. This effect is easy to reproduce, as shown below.
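The following minimal simulation (ours, for illustration only; the class sizes and the 70:30 ratio are arbitrary choices) shows how a fully random split drifts away from a perfect 50:50 class balance:

```python
import random

random.seed(0)

labels = ["pos"] * 500 + ["neg"] * 500     # perfectly balanced data set, m = 1000
random.shuffle(labels)                      # fully random partitioning
train, test = labels[:700], labels[700:]   # 70:30 split

for name, part in (("training", train), ("test", test)):
    share = part.count("pos") / len(part)
    # The proportions typically drift away from the original 50:50 split.
    print(f"{name} set: {share:.1%} positive, {1 - share:.1%} negative")
```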

The second issue concerns sample representativeness: random partitioning may lead to high dissimilarity between training and test instances. In the context of student learning, the training instances are like revision questions and the test instances are like exam questions. To test the performance of student learning effectively, the revision questions should be representative of the learning content covered in the exam questions. The random partitioning of data, however, can result in training instances that are dissimilar to the test instances, which corresponds to the situation where students are tested on what they have not yet learned. Such a situation leads not only to poor performance, but also to a poor judgment of the learner's capability. Thus, in the context of predictive modelling, some algorithms may be judged as unsuitable for a particular problem due to poor performance, when in reality the poor results are due not to the algorithm, but to the representativeness of the training sample.

To address the two issues mentioned above, we propose, in this paper, a semi-random data partitioning framework in the setting of granular computing, towards effective selection of training and test instances. In particular, we focus on dealing with the class imbalance issue and provide a brief proposal towards dealing effectively with the sample representativeness issue.

The rest of this paper is organized as follows: Sect. 2 provides theoretical preliminaries on data partitioning and granular computing concepts. In Sect. 3, we present a multi-granularity data partitioning framework for controlling effectively the partitioning of data into a training set and a test set, towards overcoming the class imbalance and sample representativeness issues. In Sect. 4, we report an experimental study on controlling the class balance of the training and test sets; the results are discussed critically and comparatively. In Sect. 5, we highlight the contributions of this paper and provide further directions towards dealing effectively with the issue of sample representativeness, as well as how to use our framework to change the class balance in the training set for highly imbalanced data sets to further address poor performance due to class imbalance.

2 Theoretical preliminaries

In this section, we provide theoretical preliminaries on data partitioning and granular computing. In particular, we describe two ways of machine learning experimentation through data partitioning, namely cross-validation and partitioning into training/test sets. In addition, we describe the concepts of information granules and information granularity which are used in the proposed framework described in Sect. 3.

2.1 Data partitioning

In machine learning, there are several ways of data partitioning for experimentation. The most popular ways are typically referred to as training/test partitioning or cross-validation (Kohavi 1995; Geisser 1993; Devijver 1982).

The training/test partitioning typically involves partitioning the data into a training set and a test set in a specific ratio, e.g., 70% of the data are used as the training set and 30% as the test set. This partitioning can be done randomly or in a fixed way (e.g., the first 70% of the instances in the data set are assigned to the training set and the rest to the test set). The fixed way is typically avoided (except when order matters), as it may introduce systematic differences between the training set and the test set, which leads to sample representativeness issues. To avoid such systematic differences, the random assignment of instances into training and test sets is typically used; both strategies are sketched below.
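For concreteness, both assignment strategies can be expressed with scikit-learn's train_test_split (the library choice is ours, not prescribed by the paper; load_data is a hypothetical loader returning a feature matrix and labels):

```python
from sklearn.model_selection import train_test_split

X, y = load_data()  # hypothetical loader returning a feature matrix and labels

# Random assignment (the usual choice):
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fixed assignment (first 70% training, remaining 30% test); typically avoided:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)
```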

Cross-validation is conducted by partitioning a data set into n folds (or subsets), followed by an iterative process of combining the folds into different training and test sets. For n folds, there will be n iterations, where at each iteration, one of the folds is used as the test set, while the others, i.e., \(n-1\) folds, are used as the training set. In other words, each of the n folds is, in turn, used as the test set at one of the n iterations, while the rest of the folds are combined together as the training set. In laboratory research, tenfold cross-validation is a popular practice, i.e., the original data set is partitioned into ten subsets. Cross-validation is generally more expensive in terms of computational cost than training/test partitioning.
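As an illustrative sketch of this iterative process (scikit-learn is again our assumed library; make_classifier stands for any hypothetical learner factory, and X, y are NumPy arrays from the previous sketch), a tenfold cross-validation loop looks as follows:

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):   # each fold is the test set exactly once
    model = make_classifier()             # hypothetical learner factory
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores))                    # average accuracy over the 10 folds
```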

Some new perspectives on the two above ways of data partitioning for machine learning experimentation were identified in Liu et al. (2017). In particular, cross-validation is considered an effective measure of the learnability of an algorithm, i.e., the degree to which the algorithm is suitable for learning a high-quality model from the given training data. This enables the appropriate employment of suitable learning algorithms towards producing predictive models on the basis of existing data. Partitioning a data set into a training set and a test set, on the other hand, is typically aimed at learning a model that covers highly complete patterns from the training data and at evaluating the model's accuracy using highly similar but different instances from the test data. This is to make sure that the model accuracy is evaluated in a trustworthy way using a suitable test set. Section 3 will present a proposed approach for more effective partitioning of data into a training set and a test set.

2.2 Granular computing

Granular computing has been an increasingly popular approach for in-depth processing of information. It is aimed at structural thinking at the philosophical level, as well as at structural problem solving at the practical level (Yao 2005). In general, granular computing involves two operations, namely, granulation and organization. The former means to decompose a whole into several parts, whereas the latter means to integrate several parts into a whole. From a computer science perspective, granulation corresponds to the top-down approach and organization to the bottom-up approach. The nature of granular computing involves two commonly used concepts, namely, granule and granularity.

In the context of information granules, a granule is defined as “a small particle, especially, one of numerous particles forming a larger unit”, according to the Merriam-Webster Dictionary (Merriam-Webster 2016). In practice, there are various examples of granules in broad application areas.

In the setting of set theory, a set of any formalism can be viewed as a granule, since a set is a collection of elements. In this context, each element is viewed as a particle. Different formalisms of sets include deterministic sets (Liu et al. 2016), probabilistic sets (Liu et al. 2016), fuzzy sets (Zadeh 2015), and rough sets (Pedrycz 2011).

In the area of computer science, a granule can act as a class due to the fact that a class is a group of objects which are highly similar to each other. An object can also be viewed as a granule, since each object involves a number of attributes, each of which is considered as a particle. Moreover, a granule can also act as a cluster due to the fact that clustering is another way of grouping objects.

In the area of natural languages, a document can be organized in different forms of text units, such as chapters, sections, paragraphs, sentences, and words. In this context, each form of text unit can be viewed as a special type of granule. Moreover, each word is viewed as the finest granule, since a word consists of letters, each of which is viewed as a particle (Liu and Cocea 2017b).

The concept of information granules is also popularly involved in other application areas, such as image processing, machine learning, and rule-based systems. More details on information granules can be found in Pedrycz (2011), and Pedrycz and Chen (2011, 2015a, b).

In the context of information granularity, information granules can be located at different levels of granularity. In set theory, a set S may have several subsets (\(S_1, S_2,\ldots , S_n\)) and each subset may also have several subsubsets (\(S_{1.1}, S_{1.2},\ldots ,S_{1.m},\ldots ,S_{n.1}, S_{n.2},\ldots ,S_{n.m}\)). In this context, the set S is a granule at the top level of granularity, the subsets (\(S_1, S_2,\ldots ,S_n\)) are at the middle level of granularity, and the subsubsets (\(S_{1.1}, S_{1.2},\ldots ,S_{1.m},\ldots ,S_{n.1}, S_{n.2},\ldots ,S_{n.m}\)) are at the bottom level of granularity. In computer science, a class can be specialized into several subclasses through information granulation. In addition, subclasses can be generalized into a super class through information organization.

Fig. 1 Fuzzy information granulation for text processing (Liu and Cocea 2017b)

In natural language processing, a document can be managed in a granular structure, as illustrated in Fig. 1. In particular, the complexity of a text instance (granule) can be reduced through top-down decomposition (granulation) to enable text units (granules) in different levels of granularity (such as paragraphs, sentences, and words) to be processed separately. In addition, the outcomes for processing text units in the same level of granularity can be combined through bottom-up aggregation (organization) towards deriving the outcome for processing larger text units in a higher level of granularity.

In real applications, techniques of granular computing have been involved very often in other popular areas, such as artificial intelligence (Wilke and Portmann 2016; Yao 2005; Skowron et al. 2016), computational intelligence  (Dubois and Prade 2016; Yao 2005; Kreinovich 2016; Livi and Sadeghian 2016), and machine learning (Min and Xu 2016; Peters and Weber 2016; Liu and Cocea 2017a; Antonelli et al. 2016).

Furthermore, ensemble learning is also a subject that involves applications of granular computing concepts (Liu and Cocea 2017a). In particular, ensemble learning approaches, such as Bagging, involve information granulation, through decomposing a training set into a number of overlapping samples, and organization, through combining the predictions made by the different classifiers when classifying a test instance; a similar perspective has also been stressed and discussed in Hu and Shi (2009). A sketch of this granulation/organization view of Bagging is given below. Section 3 will show how granular computing concepts can be used towards more effective partitioning of data for machine learning experimentation.
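As a minimal illustration of this reading of Bagging (ours, not from the paper; base_learner is a hypothetical classifier factory with scikit-learn-style fit/predict methods):

```python
import numpy as np

def bagging_predict(X_train, y_train, x_new, base_learner, n_models=10, seed=0):
    """Classify x_new by majority vote over models trained on bootstrap samples."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        # Granulation: draw an overlapping (bootstrap) sample of the training set.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = base_learner()            # hypothetical classifier factory
        model.fit(X_train[idx], y_train[idx])
        votes.append(model.predict([x_new])[0])
    # Organization: aggregate the individual predictions by majority vote.
    return max(set(votes), key=votes.count)
```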

3 Semi-random partitioning of data into training and test sets

In this section, we propose a multi-granularity framework for effective control of the partitioning of a data set into a training set and a test set. We also justify how the proposed approach can address the class imbalance and sample representativeness issues that can arise from random partitioning.

3.1 Key features

The multi-granularity framework for semi-random data partitioning is illustrated in Fig. 2. In particular, this framework involves three levels of granularity as outlined below:

1. Level 1: Data partitioning is done randomly on the basis of the original data set towards getting a training set and a test set.

2. Level 2: The original data set is divided into a number of subsets, with each subset containing a class of instances. Within each subset (i.e., all instances with a particular class label), data partitioning into training and test sets is done randomly. The training and test sets for the whole data set are obtained by merging all the training and test subsets, respectively.

3. Level 3: Based on the subsets obtained in Level 2, each of them is divided again into a number of subsubsets, where each of the subsubsets contains a subclass (of the corresponding class) of instances. The data partitioning is done randomly within each subsubset. The training and test sets for the whole data set are obtained by merging all the training and test subsubsets, respectively.

Fig. 2 Multi-granularity framework for semi-random data partitioning

In this multi-granularity framework, Level 2 is aimed at addressing the class imbalance issue, i.e., at controlling the distribution of instances by class within the training and test sets. Level 3 is aimed at addressing the issue of sample representativeness, i.e., at avoiding the case in which the training instances are highly dissimilar to the test instances following the data partitioning.

In the setting of granular computing, the proposed framework explicitly involves both granulation and organization. In particular, granulation is involved through the operation by which a data set is divided into a number of subsets and each subset is divided into a training subset and a test subset (Level 2), or further divided into subsubsets which are then split into training and test subsubsets (Level 3). Organization is involved by integrating the training subsets or subsubsets into a whole training set, and the test subsets or subsubsets into a whole test set. Finally, at each level of granularity shown in Fig. 2, a set of data is viewed as a granule, which also has hierarchical relationships with the sets of data (granules) located at other levels of granularity.

Table 1 Sampling probability by stratified sampling
Table 2 Sampling probability by semi-random partitioning

3.2 Justification

Level 2 of the proposed multi-granularity framework is aimed at controlling effectively the selection of training/test instances towards avoiding the issue of class imbalance, especially when the original data set is balanced. In particular, Level 2 is designed to ensure that for each class of instances, a fixed percentage of the instances would be included in the training/test set. For example, if we suppose that a data set is divided into a training set and a test set in the ratio of 70:30, the strategy of semi-random data partitioning involved in Level 2 of the multi-granularity framework can ensure that for each class of instances, there would be 70% of the instances selected as training instances and the rest of them selected as test instances. The above statement can be proven as follows:

Let us suppose that a data set contains two classes (positive and negative) of instances with the frequency distribution of \(p:(1-p)\), and the size of the data set is m. Following data partitioning, the percentage of the training set is q, whereas the percentage of the test set is \(1-q\).

When the above strategy of semi-random data partitioning is adopted, the following steps are involved:

1. Step 1: The data set is divided into two subsets, respectively, for the positive and negative classes, which results in \(mp\) positive instances and \(m(1-p)\) negative instances.

2. Step 2: Each class subset is partitioned into a training subset and a test subset. In particular, for the positive class, the size of the training subset is \(mpq\) and the size of the test subset is \(mp(1-q)\). Similarly, for the negative class, the size of the training subset is \(m(1-p)q\) and the size of the test subset is \(m(1-p)(1-q)\).

3. Step 3: The two training subsets resulting from Step 2 are merged into a whole training set and the frequency distribution between the positive and negative classes is \(mpq: m(1-p)q\), which is equivalent to \(p:(1-p)\), i.e., the original class distribution.

4. Step 4: The two test subsets resulting from Step 2 are merged into a whole test set and the frequency distribution between the positive and negative classes is \(mp(1-q): m(1-p)(1-q)\), which is equivalent to \(p: (1-p)\), i.e., the original class distribution.

Thus, the procedure for Level 2 ensures that the original class distribution for the whole data set is reflected within the training and test sets. The above proof, although demonstrated for a 2-class problem, also applies to multi-class classification problems, since the frequency distribution between different classes does not have any dependency on the number of classes as shown above.
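A minimal sketch of this Level 2 procedure in Python (our rendering of the steps above, not the authors' original code; labels is the list of class labels, and the returned index lists can be used to select rows of the feature matrix):

```python
import random
from collections import defaultdict

def semi_random_partition(labels, q=0.7, seed=42):
    """Return (train_idx, test_idx) preserving the original class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):     # granulation: one subset per class
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                   # random partitioning within the subset
        cut = round(q * len(idx))          # mpq training instances for this class
        train_idx += idx[:cut]             # organization: merge training subsets
        test_idx += idx[cut:]              # organization: merge test subsets
    return train_idx, test_idx
```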

The above procedure is inspired by the stratified sampling technique used in statistics (Srndal et al. 1992). In this context, a population (data set) is divided into subpopulations (data subsets), and then simple random sampling is used within each subpopulation to obtain a subsample. In the context of machine learning, each class represents a subpopulation and a training/test subset for a class represents a stratum. Stratified sampling is typically used to improve sample representativeness by reducing data variability and thus reducing sampling error (Esfahani and Dougherty 2014; Lang et al. 2016).

However, to avoid class imbalance by preserving the class distribution in the training and test sets, the classic stratified sampling technique needs to calculate the size of each stratum based on its percentage of the total, whereas the procedure for Level 2 of the proposed multi-granularity framework only needs to divide the data set into subsets (one per class) and then partition each subset, in a fixed ratio, into a training subset and a test subset, without the need to calculate the size of each training/test subset.

For example, suppose that a data set has three classes with the distribution 40:40:20, and that the partitioning needs to place 70% of the data set in the training set and 30% in the test set.

When stratified sampling is adopted, Table 1 shows that each class needs to be given a probability for its instances to be selected into either the training set or the test set, i.e., the sampling probability needs to be calculated for each class regarding the selection of its instances for the training/test set. This aims to preserve the original class distribution in both the training and test sets, but leads to higher computational complexity.
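The book-keeping this requires can be made concrete with the 40:40:20 example above (the data set size m = 1000 is a hypothetical value of ours, chosen only to make the strata sizes tangible):

```python
# Classic stratified sampling first computes the target size of every stratum
# from the class proportions and the 70:30 split (cf. Table 1), before any
# instances are drawn.
m = 1000                                                 # hypothetical data set size
class_proportions = {"class 1": 0.4, "class 2": 0.4, "class 3": 0.2}

for c, p in class_proportions.items():
    n_class = round(m * p)            # instances of this class in the data set
    n_train = round(n_class * 0.7)    # stratum size within the training set
    n_test = n_class - n_train        # stratum size within the test set
    print(c, n_train, n_test)
```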

Table 2 shows that, with semi-random partitioning, there is no need to calculate a sampling probability for each class. Instead, it is only necessary to divide the original data set into n subsets, where n is the number of classes; within the subset corresponding to each class, every instance is simply selected for the training/test set with a 70%/30% chance.

On the basis of the above description, stratified sampling attends only to preserving the original class distribution, by giving each class a sampling probability for its instances to be selected, without taking into account the balance between the training and test samples, whereas the proposed semi-random partitioning pays more attention to balancing the training and test sets by simply giving each instance a 70%/30% chance of being selected for the training/test set.

Table 3 Data sets
Table 4 Comparison with stratified sampling in terms of C4.5 performance
Table 5 Comparison with stratified sampling in terms of NB performance
Table 6 Comparison with stratified sampling in terms of K-NN performance

Level 3 of the proposed multi-granularity framework is aimed at controlling effectively the selection of training/test instances to ensure sample representativeness. In particular, a lack of sample representativeness is likely to lead to overfitting, which means that a model performs well on the training data, but poorly on the test data. Thus, what the algorithm learns from the training data is not useful for the test data, something typically referred to as a lack of generalization; in other words, the model is too specialized, i.e., it has learned the training data very well, but cannot generalize this knowledge to other situations, such as the ones in the test set.

To avoid this problem, the sample of data in the training set should be representative of the whole data set, by ensuring that there is not a large dissimilarity between the training set and the test set. To avoid this dissimilarity, Level 3 of the proposed multi-granularity framework is designed to group instances on the basis of their similarity to each other and to perform the partitioning within these groups, such that instances from each group are present in both the training and the test sets; one possible realization is sketched below.
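Since this paper only proposes Level 3 (experiments with it are left for future work), the following sketch is one plausible realization under our own assumptions: subclasses are formed by k-means clustering within each class (the paper does not fix a grouping method, and n_subclasses is assumed small relative to the class sizes), after which the Level 2 split is applied within every cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def level3_partition(X, labels, q=0.7, n_subclasses=3, seed=42):
    """Split within per-class clusters so every group feeds both sets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for label in np.unique(labels):
        class_idx = np.flatnonzero(labels == label)
        # Granulation: specialize the class into subclasses by clustering.
        sub = KMeans(n_clusters=n_subclasses, n_init=10,
                     random_state=seed).fit_predict(X[class_idx])
        for s in range(n_subclasses):
            idx = rng.permutation(class_idx[sub == s])
            cut = round(q * len(idx))
            train_idx += idx[:cut].tolist()   # organization: merge subsubsets
            test_idx += idx[cut:].tolist()
    return train_idx, test_idx
```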

Table 7 Class frequency distribution with semi-random partitioning
Table 8 C4.5: class frequency distribution in training and test sets for random partitioning
Table 9 NB: class frequency distribution in training and test sets for random partitioning
Table 10 K-NN: class frequency distribution in training and test sets for random partitioning

4 Experiments, results, and discussion

In this section, we report two experimental studies. In particular, the first study compares our proposed approach of semi-random data partitioning with the stratified sampling approach. The second study validates the effectiveness of the strategy of semi-random data partitioning involved in Level 2 of the multi-granularity framework proposed in Sect. 3. In particular, we compare the strategy of semi-random data partitioning with that of traditional random data partitioning, in terms of class frequency distribution within the training and test sets, as well as the influence of this distribution on classification performance.

The experimental studies are conducted using 12 UCI data sets (Lichman 2013). The characteristics of the data sets are shown in Table 3. All the chosen data sets are either balanced or slightly imbalanced in terms of class frequency distribution, except for the ‘anneal’ and ‘autos’ data sets. By using both balanced and slightly imbalanced data sets, the aim is to show that it is necessary to keep the balance level of both the training and test sets as close as possible to the balance level of the original data set, towards avoiding any impact on the learning performance of the algorithms and on the classification performance of the learned classifiers. The imbalanced data sets, i.e., ‘anneal’ and ‘autos’, as well as the balanced ‘segment’ data set, have a larger number of classes, while the other nine data sets have two or three classes. This also allows us to analyze the results in terms of the number of classes.

Three popular machine learning algorithms, i.e., the C4.5 decision tree learning algorithm (Quinlan 1993), Naive Bayes (Rish 2001), and K-nearest neighbour (Liu et al. 2016), are used for validation, since all three algorithms are sensitive to class imbalance (Longadge et al. 2013).
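For readers who wish to reproduce the protocol, a sketch with scikit-learn stand-ins follows (our choices, not the paper's exact implementations: DecisionTreeClassifier is a CART-style approximation of C4.5, which scikit-learn does not implement; X and y are NumPy arrays, and semi_random_partition is the Level 2 sketch from Sect. 3):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

train_idx, test_idx = semi_random_partition(y)   # Level 2 sketch from Sect. 3

for name, model in [("C4.5-like tree", DecisionTreeClassifier(random_state=42)),
                    ("Naive Bayes", GaussianNB()),
                    ("K-NN", KNeighborsClassifier())]:
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    print(name,
          accuracy_score(y[test_idx], pred),
          precision_score(y[test_idx], pred, average=None),  # per-class precision
          recall_score(y[test_idx], pred, average=None))     # per-class recall
```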

Regarding the first experimental study, the results are shown in Tables 4, 5, and 6. In these three tables, SS stands for stratified sampling and SR stands for semi-random partitioning.

Table 4 shows that the proposed semi-random partitioning outperforms stratified sampling in 9 out of 12 cases, and the two approaches perform the same in the other 3 cases, in terms of overall accuracy of classification. In addition, the proposed semi-random partitioning outperforms stratified sampling in terms of precision and recall with respect to each single class in most cases.

Table 5 shows that the proposed semi-random partitioning outperforms stratified sampling in 9 out of 12 cases, and the two approaches perform the same in 2 out of the other 3 cases, in terms of overall accuracy of classification. In addition, the proposed semi-random partitioning outperforms stratified sampling in terms of precision and recall with respect to each single class in most cases.

Table 6 shows that the proposed semi-random partitioning outperforms stratified sampling in 7 out of 12 cases, and the two approaches perform the same in 3 out of the other 5 cases, in terms of overall accuracy of classification. In addition, the proposed semi-random partitioning outperforms stratified sampling in terms of precision and recall with respect to each single class in most cases.

Regarding the second experimental study, Table 7 displays the original distribution of instances across classes for each data set in terms of frequencies (designated by #) and percentages (designated by %). For example, the ‘anneal’ data set (first row in Table 7) has 6 classes, and in the original distribution, class 1 has 8 instances (representing 1% of all instances), class 2 has 99 instances (representing 11% of the data), and so on. The same information is also displayed for the training and test sets used with the semi-random partitioning approach. The percentage numbers have been rounded to integers for ease of comparison. The loss of precision due to this rounding means that the sum across all classes may not be precisely 100%. In addition, when the number of instances is low, a small difference in the number of instances may lead to a much bigger difference in the percentage values.

Tables 8, 9, and 10 show the original distribution, as well as the distribution within the training and test sets for C4.5, NB, and K-NN, respectively. The original distribution is included in all tables for ease of comparison.

Table 11 C4.5 performance on accuracy, precision, and recall
Fig. 3 Class distribution and performance (precision and recall) by C4.5 for random and semi-random partitioning for the ‘anneal’, ‘autos’, ‘credit-a’, and ‘heart-statlog’ data sets

Fig. 4 Class distribution and performance (precision and recall) by C4.5 for random and semi-random partitioning for the ‘iris’, ‘kr-vs-kp’, ‘labor’, and ‘segment’ data sets

Fig. 5 Class distribution and performance (precision and recall) by C4.5 for random and semi-random partitioning for the ‘sonar’, ‘tae’, ‘vote’, and ‘wine’ data sets

The random selection of data for training and test sets leads to different effects on the distribution of instances across classes within the training and test sets, which are outlined below:

  • For initially balanced data sets such as ‘iris’, ‘segment’, and ‘tae’, the random partitioning may lead to a loss of balance within the training and test sets. This loss can be observed for C4.5 on the ‘iris’ and ‘tae’ data sets, while for the ‘segment’ data set the variation is smaller. Similarly, for NB, the loss of balance can be noticed for the ‘iris’ and ‘tae’ data sets, while for the ‘segment’ data set the variation is smaller, but more noticeable than for C4.5. For K-NN, a loss of balance can be observed for the ‘tae’ data set, while for the ‘iris’ data set the imbalance is very small, and for the ‘segment’ data set the variation is small and similar to that for C4.5.

  • For slightly imbalanced data sets, the random partitioning may lead to a more balanced distribution in the training set, but a more imbalanced one in the test set, i.e., for C4.5, ‘heart-statlog’; for NB, ‘labor’ and ‘vote’; for K-NN, ‘credit-a’, ‘labor’, and ‘sonar’. Sometimes, the imbalance in the test set may mean that the majority class from the training set becomes the minority class in the test set; this occurs only for one data set, i.e., ‘sonar’ with K-NN, which is probably due to the fact that the distribution in this data set is very close to perfect balance (47:53).

  • For slightly imbalanced data sets, the random partitioning may lead to a more balanced distribution in the test set, but a more imbalanced distribution in the training set, i.e., for C4.5, ‘kr-vs-kp’ and ‘labor’; for NB, ‘heart-statlog’. For two of these, C4.5 with ‘kr-vs-kp’ and NB with ‘heart-statlog’, the majority class in the test set is reversed in comparison with the training set.

  • For slightly imbalanced data sets, the random partitioning may lead both the training and test sets to become more imbalanced, with a different class being the majority class in the training and test sets; for example, in the ‘sonar’ data set with C4.5, class 2 is the majority class in the training set, while class 1 is the majority class in the test set. This situation occurs on the ‘sonar’ data set for C4.5 and NB, and on the ‘wine’ data set for all algorithms (C4.5, NB, and K-NN).

  • For the data sets with a high number of classes and an imbalanced distribution, e.g., ‘anneal’ and ‘autos’, the random partitioning may preserve the original distribution for some classes, while for others there is an imbalance in the training set, the test set, or both, i.e., the ‘autos’ data set for all algorithms (C4.5, NB, and K-NN); sometimes, the majority class in the training set is no longer the majority class in the test set, e.g., for C4.5 with ‘autos’, class 5 is the majority class in the training set, while class 4 is the majority class in the test set (as well as in the original data set). For the ‘anneal’ data set, the distribution changes slightly, but the majority of the changes are less than 2%; for this reason, we consider that the distribution for this data set with all algorithms is very similar to the original distribution.

  • For all data sets, the random partitioning may lead to a very similar distribution in the training and test sets as in the original data set, i.e., for C4.5, ‘anneal’, ‘credit-a’, and ‘vote’; for NB, ‘anneal’, ‘credit-a’, and ‘kr-vs-kp’; for K-NN, ‘anneal’, ‘heart-statlog’, ‘kr-vs-kp’, and ‘vote’.

Table 11 shows the experimental results for the C4.5 algorithm with random (R) and semi-random (SR) partitioning, which include the accuracy (last column), as well as precision and recall per class.

In terms of accuracy, the results show three situations: (a) the semi-random partitioning leads to the same accuracy as random partitioning, i.e., ‘anneal’, ‘kr-vs-kp’, and ‘segment’; (b) the semi-random partitioning displays small improvements in accuracy (up to 3%), i.e., ‘autos’, ‘credit-a’, ‘heart-statlog’, ‘sonar’, ‘tae’, and ‘vote’; (c) the semi-random partitioning displays large improvements in accuracy (5% or more), i.e., ‘iris’ (7%), ‘labor’ (23%), and ‘wine’ (5%).

Figures 3, 4, and 5 display the class distribution, as well as the precision and recall for the experiments with C4.5 on all data sets (4 per graph). The class distribution for the whole data set is represented by the middle bar for every class; the distribution into the training and test sets for random partitioning is represented by the bars on the left, while the ones for semi-random partitioning are represented by the bars on the right. The lines with the square points represent the values for precision (yellow for random partitioning and brown for semi-random partitioning); the lines with the triangle points represent the values for recall (blue for random partitioning and green for semi-random partitioning). The left axis on the graphs represents the number of instances (or class frequency), while the right axis represents the values for precision and recall, with a range from 0 to 1.

For the data sets where the accuracy is the same for both random and semi-random partitioning, i.e., ‘anneal’, ‘kr-vs-kp’, and ‘segment’, the class distribution (see Table 8; Figs. 3, 4) is very similar for both random and semi-random partitioning. For the ‘kr-vs-kp’ data set, although the test set is more balanced (and the training one more imbalanced) compared with the original distribution, the change is very small, especially for the training set, where the change is only 1%. For this data set, we also observed that the majority class in the training set becomes the minority class in the test set; the difference, however, is very small, i.e., 2%. Given the large size of this data set and only a slight imbalance in the distribution of classes, it is not surprising that such a small change in distribution does not impact the results.

For the data sets where the accuracy is slightly higher when semi-random partitioning is used, i.e., ‘autos’, ‘credit-a’, ‘heart-statlog’, ‘sonar’, ‘tae’, and ‘vote’, the random partitioning has different effects on the class distribution within the training and test sets.

For the ‘autos’ data set, we notice several situations for different classes:

(a) for class 2, all instances are assigned to the training set; thus, while the model learned something about this class, nothing is tested and, consequently, the performance for this class is 0;

(b) for class 3 and class 6, the random partitioning leads to proportionally more instances in the training set than the semi-random one; for these, the performance is higher with the random partitioning, which could be explained by the greater opportunities for learning under the random partitioning and/or by a lack of sample representativeness for the semi-random partitioning; this will be discussed in more detail further on;

(c) for class 4 and class 7, the opposite situation occurs, i.e., there are proportionally fewer instances in the training set for the random partitioning than for the semi-random one; for these, the performance is higher with the semi-random partitioning, which could similarly be due to a lack of learning opportunities for the random partitioning and/or to sample representativeness for the semi-random partitioning;

(d) finally, for class 5, there are proportionally more instances in the training set for the random partitioning than for the semi-random one; in addition, this is the majority class in the test set (while class 4 is the majority one in the training set). For this class, the precision value is higher with semi-random partitioning, while recall is higher with the random partitioning. Precision is about how many of the instances labeled by the model as class 5 are truly class 5 (as opposed to other classes), while recall is about how many of all the class 5 instances are correctly identified as class 5 (the formal definitions are recalled after this list). Thus, a low precision indicates that instances of other classes are wrongly labeled as class 5, while a low recall indicates that the model has not learned sufficiently how to identify class 5 (due to either not enough opportunities for learning or overfitting). A possible explanation for the higher recall with random partitioning is that the higher number of instances in the training set leads to a model that has learned “better” how to recognize a class 5 instance based on the knowledge about class 5, while the opposite effect occurs for the semi-random partitioning; the better precision for semi-random partitioning could be explained by the better balance of the distribution between classes, which leads to a model that can better distinguish a class 5 instance from instances of other classes.
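For reference, the standard definitions used in item (d) are, for a class \(c\):

\(\text{precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \text{recall}_c = \frac{TP_c}{TP_c + FN_c}\),

where \(TP_c\) is the number of class-\(c\) instances correctly labeled as \(c\) (true positives), \(FP_c\) is the number of instances of other classes wrongly labeled as \(c\) (false positives), and \(FN_c\) is the number of class-\(c\) instances wrongly labeled as another class (false negatives).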

For the ‘credit-a’ and ‘vote’ data sets, the class distribution is very similar for random and semi-random partitioning; in this case, the difference is likely to be due to sample representativeness. For ‘heart-statlog’, ‘sonar’, and ‘tae’, the class distribution changes for the majority of classes when using random partitioning, which has a mixed effect on the results for different classes, i.e., precision and/or recall are sometimes higher for semi-random partitioning and sometimes higher for random partitioning. In addition, when the distribution is similar for random and semi-random partitioning, e.g., class 1 of the ‘tae’ data set, the results are different, which may be due to sample representativeness.

For the data sets where the accuracy is considerably higher when using semi-random partitioning, i.e., ‘iris’ (7%), ‘labor’ (23%), and ‘wine’ (5%), we notice that the random partitioning leads to class distribution imbalance in the training sets, and sometimes in the test sets as well (i.e., ‘iris’ and ‘wine’). The difference in results is likely to be due to the class imbalance issue, as well as to sample representativeness (e.g., class 1 of the ‘wine’ data set has a similar distribution for both random and semi-random partitioning, but different precision results).

Table 12 displays the experimental results for Naive Bayes (NB), including recall and precision per class, as well as accuracy, for random (R) and semi-random (SR) partitioning. Figures 6, 7, and 8 display the precision and recall results, as well as the class distribution, with a structure similar to that of the previous graphs (with the C4.5 results).

When looking at accuracy, the results for NB show four situations: (a) the semi-random partitioning has lower accuracy than the random one, i.e., ‘segment’ and ‘wine’; (b) the accuracy is the same for both types of partitioning, i.e., ‘sonar’; (c) the accuracy for semi-random partitioning is slightly higher than for the random one (up to 4%), i.e., ‘anneal’, ‘autos’, ‘credit-a’, ‘iris’, ‘kr-vs-kp’, and ‘vote’; (d) the accuracy for semi-random partitioning is considerably higher (5% or more), i.e., ‘heart-statlog’ (5%), ‘labor’ (12%), and ‘tae’ (18%).

Table 12 NB performance on accuracy, precision, and recall
Fig. 6 Class distribution and performance (precision and recall) by NB for random and semi-random partitioning for the ‘anneal’, ‘autos’, ‘credit-a’, and ‘heart-statlog’ data sets

Fig. 7 Class distribution and performance (precision and recall) by NB for random and semi-random partitioning for the ‘iris’, ‘kr-vs-kp’, ‘labor’, and ‘segment’ data sets

Fig. 8 Class distribution and performance (precision and recall) by NB for random and semi-random partitioning for the ‘sonar’, ‘tae’, ‘vote’, and ‘wine’ data sets

For the data sets displaying lower accuracy for the semi-random partitioning, i.e., ‘segment’ and ‘wine’, the difference in accuracy compared with random partitioning is very small, i.e., 1% for ‘segment’ and 2% for ‘wine’. For the ‘segment’ data set, there is a small change in the class distribution with random partitioning; for the classes where the change results in more instances in the training set, the recall values are higher, while for the classes where the change results in more instances in the test set, the precision is higher; for the classes where there is little or no change, the difference in results may be due to sample representativeness. For the ‘wine’ data set, the random partitioning results in a more balanced distribution across classes in the training set, which may explain the better performance.

The accuracy for random and semi-random partitioning is the same on the ‘sonar’ data set, for which the random partitioning leads to a more imbalanced training set, with the same effect as above, i.e., when there are more instances in the training set, the recall is higher, while when there are more instances in the test set, the precision is higher.

For 6 data sets, i.e., ‘anneal’, ‘autos’, ‘credit-a’, ‘iris’, ‘kr-vs-kp’, and ‘vote’, the semi-random partitioning has up to 4% better accuracy than random partitioning. For the ‘anneal’, ‘credit-a’, and ‘kr-vs-kp’, the class distribution is very similar for random and semi-random partitioning—thus, the small difference is likely to be due to sample representativeness. For the ‘autos’, ‘iris’, and ‘vote’, the random partitioning leads to more class imbalance, which may affect the results.

The accuracy for the semi-random partitioning is higher than for the random one on three data sets, i.e., ‘heart-statlog’(5%), ‘labor’ (12%), and ‘tae’ (18%). For the ‘heart-statlog’ and ‘tae’, the random partitioning leads to higher class imbalance in the training set, which may explain the results. For the ‘labor’ data set, the random partitioning leads to a better balance within the training set, but lower results than the semi-random partitioning which matches the original distribution—we believe that sample representativeness plays a big role in this situation and will investigate this in future work.

Table 13 K-NN performance on accuracy, precision, and recall
Fig. 9 Class distribution and performance (precision and recall) by K-NN for random and semi-random partitioning for the ‘anneal’, ‘autos’, ‘credit-a’, and ‘heart-statlog’ data sets

Fig. 10 Class distribution and performance (precision and recall) by K-NN for random and semi-random partitioning for the ‘iris’, ‘kr-vs-kp’, ‘labor’, and ‘segment’ data sets

Fig. 11 Class distribution and performance (precision and recall) by K-NN for random and semi-random partitioning for the ‘sonar’, ‘tae’, ‘vote’, and ‘wine’ data sets

Table 13 and Figs. 9, 10, and 11 display the results for the experiments with the K-nearest neighbour (K-NN) algorithm.

Similar to the results for Naive Bayes, we have four situations: (a) the accuracy for semi-random partitioning is slightly lower than for random partitioning, i.e., ‘wine’ (2%); (b) the two ways of partitioning have the same accuracy, i.e., the ‘anneal’ and ‘labor’ data sets; (c) the semi-random partitioning has slightly better accuracy (up to 3%), i.e., ‘credit-a’, ‘iris’, ‘kr-vs-kp’, ‘segment’, ‘sonar’, and ‘vote’; (d) the accuracy is considerably higher (5% or more) for the semi-random partitioning, i.e., ‘autos’ (6%), ‘heart-statlog’, and ‘tae’.

For the ‘wine’ data set, on which the random partitioning leads to 2% better accuracy, the partitioning leads to a higher number of instances in the training set for classes 2 and 3, which have the same or higher recall compared with semi-random partitioning. For class 1, there are more instances in the test set for the random partitioning, which has a higher precision than semi-random partitioning.

For the ‘anneal’ data set, the class distribution is similar for random and semi-random partitioning, thus justifying the similar performance. For the ‘labor’ data set, the random partitioning leads to a better balance in the training set and a similar performance to the semi-random partitioning. This better class balance occurred also for the NB algorithm; however, the results there were worse. The different results for the K-NN algorithm support our hypothesis that sample representativeness plays an important role in explaining these results.

When the semi-random partitioning leads to slight improvements in accuracy, i.e., ‘credit-a’, ‘iris’, ‘kr-vs-kp’, ‘segment’, ‘sonar’, and ‘vote’, we notice similar patterns: (1) for similar distributions, i.e., ‘kr-vs-kp’ and ‘vote’, the difference is likely to be due to sample representativeness; (2) when the random sampling leads to changes in the class distribution, an increase in the number of instances in the training set is associated with increase in recall, while the increase in the number of instances in the test set is associated with an increase in precision.

For the data sets with considerably higher accuracy for semi-random partitioning, there are two situations: (a) the class distribution is the same, i.e., ‘heart-statlog’—consequently, the difference in results is probably due to sample representativeness; (b) the random partitioning leads to higher imbalance for some classes, which together with the sample representativeness explain the results, i.e., ‘autos’ and ‘tae’.

To summarise, we noticed that the distribution of classes within the training and test sets has an effect on the performance results. In particular, there is an association between a larger number of instances in the training set and a higher recall, and between a larger number of instances in the test set and a higher precision. A higher number of instances in the training set can mean more opportunities for learning, and, thus, a better knowledge of a particular class, which explains the higher recall. For a good performance, however, recall needs to be balanced with precision, i.e., the model needs to distinguish a particular class from the other classes; in other words, a low precision for a given class means that instances of other classes are wrongly labeled with that class. This is more likely to be influenced by the distribution among classes than by the distribution of a class between the training and the test set, as the balance between classes in the training set has an influence on the capacity to learn to distinguish between classes (which is why class imbalance is known to lead to poor performance). This is supported by the fact that the semi-random partitioning results are more balanced in terms of precision and recall, while the random partitioning, with an imbalanced class distribution in the training set, as well as imbalance across the training and test sets, tends to produce one of two combinations: (a) high precision and low recall, or (b) low precision and high recall.

The results also indicate that the class distribution within the training set has more influence on the performance than the class distribution within the test set. On the other hand, the distribution within the test set still requires consideration to accurately assess the performance of a model. For example, a small test sample may not sufficiently test the knowledge learned for a particular class—in an extreme situation, it may mean that knowledge is not tested at all. These aspects can be easily controlled with our proposed partitioning method.

Overall, the experimental results indicate that the adoption of the strategy of semi-random data partitioning involved in Level 2 of the multi-granularity framework proposed in Sect. 3 achieves effective control of the selection of training/test instances, towards avoiding the case of class imbalance in both training and test sets, especially when data sets are originally balanced or slightly imbalanced.

Our results also showed situations in which the random and semi-random partitioning led to the same distribution, but different results. We believe that these are likely to be explained by sample representativeness issues, which we will address in future work with experiments on Level 3 of the proposed multi-granularity framework.

5 Conclusions

In this paper, we identified two issues resulting from the random partitioning of data into a training set and a test set. In particular, we argued that a fully random way of data partitioning could lead to class imbalance and to sample representativeness issues, i.e., the case that training instances are highly dissimilar to the test instances. To address these issues, we proposed a multi-granularity framework for semi-random data partitioning. The proposed framework involves both granulation and organization in the setting of granular computing, towards more effective data partitioning in a semi-random way.

We conducted several experiments using 12 UCI data sets and three popular machine learning algorithms (C4.5, Naive Bayes, and K-nearest neighbour). We focused on Level 2 of the framework for avoiding class imbalance. The results show interesting effects of the class distribution within the training and test sets on overall accuracy, as well as precision and recall per class. The results have also shown that the same class distribution for random and semi-random partitioning can lead to different performance results—we believe that this is most likely due to the issues of sample representativeness, which are addressed in Level 3 of the proposed framework.

In particular, for Level 3, we argued the necessity that each class of instances needs to be specialized into a number of subclasses, by grouping instances from the same class based on their similarity. By sampling data for the training and test sets at the level of these subclasses, the sample representativeness can be controlled across both the training and test sets, thus avoiding situations in which knowledge is learned but not tested, or knowledge that is tested without having been learned.

In this paper, we focused on the preservation of the original class distribution within the training and test sets. While this approach is suitable for balanced and slightly imbalanced data sets, it may not be the best for highly imbalanced data sets. In future work, we will investigate how the principles of Level 2 in our framework can be adapted for imbalanced data sets, using stratified sampling (mentioned in Sect. 3.2) to achieve a better balance for the class distribution, particularly in the training set.