Introduction

Many real-world datasets have missing or incomplete data [22, 23, 75]. Since the accuracy of most machine learning algorithms for classification, regression and clustering can be affected by the completeness of a dataset, handling missing data is a significant step in data mining and machine learning processes. Yet, it remains under-explored in the literature [11, 28, 49, 61, 68,69,70].

A few strategies have been commonly used to handle incomplete data [30, 34, 48]. For regression problems specifically, where missing data has been more widely studied [36, 38, 39, 48, 55], multiple imputation (MI) has shown advantages over other methods [48, 72] because the multiple imputed values provide a mechanism to capture the uncertainty introduced by missing data. However, work is still needed to address the problem of missing data in the context of data mining algorithms. In particular, it is timely to experiment with the concept of multiple imputation and how to apply it to classification problems.

The aim of this work is therefore to conduct a thorough investigation of how to effectively apply MI to classification algorithms. We propose an ensemble that combines the multiple models produced by MI, and we investigate ways of combining different ensemble mechanisms with MI methods to achieve the best results.

Our proposed method, MIE, is evaluated and compared with other alternatives under simulated scenarios of increasing uncertainty in terms of missing data. For this, we create an experimental environment using datasets selected from the University of California Irvine (UCI) machine learning repository [47]. For each dataset, we use a mechanism called Missing Completely at Random (MCAR) to generate missing data by removing the values of chosen attributes and instances with a variable probability. We therefore produce several experimental datasets which contain increasing amounts of data MCAR.

In those scenarios, we investigate how increasing the amount of missing data affects the performance of competing approaches for handling missing data. These include each algorithm’s internal mechanism for handling missing data, single imputation, machine learning imputation and our proposed MIE.

The Problem of Missing Data

Little and Rubin [48] defined the missing data problem based on how missing data is produced in the first place and proposed three main categories: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing not at Random (MNAR). The categorisation is important because it affects the biases that may be inherent in the data, and therefore the safety of approaches such as imputation. Missing Completely at Random (MCAR) occurs when whether a value is missing for a particular variable is independent of any other variable and of the missing data itself. For MCAR, the missing data is therefore not related to any other factor, known or unknown, in the study. This represents the safest environment for imputation to operate in. Missing at Random (MAR) occurs when the probability of a missing value in a record may depend on the known values of other attributes but not on the missing data. There are some inherent biases in data MAR, but it may still be safe to analyse this type of data without explicitly accounting for the missing data. Missing not at Random (MNAR) occurs when the probability of an instance having a missing value depends on unobserved values. This is also termed a non-ignorable process and is the most difficult scenario to deal with. In this paper we focus on addressing MCAR data, the safest environment in which imputation can operate and one that is often encountered. Further work will investigate the other mechanisms.

Horton et al. [39] have further categorised the patterns of missing data into monotone and non-monotone. Monotone patterns of missing data imply that the same data points have missing values in one or more features, so specific points are affected by missing data. They state that the patterns are concerned with which values are missing, whereas the mechanisms are concerned with why data is missing.

In this study we focus on non-monotone MCAR data, so our missing data affects multiple data points with no particular relation between the data missing in different attributes for the same data point.

Mechanisms for Dealing with Missing Data

In practice, four popular approaches have been used to deal with incomplete data: complete-case analysis [48], statistical imputation methods [4, 24, 34, 59], machine learning algorithms for imputation [40, 51, 52, 57, 59] and algorithms with a built-in mechanism to deal with missing data [10, 25, 43, 51, 74]. The first three approaches rely on pre-processing of the data to either remove or replace missing values. The last approach relies on a mechanism within the algorithms themselves to produce models that take account of the missing data. We provided a detailed explanation of the different approaches in our previous work [2].

We also studied how different classification algorithms such as C4.5 [51, 52], Naïve Bayes (NB) [10, 45], support vector machines (SVMs) [5, 16, 74] and random forest (RF) [7, 21], and their implementations in Weka, our platform of choice, treat missing values [2]. A number of classification algorithms (e.g. C4.5 [52] and RF [43]) have been constructed with a mechanism known as the fractional method to cope with missing data. Naïve Bayes ignores features with missing values; thus, only the complete features are used for classification [10, 45]. SVMs do not deal with missing values [46], but their implementation in Weka, SMO, performs simple imputation [32]. In addition to the previous classifiers, in this work we investigate PART, which is also capable of treating missing data when constructing a partial tree, as C4.5 does [25].

The rest of this paper is organised as follows: “Related Work” section reviews related research; the methods used in our paper are described in “MIE for Classification” section, followed by our experimental set-up in “Experimental Set-Up” section; “Results” section details the results of this study; this is followed by a discussion and conclusions in “Discussion and Conclusions” section.

Related Work

MI was first studied in the context of statistical analysis [55, 57, 58]. Since then, it has been widely applied in many areas such as survival analysis [41, 73], epidemiological and clinical trials [44, 65], medical studies [56, 72] and longitudinal studies [50, 63]. The application of MI with ensemble learning for classification has rarely been explored in the literature. We review a few published papers that have discussed the problem of missing data in the context of classification algorithms and the use of MI methods.

Silva-Ramírez et al. [62] proposed a method for simple imputation based on a multi-layer perceptron (IMLP) and a method for multiple data imputation that combines a multi-layer perceptron with the k-nearest neighbour (k-NN) algorithm to impute missing data (MIMLP). The problem under consideration was monotone MCAR missing data. The methods were compared with traditional imputation methods such as mean, hot-deck and regression-based imputation. Their results showed that the MIMLP method performed best for numeric variables and the IMLP method performed better with categorical variables. Imputation by MLP methods offered some advantages for some datasets, though statistical tests for significance were not performed.

Liu et al. [49] proposed a credal classification method with adaptive imputation for incomplete patterns. In credal classification, objects can belong to multiple classes and meta-classes. The method has two stages. First, a record is classified based on the available information if the class is non-ambiguous. However, when the record is hard to classify, it goes to the second stage, which involves imputation followed by classification. In the imputation phase, a self-organising map (SOM) is used in combination with k-NN to obtain good accuracy while reducing the computational burden.

A correlation-based low-rank matrix completion (LRMC) method was developed by Chen et al. [12]. The method applies LRMC to estimate missing data, then uses a weighted Pearson correlation followed by a k-nearest neighbour (k-NN) search to choose the most similar samples. Furthermore, they proposed an ensemble learning method to integrate multiple imputed values for a specific sample to improve imputation performance. The proposed method was tested on both traffic flow volume data and benchmark datasets. Further investigation was conducted to test the performance of the imputation in classification tasks. Their correlation-based LRMC and its ensemble learning method achieved better performance than traffic flow imputation methods such as temporal nearest average imputation (TNAI), temporal average imputation (TAI), probabilistic principal component analysis (PPCA) and low-rank matrix completion (LRMC).

Tran et al. [69] proposed a method that combines multiple imputation with an ensemble and compared it with methods that use simple imputation. Ten datasets were collected from the UCI repository. The ensemble achieved better classification accuracy than the other methods. However, they only applied C4.5 as the classification algorithm and used a single method to perform multiple imputation on relatively small datasets.

Garciarena and Santana [31] studied the relationship between different imputation methods and missing data patterns using ten datasets from the UCI repository and a set of fourteen different classifiers such as decision trees, neural networks, support vector machines, k-NN and logistic regression. Their results show that the performance of individual classifiers differs statistically when various imputation methods are used. They concluded that the key to selecting proper imputation methods is first to check the patterns of missing data.

Tran et al. [70] further proposed methods incorporating imputation (single/multiple) with feature selection and clustering to improve classification accuracy as well as the computational efficiency of imputation.

A new hybrid technique based on a fuzzy c-means clustering algorithm, mutual information feature selection and regression models (GFCM) was developed by Sefidian and Daneshpour [61]. The aim was to find a set of similar records with high dependencies for a missing record and then apply regression imputation techniques within the group to estimate missing values for that record. The method showed statistically significant differences in most cases in comparison with mean imputation, kNNI, MLPI [62], FCMI [53] and IARI [64].

MIE for Classification

MI is a promising method that replaces missing values by randomly drawing several imputed values from the distribution of the unknown data [48, 55]. Unlike simple imputation, the uncertainty is reflected because the imputation process results in several plausible values. There are a number of methods to impute data, which we explore in our work and explain below. We also explore different methods to ensemble the results obtained from the different imputed values, as the ensemble represents a method for combining the evidence from the different models to arrive at a final classification which should encompass the degree of missing data.

Imputation Methods

Multivariate Imputation by Chained Equations (MICE)

Fully Conditional Specification (FCS) is a method of MI first developed by Kennickell [42]. It defines a conditional density function to specify an imputation model for each predictor (variable) with missing values one by one, then iterates the imputation over that model. Multivariate Imputation by Chained Equations (MICE), an algorithm developed by Buuren [8], is based on FCS, but the imputation can also be applied to data that does not follow a multivariate distribution. The algorithm starts by specifying an imputation model for each column with missing values. After that, imputation is performed based on random draws from the observed data. The process is repeated according to the number of iterations set and the number of variables with missing values.
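To make this concrete, the following is a minimal sketch (not the paper's exact script) of generating multiple imputations with the mice package in R, using the settings reported later in “Proposed MIE Methods” section (predictive mean matching, 20 iterations, 5 imputed datasets); the data frame name train is a placeholder.

```r
library(mice)

# 'train' is a placeholder data frame containing missing values.
# Predictive mean matching ("pmm"), 20 iterations, 5 imputed datasets,
# matching the settings reported in this paper.
imp <- mice(train, m = 5, method = "pmm", maxit = 20, seed = 1, printFlag = FALSE)

# Extract the i-th completed dataset (i = 1, ..., 5)
imputed_sets <- lapply(1:5, function(i) complete(imp, action = i))
```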

Expectation–Maximisation with Bootstrapping (EMB)

Honaker et al. [38] developed the EMB algorithm for handling missing data, which combines the EM algorithm with a bootstrapping approach. The EM algorithm is an iterative approach developed by Dempster et al. [17]. Alternating between an expectation and a maximisation step, the algorithm estimates the model parameters by iteratively performing the following. First, in the expectation step (E-step), the likelihood function is evaluated given the current estimate of the model parameters. Second, in the maximisation step (M-step), the parameters are updated to maximise the likelihood function. The next E-step then uses the updated parameters from the M-step to determine the new distribution, and the process repeats.

Bootstrapping, on the other hand, is a mechanism used to estimate a sampling distribution by resampling from the original data. EMB works by repeatedly drawing a bootstrap sample with replacement from the original data M times, for the M required imputations; EM is then run on each bootstrap sample: it first assumes a particular distribution and initialises mean and variance values for the missing data, then the likelihood function is estimated given the current estimate of the model parameters (mean and covariance), and the parameters are updated to maximise the likelihood. The expectation and maximisation steps are repeated until the values converge [37].
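A corresponding sketch with the Amelia package, which implements EMB, might look as follows; train is again a placeholder, and the column names passed to noms (which declares nominal attributes) are illustrative only.

```r
library(Amelia)

# EMB multiple imputation: bootstrap the data, run EM on each bootstrap sample
# and draw the imputed values. m = 5 imputed datasets, as used in this paper.
# 'train' is a placeholder; categorical columns are declared via 'noms'.
em <- amelia(train, m = 5, noms = c("colour", "class"), p2s = 0)

# The completed datasets are returned as a list of length m
imputed_sets <- em$imputations
```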

Ensemble Methods

An ensemble is a technique for combining models used in machine learning. It was introduced by Tukey [71], who built an ensemble of two different regression models. It has since been broadly studied and reviewed in classification tasks [6, 19, 20, 76]. The idea of an ensemble is to induce a set of base learners (classifiers) whose predictions are then aggregated in some way to obtain a better classification. This can have advantages over relying on a single model, as a combined model may be more precise and accurate. Furthermore, Breiman [6] explained the usefulness of ensembles with unstable classifiers that are easily affected by changes to the training data, such as decision trees and neural networks.

An ensemble can be categorised into two main types according to the underlying machine learning algorithms used: homogeneous and heterogeneous. A homogeneous ensemble is constructed from learners of the same type, e.g. a set of decision trees. On the other hand, when the strategy is to combine different types of learners such as decision trees, neural networks, or Bayesian networks, then we have a heterogeneous ensemble.

In general, the aim of constructing an ensemble is to achieve a classification accuracy higher than that of any individual learner. Thus, the individual learners are expected to be accurate, with an error rate better than random guessing, and diverse, so that two classifiers make different errors when predicting a new instance. A number of methods for constructing a diverse ensemble have been developed [6, 13, 19, 27]. Below we explain two of the most popular ensemble methods, bagging and stacking, which are the ones used in this work.

Bagging

Bagging (also known as bootstrap aggregation) is one of the most common ensemble methods and can be applied to classification and regression problems [19, 52]. It is used to reduce the variance between models by generating additional training sets from the original data [52]. In one such method, a proportion of the data points is randomly chosen with replacement using the bootstrap mechanism, which generates multiple training sets, each containing approximately 63% of the training data points [52, 66]. Then, the same base learner (e.g. a decision tree) is run in parallel on these training sets. As a result, an ensemble of different models is generated. To make a prediction for new data, the final decision is made by a majority vote over the individual predictions obtained from the different models [19, 52].

Stacking

In the context of ensemble learning, meta-learning is the process of learning from the multiple learners and their outputs on the original training data. Such a method is efficient when individual classifiers misclassify the same patterns [54]. The method was introduced by Wolpert [76] and refers to a construction mechanism that uses the output of the classifiers, instead of the training data, to build the ensemble. A stacking ensemble can be implemented in two or more layers. In the first layer, a number of base learners are trained on the entire training set to produce the level-0 models. The predictions of the individual models are then used as input attributes (meta-level attributes) to the ensemble. The target of the original training set is appended to the meta-level attributes to form a new dataset, the level-1 data. This set is used to train a meta-classifier in the ensemble. The meta-classifier can be trained on either the predicted class labels or the probabilities generated by the level-0 models [67]. This model is used to make the final prediction of the ensemble.

Framework for MIE

Our ensemble for MI works as follows. We first generate a series of datasets with increasing amounts of missing data under the MCAR assumption. We then impute the artificial training datasets and generate five imputed datasets using two different MI techniques, MICE and EMB, as described in “Imputation Methods” section. We next use these datasets to train classifiers and build our bagging and stacking ensembles. For our bagging ensemble, we train homogeneous classifiers (the same classifier) on the imputed datasets. We then combine the predictions of the resulting models on a separate test set using a majority vote method. This method aggregates the predictions from the individual models and chooses the class that has been predicted most frequently as the final prediction, as illustrated in Fig. 1. The bagging ensemble is therefore evaluated using a hold-out test set. This can be viewed as an alternative form of bagging in which the multiple imputed datasets may be more dissimilar to each other and hence generate more diverse models. On the other hand, Fig. 2 represents the construction of our stacking ensemble, which has two layers. In the first, the multiple imputed datasets are used to train a number of learners (heterogeneous classifiers) to generate different models. The models are then tested against a separate test set to make a new dataset of predictions. This new dataset is combined with the actual class of the test set to construct the level-1 dataset, which is used as input for the second layer. In this layer we train a meta-classifier and then evaluate the performance of the ensemble using 10-fold cross-validation.
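As an illustration of the bagging variant, the sketch below trains one model per imputed dataset and combines the predictions by majority vote. We use rpart as a stand-in decision tree purely for illustration (the paper builds its ensembles with Weka classifiers such as J48); imputed_sets and test are placeholders, with a factor column class as the target.

```r
library(rpart)

# 'imputed_sets' is a list of completed training sets (e.g. from mice or Amelia)
# and 'test' a held-out test set; both contain the target factor column 'class'.
models <- lapply(imputed_sets,
                 function(d) rpart(class ~ ., data = d, method = "class"))

# Predictions of each model on the test set (one column per model)
preds <- sapply(models,
                function(m) as.character(predict(m, newdata = test, type = "class")))

# Majority vote across models for each test instance
majority_vote <- apply(preds, 1, function(p) names(which.max(table(p))))

# Accuracy of the bagging-style ensemble on the hold-out test set
mean(majority_vote == test$class)
```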

Fig. 1
figure 1

The bagging ensemble framework takes imputed datasets as inputs to train different classifiers \(C_{1}\), ...,\(C_{n}\) in layer 1. The predictions made by individual classifiers, \(P_{1}\), ...,\(P_{n}\), are combined by the majority vote method

Fig. 2
figure 2

The stacking ensemble framework takes imputed datasets as inputs to train classifiers \(C_{1}\), ...,\(C_{n}\). The predictions made by individual classifiers, \(P_{1}\), ...,\(P_{n}\), are used to form a new dataset that is used to train a meta-classifier in the second layer

Experimental Set-Up

Datasets

For our study, a collection of 20 benchmark datasets was obtained from the UCI machine learning repository [47]. The datasets have different sizes and feature types (numerical real, numerical integer, categorical and mixed), as shown in Table 1. They were all complete datasets, that is, they have no missing values, except the PostOperativePatient dataset, where three records with missing values were deleted.

Data Preparation

Before conducting the experiments we addressed the problem of the sparse datasets in our collection, LSVT and Forest Cover Type. LSVT has five attributes with zero values, so we deleted those features. We transformed Forest Cover Type by taking the attributes that represent Wilderness_Area (4 binary columns) and Soil_Type (40 binary columns) and reducing each to a single column with multiple values. The first new column thus has a numerical value (1–4) representing the presence of a particular wilderness area, while the second indicates the soil type with a value (1–40). Additionally, we followed the mechanism of Clark et al. [15] of treating Abalone as a three-category classification problem by grouping classes 1–8, 9 and 10, and 11, so as to improve the classification process.
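For illustration, the Forest Cover Type transformation could be sketched as below, assuming the 44 binary columns follow the usual UCI naming; the data frame covtype and the column-name patterns are placeholders.

```r
# 'covtype' is a placeholder data frame with the UCI Forest Cover Type layout.
# Collapse the 4 binary Wilderness_Area columns and the 40 binary Soil_Type
# columns into single columns taking values 1-4 and 1-40, respectively.
wa_cols   <- grep("^Wilderness_Area", names(covtype))
soil_cols <- grep("^Soil_Type", names(covtype))

covtype$Wilderness <- max.col(as.matrix(covtype[, wa_cols]))    # index of the column set to 1
covtype$Soil       <- max.col(as.matrix(covtype[, soil_cols]))

covtype <- covtype[, -c(wa_cols, soil_cols)]                    # drop the original binary columns
```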

Some datasets are provided with separate training and testing sets. The rest were partitioned using the StratifiedRemoveFolds filter in Weka to retain the class distribution, with a ratio of 70% training and 30% testing, except for Forest Cover Type, our largest dataset, where 60% of the data is used for training and 40% for testing.

Table 1 The details of the datasets collected for the experiments

Table 2 shows the mean accuracy and standard deviation of the classifiers (J48, NB, PART, SMO and RF) obtained on the testing sets when training on the original data with no missing values. These classification results for the complete data are then used as benchmarks to study how missing data affects the accuracy and performance of the algorithms when various methods for dealing with missing data are used. Note that we run RF five times with different random seeds as it obtains different results in each run given its stochastic nature. The RF classifier performs better than the other classifiers on nine of the datasets. SMO is the second best classifier, working best on five datasets out of twenty, while J48 and PART work best on three datasets each. The performance of NB is the worst of all the classifiers.

Table 2 The mean accuracy of the classifiers and standard deviation for the complete datasets obtained based on test set

We then test the performance of the multiple classifiers on the multiple datasets using the Friedman test to check which classifiers outperform others. The test shows a significant difference (p value \(< 0.05\)) in performance, so we proceed with the post hoc Nemenyi test. The Critical Difference diagram resulting from the Nemenyi test is shown in Fig. 3. The figure illustrates that RF behaves significantly better than PART and NB, although there are no statistical differences within each group, i.e. no statistically significant differences between RF, SMO and J48. Similarly, the performance differences between J48, PART, NB and SMO are not statistically significant.

Fig. 3
figure 3

Critical Difference diagram showing statistically significant differences between classifiers. The bold line connecting classifiers means that they are not statistically different

Missing Data Generation

To create scenarios for testing with increasing missing data, some values in the training sets are removed completely at random as follows. First, 10% (then 20% and 50%) of the attributes are randomly selected, and data is removed from the chosen attributes at rates of 5%, 15%, 30% and 50% of the records. We repeat the selection and removal process five times so that different features/records may be affected by missing data each time. As a result, 12 artificial datasets with multiple levels of missing data are produced from each of the original datasets each time. In total, we generate (\(20 \times 12 \times 5 = 1200\)) datasets. Table 3 summarises the experimental scenarios artificially created.
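A minimal sketch of this MCAR removal procedure in R is shown below; train is a placeholder holding the predictor attributes only (the class column is kept aside), and the percentages correspond to the scenarios in Table 3.

```r
# Remove values completely at random: 'attr_pct' of the attributes are chosen,
# and within each chosen attribute 'rec_pct' of the records are set to NA.
make_mcar <- function(data, attr_pct, rec_pct, seed = 1) {
  set.seed(seed)
  n_attr <- max(1, round(attr_pct * ncol(data)))
  chosen <- sample(ncol(data), n_attr)
  for (a in chosen) {
    rows <- sample(nrow(data), round(rec_pct * nrow(data)))
    data[rows, a] <- NA
  }
  data
}

# Example scenario from Table 3: 20% of attributes, 30% of their records removed
train_mcar <- make_mcar(train, attr_pct = 0.20, rec_pct = 0.30)
```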

Table 3 Experimental scenarios with missing data artificially created

In our experiments, the models are tested on separate test data. However, for the stacking ensemble, the results reported come from 10-fold cross-validation, as the predictions on the separate test sets are used to construct a new dataset for the second layer of the stacking ensemble.

Comparative Methods

Building Models with Missing Data (MD)

In “Mechanisms for Dealing with Missing Data” section we discussed how the chosen algorithms have their own internal ways of dealing with missing data. We therefore pass all the data, including missing values, to the algorithms without pre-processing. Such models are referred to as J48_MD, NB_MD, PART_MD, SMO_MD and RF_MD.

Simple Imputation (SI)

To test simple imputation, missing values in numerical attributes are replaced with the attribute mean and those in categorical attributes with the mode. The datasets produced after imputation are then used for classification model building. In our results the models with single imputation are referred to as J48_SI, NB_SI, PART_SI, SMO_SI and RF_SI.
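A minimal sketch of this mean/mode simple imputation in R (train_mcar is a placeholder dataset with missing values):

```r
# Replace missing values with the column mean (numeric) or mode (categorical)
simple_impute <- function(data) {
  for (f in names(data)) {
    miss <- is.na(data[[f]])
    if (!any(miss)) next
    if (is.numeric(data[[f]])) {
      data[[f]][miss] <- mean(data[[f]], na.rm = TRUE)
    } else {
      data[[f]][miss] <- names(which.max(table(data[[f]])))  # most frequent category
    }
  }
  data
}

train_si <- simple_impute(train_mcar)
```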

Random Forest Imputation (RFI)

We use an RF imputation package (missForest) implemented in R to replace missing values using an RF algorithm. The algorithm starts by filling incomplete data with the median for numeric attributes or the mode for categorical attributes. It then updates the missing values using the proximities from a random forest and iterates the imputation a number of times. Finally, the imputed value for an attribute with missing values is the weighted average of the non-missing values if it is numeric, or the mode if it is nominal. We set the number of imputation iterations to 5 and the number of trees grown in each forest to 300. In our results the models that used RFI are referred to as J48_RFI, NB_RFI, PART_RFI, SMO_RFI and RF_RFI.
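The corresponding missForest call, with the settings reported above (5 iterations, 300 trees per forest), might look like the following sketch; train_mcar is a placeholder data frame with missing values.

```r
library(missForest)

# Random forest imputation with 5 iterations and 300 trees per forest,
# as used in this paper. 'train_mcar' is a data frame containing NAs.
rfi <- missForest(train_mcar, maxiter = 5, ntree = 300)

train_rfi <- rfi$ximp   # the imputed dataset
rfi$OOBerror            # out-of-bag estimate of the imputation error
```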

Proposed MIE Methods

The following steps have been undertaken to test the bagging and stacking ensembles. First, the MICE and Amelia packages in R, which implement the Multivariate Imputation by Chained Equations and Expectation–Maximisation with Bootstrapping algorithms, respectively, are applied to generate five imputed datasets. For MICE, we set predictive mean matching as the imputation method, the number of iterations to 20 and the number of imputed datasets to 5. For Amelia, we also set the number of imputations to 5. Additionally, we perform the imputation in parallel when processing large and high-dimensional datasets.

The multiple imputed datasets are used as inputs to train the classifiers, and for our bagging ensemble we aggregate the predictions of the resulting models using the majority vote method. A single base learner is used as the classifier. Our classifiers of choice for the bagging ensemble are J48 [51], NB [45], PART [26], SMO [60] and RF [43] (as implemented in Weka) with their default options for classifying the data. In our results, such models are referred to as MICE_Hom and EMB_Hom, depending on the method used to perform the imputation in the first place.

The second ensemble approach tested is a stacking ensemble, where the training datasets are used to perform the imputation and then to train all the chosen classifiers to generate several models. These models form the first layer of the stack. The predictions of the different models on the testing set are used to form a new dataset for level-1 in the ensemble. We then train a meta-classifier on this new dataset in the second layer. For testing the stacking ensemble, we perform 10-fold cross-validation to evaluate the performance of the models. These are referred to as MICE_SE and EMB_SE, depending on the MI method.
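A minimal sketch of the stacking step in R, assuming the level-0 models have already been trained on the imputed datasets (as in the bagging sketch above); rpart is used as a stand-in meta-classifier for illustration, while the paper builds the ensemble in Weka.

```r
library(rpart)

# Level-0: the predictions of each model on the separate test set become the
# meta-level attributes; the true class of the test set is appended as the target.
level0 <- as.data.frame(
  lapply(models, function(m) predict(m, newdata = test, type = "class"))
)
names(level0) <- paste0("P", seq_along(models))
level1 <- cbind(level0, class = test$class)

# Level-1: train a meta-classifier on the level-1 data; its performance would
# then be estimated with 10-fold cross-validation as described above.
meta <- rpart(class ~ ., data = level1, method = "class")
```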

Evaluating Classification Methods

We use classification accuracy as the metric for our performance comparisons. We compare all the approaches, looking for differences in the algorithms’ performance in each scenario separately. We perform the Wilcoxon signed-rank test with Finner’s procedure for correcting the p values in pairwise comparisons, with a significance level of \(\alpha =0.05\) [18, 29], with a number of controls considered separately.

We perform two different statistical tests when evaluating the performance of classifiers over the datasets as follows:

  1. When comparing multiple classifiers over multiple datasets, we use the method described by Demšar [18], including the Friedman test and the post hoc Nemenyi test, which is represented as a Critical Difference (CD) diagram, with a significance level of \(\alpha =0.05\).

  2. The Wilcoxon signed-rank test is applied for pairwise comparisons. We use a control algorithm, the performance of each classifier with SI, with a significance level of \(\alpha =0.05\); a minimal sketch of both tests follows below.
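For illustration, both tests are available in base R as sketched below; acc is a placeholder matrix of accuracies with one row per dataset and one column per method, and note that Finner's p-value correction is not part of base R's p.adjust, so it would be applied separately.

```r
# 'acc' is a placeholder: one row per dataset, one column per method.
# Friedman test over multiple methods and multiple datasets (Demšar [18]).
friedman.test(as.matrix(acc))

# Pairwise Wilcoxon signed-rank test of one method against the SI control.
wilcox.test(acc[, "EMB_SE"], acc[, "SI"], paired = TRUE)
```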

Evaluating Imputation Methods

Numerous statistical methods are used to check the variation between and within the multiple imputed datasets [1, 3, 8, 35]. These involve graphical representations such as histograms, density plots and quantile–quantile plots [1]. Others suggest the use of numerical comparisons such as means and standard deviations [35]. In this study we instead propose the use of dissimilarity, as used in the context of clustering algorithms [9, 33], to evaluate the quality of the imputation methods. Dissimilarity is a numeric measurement of the degree of difference between data points. We check the quality of the imputation by comparing each imputed data point with its original counterpart. In that way we can measure the dissimilarity between each pair of points (original/imputed). We can then aggregate the dissimilarity across the whole dataset to arrive at a measure of the quality of the imputation, with imputations that produce points closer to the original being considered better than those where the dissimilarity is greater.

As we have different data types in our datasets, i.e. numeric, categorical, mixed, we use the weighted overall dissimilarity formula proposed by Gower [33], the Gower Coefficient, to compute the distance dis(a,b) between each data point, a, in the original dataset and the corresponding data point, b, in the artificial dataset after performing MI as follows:

$$\begin{aligned} dis(a,b)= \dfrac{\sum _{f=1}^N w_{ab}^{(f)} dis_{ab}^{(f)}}{\sum _{f=1}^N w_{ab}^{(f)}} \end{aligned}$$
(1)

where N denotes the total number of features in the dataset, \(w_{ab}^{(f)}\) is the weight assigned to feature f (we set w = 1 for each feature) and f indexes the features, which can be either numerical or categorical.

Before we apply the formula to measure distance, we standardise each numerical attribute, f, into a comparable range using the standardised z-score measure, to avoid attributes with a larger range having a bigger effect on the distance measurement. For this we use the following equation:

$$\begin{aligned} x^{'}=z(x_{f}) = (x_{f}-m_{f})/s_{f} \end{aligned}$$
(2)

where \(x_{f}\) denotes a value of attribute f, \(m_{f}\) the mean of the attribute and \(s_{f}\) the mean absolute deviation of that attribute. Then, we compute the distance, \(dis_{ab}^{(f)}\), as follows:

$$\begin{aligned} dis_{ab}^{(f)}=\left| x^{'}_{af} - x^{'}_{bf}\right| \end{aligned}$$
(3)

For each categorical attribute, the contribution to the overall dissimilarity is \(dis_{ab}^{(f)}=0\) if \(x_{af}\) and \(x_{bf}\) are identical and \(dis_{ab}^{(f)}=1\) otherwise.

The overall aggregated dissimilarity function remains in the range [0, 1]. Finally, we average the distances over all records to obtain the mean distance between the original and imputed data.
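A minimal sketch implementing Eqs. (1)–(3) in R is shown below, assuming equal weights (w = 1) and that original and imputed are aligned data frames with the same rows and columns; it is an illustration of the procedure rather than the exact script used.

```r
# Mean Gower dissimilarity between an original dataset and its imputed version,
# following Eqs. (1)-(3) with w = 1 for every feature.
gower_dissimilarity <- function(original, imputed) {
  stopifnot(all(dim(original) == dim(imputed)))
  num <- sapply(original, is.numeric)

  # Eq. (2): standardise numeric features with the mean and the
  # mean absolute deviation of the original data
  for (f in names(original)[num]) {
    m <- mean(original[[f]])
    s <- mean(abs(original[[f]] - m))
    if (s > 0) {
      original[[f]] <- (original[[f]] - m) / s
      imputed[[f]]  <- (imputed[[f]]  - m) / s
    }
  }

  # Eq. (3) for numeric features; 0/1 mismatch for categorical ones
  per_feature <- sapply(names(original), function(f) {
    if (num[[f]]) abs(original[[f]] - imputed[[f]])
    else as.numeric(as.character(original[[f]]) != as.character(imputed[[f]]))
  })

  # Eq. (1) with w = 1: average over features per record, then over all records
  mean(rowMeans(per_feature))
}
```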

Results

In order to understand how different algorithms behave under different imputation regimes, we began by investigating each algorithm separately. In particular, we applied our proposed methods that combine MI with ensemble techniques, MIE, along with the comparative approaches described in “Comparative Methods” section. We therefore study the performance of the internal mechanisms of the algorithms for handling missing data (J48_MD, NB_MD, PART_MD, SMO_MD and RF_MD), simple imputation (J48_SI, NB_SI, PART_SI, SMO_SI and RF_SI) and RF imputation (J48_RFI, NB_RFI, PART_RFI, SMO_RFI and RF_RFI). Our proposed MIE methods are represented by the combination of MI methods with bagging (MICE_Hom, EMB_Hom) and stacking ensembles (MICE_SE and EMB_SE). The details of the results can be obtained from the authors.

Classification Performance

For each of the classifier/imputation methods studied, we applied the Friedman statistical test [18] to compare the performance of the imputation methods, including our proposed approach. The test compares the mean ranks of the classifiers on a number of datasets as follows: with 7 algorithms (i.e. variations on imputation regimes) and 20 datasets, the test statistic \(F_F\) is distributed according to the \(F\) distribution with \(7 - 1 = 6\) and \((7 - 1) \times (20 - 1) = 114\) degrees of freedom. At a significance level of \(\alpha =0.05\), the critical value of \(F\) is 2.18.
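The critical value can be reproduced with base R, as in the brief sketch below.

```r
# Critical value of the F distribution used with the Friedman test,
# for k = 7 methods and N = 20 datasets at alpha = 0.05.
k <- 7; N <- 20
qf(0.95, df1 = k - 1, df2 = (k - 1) * (N - 1))   # approximately 2.18
```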

For the J48 algorithm, Table 4 summarises the mean rank for the different imputation/ensemble methods on the artificial datasets in each scenario separately. Note that a lower rank means better performance. On average, for J48 the stacking ensemble with EMB (EMB_SE) obtained the best rank and hence the best overall classification accuracy, with MICE_SE second best. J48_SI was the worst.

Similar results were obtained for the NB, PART and SMO algorithms, as illustrated in Tables 5, 6 and 7. In each case the EMB_SE algorithm produced the best performance in terms of ranking and hence overall accuracy on the different datasets. For both NB and PART, MICE_SE was second best. However, for SMO in Table 7, RFI was a close match to MICE_SE. For RF, shown in Table 8, EMB_Hom was the best in most scenarios, whereas the internal mechanism of RF for handling MD performed worse than the others in most cases.

Table 4 The mean rank for J48 on different imputation methods along with proposed approach on all dataset affected by missing data in all scenarios
Table 5 The mean rank of NB in combination with different imputation methods and of our proposed approach on all dataset affected by missing data for different scenarios
Table 6 The mean rank of PART in combination with different imputation methods and of our proposed approach on all dataset affected by missing data for different scenarios
Table 7 The mean rank of SMO in combination with different imputation methods and of our proposed approach on all dataset affected by missing data for different scenarios
Table 8 The mean rank of RF in combination with different imputation methods and of our proposed approach on all dataset affected by missing data for different scenarios

So far we have used the Friedman test to compute the average ranks. The test also allows us to compute a p value, to discern whether an algorithm performs significantly differently from the others according to the average rank obtained. Table 9 presents the p values resulting from the application of the Friedman test for each scenario and each algorithm and shows that the performance of the different imputation methods when combined with a given classifier was significantly different. The symbol \(*\) denotes that the test was significant (\(p<0.05\)). J48, NB and PART were statistically different when different imputation methods were applied in all scenarios tested. The performance differences for SMO were significant in most cases, while for RF this was the case in half of the scenarios.

Table 9 The p values resulting from the Friedman test for comparing the performance of each classifier with different imputation methods separately

We also used the Wilcoxon signed-rank test for pairwise comparison with a control algorithm. For this test we report the median (not mean) accuracy across all datasets. We chose the performance of the classifiers applied to data imputed by SI as the control, as this is a form of naive imputation which may be frequently used and is often used as a control against new data imputation methods (Table 9). MI combined with an ensemble in the case of the J48 algorithm (i.e. EMB_Hom, MICE_SE and EMB_SE) was statistically significantly better than the control in all cases, as shown in Table 10. On the other hand, MICE_Hom models and J48_RFI were significantly different in only a few scenarios. The internal mechanism of J48 for handling MD was not significantly different from SI.

For the NB algorithm, the results are shown in Table 11. We can see that only the combinations of MI with stacking (MICE_SE and EMB_SE) performed statistically differently from the control, while the other methods showed no difference.

For PART, as shown in Table 12, the combinations of MI with ensembles (EMB_Hom, MICE_SE and EMB_SE) were better than the control in all cases. On the other hand, MICE_Hom, PART_RFI and the internal method were significantly different from the control in only a few scenarios.

For SMO, performance of the EMB_SE approach is better in most but not all cases, and similarly for MICE_SE, as illustrated in Table 13. SMO_RFI was better than the control only when the rate of missingness increases. The EMB_Hom method was significantly better than the control when the levels of missing data were low.

For RF, Table 14 presents the comparison with the control and shows some improvements when EMB_Hom was used. For all other approaches to missing data there appears to be little difference.

Table 10 The median accuracy for J48 on different imputation/ensemble methods along with proposed approach resulting from Wilcoxon signed-rank test
Table 11 The median accuracy for NB on different imputation/ensemble methods along with proposed approach resulting from Wilcoxon signed-rank test
Table 12 The median accuracy for PART on different imputation/ensemble methods along with proposed approach resulting from Wilcoxon signed-rank test
Table 13 The median accuracy for SMO on different imputation/ensemble methods along with proposed approach resulting from Wilcoxon signed-rank test
Table 14 The median accuracy for RF on different imputation/ensemble methods along with proposed approach resulting from Wilcoxon signed-rank test

Quality of the Imputed Data

Here we first evaluate the quality of the imputation methods used, i.e. how far the imputed data is from the real data. We used the normalised dissimilarity measure explained in “Evaluating Imputation Methods” section to compute the mean dissimilarity between the imputed and the original data. We divide our analysis by feature type (i.e. numerical, categorical or mixed), as imputation may work differently for different data types. The number of datasets in each group is 10, 5 and 5, respectively.

The three plots at the top of Fig. 4 show the mean dissimilarity between the real and the imputed values using EMB, MICE, RFI and SI for the numerical datasets. In most of the scenarios RFI produced imputed data closest to the real data, as the mean dissimilarity was very close to 0. EMB was a close match to RFI, followed by MICE. However, data imputed by SI was the worst, as it was furthest from the real data, especially with increasing uncertainty. With respect to the categorical data, the plots in the middle of the figure show that the mean dissimilarities for all the imputation methods were close to each other and to the real data, though EMB produced the best performance in most scenarios. In the case of categorical data, all methods tested were efficient in terms of recovering missing values. On the other hand, different imputation methods behave differently with the mixed data type, as shown at the bottom of the figure: EMB produced data that was more similar to the real data, as the mean dissimilarity did not exceed 0.1 in the worst case. RFI was second best, followed by MICE. Again, SI was the worst in all cases.

Fig. 4
figure 4

The mean dissimilarity between the original and imputed data points as a result of applying different imputation methods on datasets where different percentages of features are affected by missing data at different levels. The first row represents numerical data, the second categorical data, and the last mixed data

Classification Results by Datatype

Finally, we analyse the efficiency of the imputation methods for different data types and relate this to the performance of the different classifiers. We do this separately for each classification algorithm. We present box plots showing the range of accuracies (max, min, median and any outliers) obtained for all the datasets of a given data type. The different scenarios in terms of the percentage of missing data are represented on the x-axis, though we combined the results for missing data affecting 10%, 20% and 50% of features into one box plot as the same patterns were observed for each. The grey box plot in each graph represents the accuracy on the complete dataset, before any data is removed.

Figure 5 shows the range of accuracies as box plots for the J48 algorithm applied to numerical (left), categorical (centre) and mixed (right) datasets. On numerical datasets (left plot), two methods clearly stand out, as the median values for EMB_SE and MICE_SE were much higher than those of the other methods and of the complete data for all levels of missing data. Also, the maximum accuracy of both EMB_SE and MICE_SE increased by about 10% compared with the complete data, again for all levels of missing data. The EMB_Hom method also shows some improved performance, though not so marked. The other methods perform similarly to one another and to the complete data. For the categorical data (centre plot), a similar pattern for the median accuracy is observed, with EMB_SE and MICE_SE showing the best median performance and some, though less marked, improvement in maximum accuracy. Overall, the median accuracy of most approaches decreased with increasing uncertainty, but for EMB_SE and MICE_SE it was higher than both the complete data and the other methods in most scenarios. On mixed datasets (right plot), the median accuracy of all approaches seemed similar to the complete data, but the maximum average accuracy increased when applying MICE_SE and EMB_SE. Outliers, represented by dots in the plot, were present for all methods tested on mixed data.

Fig. 5
figure 5

This figure contains box plots describing the overall average accuracy of J48 applied on imputed datasets using different imputation approaches along with the average accuracy on the complete data

For the NB algorithm, similar results are shown in Fig. 6. However, for NB, MICE_SE and EMB_SE show performance improvements in terms of both maximum and median average accuracies on the mixed datasets.

Fig. 6
figure 6

This figure contains box plots describing the overall average accuracy of NB applied on imputed datasets using different imputation approaches along with the average accuracy on the complete data

For PART, shown in Fig. 7, the median accuracy of PART_MD, PART_SI, PART_RFI, MICE_Hom and EMB_Hom deteriorated on numerical datasets in comparison with the original data, while performance improves when MICE_SE and EMB_SE are applied. On categorical datasets, all the methods helped to keep performance similar to that of the complete data for low percentages of missing values, but not when the uncertainty increases. The performance of EMB_Hom, MICE_SE and EMB_SE was the best on both categorical and mixed data.

Fig. 7
figure 7

This figure contains box plots describing the overall average accuracy of PART applied on imputed datasets using different imputation approaches along with the average accuracy on the complete data

For SMO, shown in Fig. 8, the median accuracies of all methods on numerical datasets were similar to each other and to the complete data, while the minimum accuracy of MICE_SE and EMB_SE increased by up to 10% compared with the complete data. Similarly, all approaches tested on the categorical data were relatively close. On mixed datasets, the median accuracy of the classifier seemed equal to that of the complete data, but the maximum accuracy increased when applying MICE_Hom, MICE_SE and EMB_SE.

Fig. 8
figure 8

This figure contains box plots describing the overall average accuracy of SMO applied on imputed datasets using different imputation approaches along with the average accuracy for the complete data

Finally, the performance of RF with respect to different data types is shown in Fig. 9. All approaches tested were relatively similar, so for this algorithm the imputation method produced minor or no improvements. MICE_SE and EMB_SE improved the maximum average accuracy for the categorical data but did not perform so well when increasing missing data was present. For the mixed data all approaches seemed similar.

Fig. 9
figure 9

This figure contains box plots describing the overall mean accuracy of RF applied on imputed datasets using different imputation approaches along with original data

Discussion and Conclusions

In this study, we investigate how different classification algorithms behave when using various methods for missing value imputation. We propose our MIE approach to improve classification with missing data and compare it with other methods for dealing with missing data.

For J48, NB, PART and, to a large extent, SMO, the proposed EMB_SE produced the best performance for most levels of missing data, with MICE_SE a close second. For high levels of missing data SMO worked well with RFI. For the RF algorithm, however, EMB_Hom produced the best performance in most cases. The differences in performance were statistically significant in all cases for J48, NB and PART, and in the majority of cases for SMO. For RF, they were statistically significant in some scenarios only.

On the other hand, when comparing the different approaches with a control method for imputation in the form of SI, we found that in most cases the proposed MIE techniques that rely on stacking (MICE_SE and EMB_SE) obtain statistically significantly better classification accuracy than the control when working with J48, NB and PART. This was also true for SMO in the majority of scenarios, but not for RF, where EMB_Hom showed more significant improvements, consistent with our previous results.

It is not possible to directly compare our results with related work due to differences in datasets and experimental set-up. However, some comparisons are possible. For example, our findings, particularly for J48, are consistent with similar work by Tran et al. [69], who combined data imputed by MICE with an ensemble using the majority vote method. Their proposed approach achieved an improvement in classification accuracy. In our work we obtained further improvements in accuracy when using EMB imputation and stacking ensembles (MICE_SE and EMB_SE).

We proposed the use of dissimilarity to assess how far the imputed data is from the real data so that we can relate this to the performance of the algorithms. For numerical data RFI seems to perform best, particularly for growing percentages of missing data. For categorical data EMB appears best by a very small margin, except in the highest missing data scenario. For mixed data EMB seems always best.

However, from further analysis of performance on each data type we can see that the imputation that recovers data best does not necessarily lead to better classification performance. From the box plot analysis, we find that for all algorithms and data types except RF, EMB_SE and MICE_SE produce consistently better performance than the others, hinting at the fact that the ensemble plays a big part in producing good results. For RF again most methods seem to perform similarly, though EMB_SE and MICE_SE are still consistently good performers. This indicates that the ensemble in itself produces improvements irrespective of the quality of the imputation. As RF is already an ensemble algorithm, the advantages of MI for RF appear less obvious than for the others.

One of our important findings is that even in scenarios of increasing uncertainty, it is possible to obtain results similar to or in some cases better than those obtained with the complete data, if the right imputation technique is used. This is an important finding, as reasoning with missing data then becomes less of a problem in the context of MCAR data. In this sense our proposed MIE methods, particularly those using stacking as the ensemble method, produce consistently good results. Although multiple imputation may consume time and memory, particularly with large datasets, its advantages in terms of representing the uncertainty as well as introducing diversity for the ensemble classifiers enable us to improve classification accuracy in scenarios with large levels of missing data and for most classification algorithms tested. If an algorithm such as RF is used, then the imputation method appears less relevant, although a poor imputation method like SI can produce deteriorated performance, particularly for mixed data.

As future work, we could increase the number of imputed datasets to test whether more diversity produces further improvements. We could also test the implications of MAR data by providing a different experimental set-up.