1 Introduction

Solving a classification task often requires demanding data preprocessing, one step of which is the treatment of missing values. In practice, we face both randomly located missing values within instances and entirely missing features. Missing data occur in many real-world scenarios, e.g. [10, 25, 26], and can also be part of a cold-start problem. Imputation treatments for missing values have been widely investigated [8, 14, 24] and plenty of methods for reconstructing missing data have been designed, but these methods are not directly intended for the reconstruction of entire missing features.

This work focuses on the influence of entirely missing features and the possibilities of their reconstruction for use in predictive modeling. We consider the following scenario: a classification model is trained on a dataset containing a complete set of continuous features but has to be used to predict classes of a dataset in which some entire features are missing. The reconstruction of entire features and their use in an already learned model distinguishes our work from others. Our points of interest are how missing features impact the accuracy of the classification model, what possibilities for reconstructing entire missing features exist, and how the model performs with imputed data. In our setting, the reconstruction of missing features, i.e. data imputation, is the very first task of transfer learning methods [17], where the identification of identical, missing, and new features is crucial.

Experimental results of this work should shed more light on the applicability of state-of-the-art imputation methods and their ability to reconstruct entire missing features. We deal with traditional imputation methods: linear regression, k-nearest neighbors (k-NN), and multiple imputation by chained equations (MICE) [24], as well as with modern methods: the multi-layer perceptron (MLP) and gradient boosted trees (XGBT) [6]. Experiments are performed on four real and six artificial datasets. The influence of imputation is studied on six commonly used binary classification models: random forest, logistic regression, k-NN, naive Bayes, MLP, and XGBT. The amount of missing data varies between one feature and \(50\%\) of all features.

This paper is structured as follows. In the next section we briefly review related work. Section 3 introduces imputation methods that are being analyzed in this work. Multiple features imputation is also discussed here. In Sect. 4 we describe the experiments that were carried out and present their results in Sect. 5. Finally, we conclude the paper in Sect. 6.

2 Related Work

Many surveys summarize missing value imputation methods, e.g. [5, 8, 10, 11, 16, 22, 25]. Many of them are more than five years old and focus on traditional imputation methods.

A very good review of methods for the imputation of missing values was provided by [8]. That study focuses on discrete values only, with up to \(50\%\) missingness. The authors experimentally evaluated six imputation methods (hot-deck, imputation framework with hot-deck, naive Bayes, imputation framework with naive Bayes, polynomial multiple regression, and mean imputation) on 15 datasets in combination with 6 classifiers. Their results show that all imputation methods except mean imputation reduce the classification error when missingness exceeds \(10\%\). The decision tree and naive Bayes classifiers were found to be resistant to missing data, while the other classifiers benefit from imputed data.

In [25], the performance of imputation methods was evaluated on datasets with varying amounts of missingness (up to \(50\%\)). Two scenarios were tested: values missing only during the prediction phase, and values missing during both the induction and prediction phases. Three classifiers were used: a decision tree, k-NN, and a Bayesian network. Imputation by mean, k-NN, regression, and an ensemble served as imputation methods. The experimental results show that the presence of missing values always reduces the performance of the classifier, no matter which imputation method is used. However, if there are no missing data in the training phase, imputation methods are highly recommended at prediction time.

Finally, in [3], Arroyo et al. present the imputation of missing ozone values in real-life datasets using various imputation methods (multiple linear and nonlinear regression, MLP, and radial basis function networks) and demonstrate the usefulness of artificial neural networks.

3 Imputation Methods

Plenty of methods for missing data reconstruction have been designed. They perform differently on various datasets, and in practice the most suitable imputation method for a given dataset is usually chosen according to the average performance (e.g. RMSE) of each method during the training phase [20].

First, let us briefly introduce the imputation methods we focus on in this study. The most basic ones are linear regression and k-NN (see e.g. [9]).

MICE [21, 24] does not simply impute each missing value with the single best-fitting value; it also tries to preserve some of the randomness of the original data distribution. This is accomplished by performing multiple imputations, see [19]. MICE gives very good results and is currently one of the best-performing methods [24]. In our research we use MICE in a simplified way: the multiple imputations are pooled using the mean before the classification model is applied. The reason is that we want to simulate the situation when the use of the classification model is restricted.

The MLP [22], with at least one hidden layer and no activation function in the output layer, and XGBT (see [6] for more details) are considered modern imputation methods.

3.1 Multiple Features Imputation

There are two ways of imputing several missing features with the previously mentioned methods. The first is to impute all features simultaneously, which can be done with the k-NN and MLP models. The second, usable for all other methods, is to apply the model sequentially, one missing feature after another. To do this, however, one must choose an order in which the features will be imputed. We focus on an ordering in which the easiest-to-impute features are treated first.

In the case of k-NN and MICE such a sequential imputation is not needed: for k-NN the neighbors typically do not change in subsequent steps, and MICE is already prepared for multiple feature imputation through its internal chained equation approach [21, 24].

Linear Imputability

A simple way of measuring the imputation easiness of features is the multiple correlation coefficient [2]. The multiple correlation coefficient \(\rho _ {X, {\boldsymbol{X}}'}\) between a random variable \(X\) and a random vector \({\boldsymbol{X}}' = (X_1', \dotsc , X_n')^T\) is the highest correlation coefficient between \(X\) and a linear combination \(\alpha _1 X_1' + \dotsb + \alpha _n X_n' = \boldsymbol{\alpha }^T {\boldsymbol{X}}'\) of the random variables \(X_1', \dotsc , X_n'\),

$$ \rho _ {X, {\boldsymbol{X}}'} = \max _{\boldsymbol{\alpha }\in \mathbb {R}^n} \rho _{X, \boldsymbol{\alpha }^T{\boldsymbol{X}}'}. $$

It takes values between 0 and 1, where \(\rho _ {X, {\boldsymbol{X}}'} = 1\) means that \(X\) can be predicted perfectly by linear regression from \({\boldsymbol{X}}'\), and \(\rho _ {X, {\boldsymbol{X}}'} = 0\) means that linear regression will not be successful at all.

When \(X_1,\dotsc , X_p\) are the p features, we call the multiple correlation coefficient \(\rho _{X_i, {\boldsymbol{X}}_{-(i)}}\) between \(X_i\) and a random vector of other features \({\boldsymbol{X}}_{-(i)} = (X_1, \dotsc , X_{i-1}, X_{i+1}, \dotsc , X_p)^T\) the linear imputability of feature \(X_i\).

The estimation of the linear imputability is based on the following expression

$$ \rho _{X_i, {\boldsymbol{X}}_{-(i)}}^2 = \frac{{{\,\mathrm{cov}\,}}(X_i, {\boldsymbol{X}}_{-(i)})^T \big ({{\,\mathrm{cov}\,}}({\boldsymbol{X}}_{-(i)})\big )^{-1} {{\,\mathrm{cov}\,}}(X_i, {\boldsymbol{X}}_{-(i)})}{{{\,\mathrm{var}\,}}(X_i)}, $$

where \({{\,\mathrm{cov}\,}}(X_i, {\boldsymbol{X}}_{-(i)})\) is the vector of covariances between \(X_i\) and the remaining features \(X_1, \dotsc , X_{i-1}, X_{i+1}, \dotsc , X_p\), and \({{\,\mathrm{cov}\,}}({\boldsymbol{X}}_{-(i)})\) is the \((p-1) \times (p-1)\) variance-covariance matrix of the remaining features.
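The expression above translates directly into a sample estimate. The following minimal sketch (our illustration, not the authors' code; all names are ours) estimates the squared linear imputability from a complete training matrix:

```python
import numpy as np

def multiple_corr2(S, target, known):
    """Squared multiple correlation of feature `target` with the features
    in `known`, computed from a (sample) covariance matrix S."""
    c = S[np.ix_(known, [target])].ravel()   # cov(X_i, X_{-(i)})
    S_kk = S[np.ix_(known, known)]           # cov(X_{-(i)})
    return float(c @ np.linalg.solve(S_kk, c) / S[target, target])

# Example: linear imputability of feature 0 given all other features,
# assuming X_train is an (n_samples, p) array of complete data:
# S = np.cov(X_train, rowvar=False)
# rho2 = multiple_corr2(S, target=0, known=list(range(1, S.shape[0])))
```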

If we want to impute multiple features, say \(X_i, X_{i+1}, \dotsc , X_{i+k}\), in the first step we choose \(X_j\), \(i \le j \le i+k\), such that \(\rho _{X_j, {\boldsymbol{X}}_{-(i, \dotsc , i+k)}}\) is the largest, where \({\boldsymbol{X}}_{-(i, \dotsc , i+k)} = (X_1, \dotsc , X_{i-1}, X_{i+k+1}, \dotsc , X_p)^T\) is the vector of the remaining features. In the next step we repeat the process with \(X_j\) treated as a known feature: we choose \(X_l\), \(i \le l \le i+k\), \(l \ne j\), such that its linear imputability with respect to the random vector \({\boldsymbol{X}}_{-(i, \dotsc , j-1, j+1, \dotsc , i+k)}\) is the largest. We continue this way until all missing features are imputed, as sketched below.
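A hedged sketch of this greedy ordering, reusing the multiple_corr2 helper from the previous snippet (both are our illustrations):

```python
def greedy_imputation_order(S, missing):
    """Order the missing features so that the one with the largest linear
    imputability w.r.t. the currently known features is imputed first."""
    p = S.shape[0]
    known = [j for j in range(p) if j not in set(missing)]
    remaining, order = list(missing), []
    while remaining:
        # Recalculate the linear imputability against the current `known` set.
        best = max(remaining, key=lambda j: multiple_corr2(S, j, known))
        order.append(best)
        known.append(best)   # an imputed feature becomes "known"
        remaining.remove(best)
    return order
```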

Note that we recalculate the linear imputability in every step. This should not be done if the imputation itself is performed with linear regression, since the recalculated values, re-estimated on the full training set, are then unachievable: an imputed feature is itself a linear function of the known features and carries no new information.

Information Imputability

Linear imputability is a simple measure of how linear regression imputation will perform. However, when one uses more sophisticated imputation models such as MLP or XGBT, which can handle non-linear dependencies, linear imputability may not be suitable.

Hence we propose another way to measure imputability, based on a particular result from information theory. If a feature \(X_j\) is predicted by an estimator \(\hat{X}_j\) based on the other features represented by a vector \({\boldsymbol{X}}_{-(j)}\), i.e. \(\hat{X}_j \equiv \hat{X}_j\big ({\boldsymbol{X}}_{-(j)}\big )\), then it can be shown (see [7]) that

$$ {{\,\mathrm{E}\,}}\big (X_j - \hat{X}_j\big )^2 \ge \frac{1}{2\pi \mathrm {e}} \mathrm {e}^{2H(X_j | {\boldsymbol{X}}_{-(j)})}, $$

where \(H(X_j | {\boldsymbol{X}}_{-(j)})\) is the conditional (differential) entropy of \(X_j\) given \({\boldsymbol{X}}_{-(j)}\).

Hence the lower bound on the expected prediction error is determined by the conditional entropy \(H(X_j | {\boldsymbol{X}}_{-(j)})\): the greater the entropy, the worse the best achievable prediction of \(X_j\) from the other features. One may therefore measure imputability by the conditional entropy multiplied by \(-1\), so that larger values correspond to better imputability. Hence we define the information imputability as the value of \(-H(X_j | {\boldsymbol{X}}_{-(j)})\).
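In practice the conditional entropy must be estimated from data. A minimal sketch of how information imputability could be estimated is given below, using a textbook form of the Kozachenko-Leonenko k-NN entropy estimator [12] and the identity \(H(X_j | {\boldsymbol{X}}_{-(j)}) = H({\boldsymbol{X}}) - H({\boldsymbol{X}}_{-(j)})\); the exact estimator variant and all names are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors

def kl_entropy(X, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (nats)."""
    n, d = X.shape
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = np.maximum(dist[:, k], 1e-12)   # distance to the k-th neighbour
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log unit-ball volume
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r))

def information_imputability(X, j, k=3):
    """-H(X_j | X_{-(j)}) estimated as H(X_{-(j)}) - H(X)."""
    return kl_entropy(np.delete(X, j, axis=1), k) - kl_entropy(X, k)
```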

The process of multiple feature imputation is now exactly the same as with linear imputability: one first imputes the feature with the largest information imputability. The only difference is that in the second and all subsequent steps recalculation makes no sense, since no new information can be gained regardless of the model used for the imputation. This partially simplifies the selection of the imputation order.

On the other hand, the problem that strongly limits its practical usage is the estimation of the conditional entropy. Even the most recently proposed estimators [15, 23] suffer from the curse of dimensionality, since all of them are based on the k-NN approach introduced by Kozachenko and Leonenko [12]. As our numerical experiments indicate, the method is limited to approximately five features, depending on the underlying joint distribution.

4 Experiments

Our experiments consist of the following steps. First, the original dataset is divided into a training part (\(70\%\)) and a test part (\(30\%\)). Several classification models as well as all imputation methods are trained on the training part. The imputation models are trained for scenarios where each individual feature is missing and where randomly selected combinations of multiple features are missing; the degree of missingness varies from \(10\%\) to \(50\%\). Finally, the accuracy of all classification models combined with all imputation methods is evaluated on the test part.

4.1 Settings and Parameters of Imputation Methods

Experiments were done using various settings; to keep the report short we present only those with satisfactory results. All experiments were implemented in Python 3.

The k-NN imputation (knn) was implemented using the fancyimpute library. A missing value is imputed by the sample mean of the values of its neighbors, weighted according to their distances. In the case where multiple features are missing, we impute all missing values at once (per row). In the presented results the hyper-parameter k is always \(k=5\); this value was chosen based on preliminary experiments and with respect to computational time.
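The paper uses fancyimpute; a functionally similar, hedged sketch with scikit-learn's KNNImputer (our substitution, not the authors' exact code) looks as follows:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))   # complete training data
X_test = rng.normal(size=(50, 5))
X_test[:, 2] = np.nan                 # feature 2 entirely missing

# weights="distance" gives an inverse-distance weighted neighbour mean.
imputer = KNNImputer(n_neighbors=5, weights="distance").fit(X_train)
X_test_imputed = imputer.transform(X_test)
```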

For the MICE method (mice) we also used the fancyimpute library. The parameter setup was inspired by [4]: the number of imputations was 150, the internal imputation model was Bayesian ridge regression, and the multiple imputed values were pooled using the mean.
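A minimal sketch of this simplified MICE scheme, written with scikit-learn's IterativeImputer as a stand-in for fancyimpute (the stand-in, the reduced number of imputations, and the masking step are our assumptions):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def mice_mean_pooled(X_train, X_test_missing, missing_cols, n_imputations=10):
    """Run several stochastic chained-equation imputations and pool them
    with the mean, as in the simplified `mice` setting."""
    X_fit = X_train.copy()
    # Mask the same columns in part of the training data so the imputer
    # actually learns models for them (IterativeImputer only imputes
    # features that were missing during fit).
    rows = np.random.default_rng(0).choice(
        len(X_fit), max(1, int(0.3 * len(X_fit))), replace=False)
    X_fit[np.ix_(rows, missing_cols)] = np.nan
    draws = []
    for seed in range(n_imputations):
        imp = IterativeImputer(estimator=BayesianRidge(),
                               sample_posterior=True, random_state=seed)
        imp.fit(X_fit)
        draws.append(imp.transform(X_test_missing))
    return np.mean(draws, axis=0)
```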

Linear regression imputation was implemented using the scikit-learn library [18]. For the case when multiple features were missing we tested two scenarios: one based on the linear imputability (linreg-li) and an iterative approach (linreg-iter) which corresponds to the chained equations in MICE. The iterative approach repeats two steps: first, every missing value is imputed from the known features only; second, all the imputed values are iteratively re-imputed from the other features (all features except the one being imputed). A sketch is given below.
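A hedged sketch of the linreg-iter scenario as we read it (the function name, fixed iteration count, and array layout are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def linreg_iter(X_train, X_test, missing, n_iter=10):
    """Impute the columns in `missing` of X_test: initialise them from the
    known features, then repeatedly re-impute each from all other features."""
    X_test = X_test.copy()
    p = X_train.shape[1]
    known = [j for j in range(p) if j not in set(missing)]
    # Step 1: initial imputation from the known features only.
    for j in missing:
        init = LinearRegression().fit(X_train[:, known], X_train[:, j])
        X_test[:, j] = init.predict(X_test[:, known])
    # Step 2: the per-feature models are fit once on the complete training
    # data; only the test-side predictions are iterated.
    models = {j: LinearRegression().fit(
        X_train[:, [c for c in range(p) if c != j]], X_train[:, j])
        for j in missing}
    for _ in range(n_iter):
        for j in missing:
            rest = [c for c in range(p) if c != j]
            X_test[:, j] = models[j].predict(X_test[:, rest])
    return X_test
```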

The MLP imputation was implemented using the scikit-learn library in two scenarios: the first (mlp) imputes all missing features at once and the second (mlp-li) imputes them sequentially based on linear imputability. The hyper-parameters of the MLP (learning rate, numbers and sizes of hidden layers, activation function, number of training epochs) were tuned using randomized search. XGBT was implemented using the xgboost library in two scenarios: the first (xgb-li) is an analogy to mlp-li and the second (xgb-iter) to linreg-iter. The hyper-parameters (learning rate, number of estimators, maximum depth of trees) were again tuned using randomized search.
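For illustration, a randomized search over an MLP imputer might look like the sketch below; the search space, number of draws, and scoring are illustrative assumptions, not the paper's settings.

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor

# MLPRegressor supports multi-output targets, so one model can impute
# several missing features at once (the `mlp` scenario).
search = RandomizedSearchCV(
    MLPRegressor(max_iter=500),
    param_distributions={
        "learning_rate_init": loguniform(1e-4, 1e-1),
        "hidden_layer_sizes": [(32,), (64,), (64, 32)],
        "activation": ["relu", "tanh"],
    },
    n_iter=20, cv=3, scoring="neg_root_mean_squared_error",
)
# search.fit(X_train_known, X_train_missing_targets)
```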

The sequential multiple features imputation scenario based on information imputability is not presented here, since in preliminary experiments it did not bring any benefit over linear imputability.

4.2 Evaluation

Imputation methods were evaluated using six binary classification models: k-NN, MLP, logistic regression (LR), XGBT, random forest (RF), and naive Bayes (NB), where LR, RF, and NB were provided by the scikit-learn library. We again used randomized search to obtain classifier hyper-parameter configurations for each dataset.

First, we trained all classification models and measured their performance on the full test dataset with no missing features (see Table 1). Second, we combined them with the imputation methods and measured the accuracies of all classification models on the imputed test dataset. Finally, we expressed the imputation performance as the change in accuracy with respect to the full test dataset, as sketched below.
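A minimal sketch of this evaluation step (the function and variable names are ours):

```python
from sklearn.metrics import accuracy_score

def accuracy_change(clf, X_test_full, X_test_imputed, y_test):
    """Imputation performance: change in classification accuracy relative
    to the full (no missing features) test set, in percentage points."""
    acc_full = accuracy_score(y_test, clf.predict(X_test_full))
    acc_imputed = accuracy_score(y_test, clf.predict(X_test_imputed))
    return 100.0 * (acc_imputed - acc_full)
```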

4.3 Datasets

We use both artificial and real datasets, presented in Table 1. All datasets have continuous features and binary target labels, and contain complete data without missing values. We assume all features are in a form suitable for classifying the target label.

The real Wine Quality dataset originally contains ten target classes, which were symmetrically merged into two to obtain a binary classification task. The artificial datasets were generated using the make_classification method of the scikit-learn library. They contain informative and redundant features: informative features are drawn independently from the standard normal distribution, and redundant features are generated as random linear combinations of the informative ones. Noise drawn from a centered normal distribution with variance 0.1 is added to each feature, as sketched below.
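A hedged sketch of how such a dataset (e.g. ds_10_7_3) could be generated; the sample size and random seed are our assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)
# ds_10_7_3: 10 features, 7 informative, 3 redundant (cf. Table 1).
X, y = make_classification(n_samples=5000, n_features=10,
                           n_informative=7, n_redundant=3,
                           random_state=42)
X += rng.normal(scale=np.sqrt(0.1), size=X.shape)  # variance-0.1 noise
```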

Table 1. Details of datasets with corresponding classification model accuracies. The number of features (# feat.) does not include the target label. The name ds_a_b_c stands for an artificial dataset where a is the number of features, b is the number of informative features, and c is the number of redundant features. Bold values of accuracy correspond to the two best models for a given dataset.

5 Results of Experiments

Results of single feature imputation are shown in Table 2, where we present the measured accuracy changes as the sample mean ± the sample standard deviation. The top \(10\%\) of imputation methods for each dataset and classification model are printed in bold. Two typical scenarios are shown in more detail in Fig. 1.

Results of the multiple features imputation for the two best models on each dataset are presented in Tables 3 and 4 for real and artificial datasets, respectively. Visualizations of typical results are given in Fig. 2 for a selected real dataset and in Fig. 3 for a selected artificial dataset. Box plots show the results for different imputation methods and portions of missing features.

Fig. 1. Change in classification accuracy under all imputation methods for single missing features. Each feature is linked across methods by a line.

Fig. 2. Classification accuracy change of the MLP model on the real dataset Spambase.

Fig. 3. Classification accuracy change of the XGBT model on the artificial dataset Ringnorm.

5.1 Discussion

The results are highly dataset specific. For some datasets (Cancer, all ds_... datasets) the decrease in classification accuracy was only minor, less than \(1\%\), even with \(50\%\) of features missing. On the other hand, for some datasets (MAGIC, Ringnorm) the decrease is much greater: \(1\%\) to \(2\%\) for \(10\%\) of missing features and around \(10\%\) for \(50\%\) of missing features.

Table 2. Mean accuracy changes in percentages (± standard deviation) for single missing feature imputation. Classification methods are shown in the last six columns and imputation methods are given in the second column.
Table 3. Mean accuracy changes shown in percentages (± standard deviation) on real datasets with missingness from \(10\%\) up to \(50\%\) where only the two best classification models for each dataset are shown.
Table 4. Mean accuracy changes shown in percentages (± standard deviation) on artificial datasets with missingness from \(10\%\) up to \(50\%\) where only the two best classification models for each dataset are shown.

From the imputation methods' point of view, MICE usually performs best on real datasets. On artificial datasets it places among the best methods only for the Ringnorm and ds_10_7_3 datasets. Its results often have a smaller variance than those of other methods.

Results comparable to MICE were often reached by linear regression imputation (specifically linreg-li); on artificial datasets it usually performs best. In most cases either MICE or linear regression is the best method.

The performance of XGBT and MLP is much more dataset dependent. It is usually not comparable to the best method, and it also strongly depends on which classification model is used and on how many features are missing. See e.g. the MAGIC dataset, where MLP performs well for \(30\%\) of missing features but badly for \(10\%\), or Spambase, where a similar discrepancy holds for XGBT. Finally, k-NN almost always performs worse than the other methods; the only exception is the ds_20_14_6 dataset with the random forest classification model.

Considering the amount of missing features, the results seem to depend on the portion of missing features rather than on their absolute number.

When we restrict ourselves to the reconstruction of one missing feature, the results are again highly dataset specific. For the Cancer, Spambase, and ds_... datasets the accuracy after imputation actually increases; this is probably because the original classification models were overfitted and proper imputation enables them to generalize better. On the other hand, for the MAGIC dataset the performance decrease was around \(1\%\) to \(2\%\).

One can summarize that the best imputation methods were MICE, which performs well on real datasets, and linear regression, which performs well also on artificial datasets. In some cases comparable results were reached by the XGBT and MLP imputations. Only the k-NN imputation does not perform well enough.

If one analyzes all classification models (not just the two best), classification models with higher accuracy suffer larger losses on imputed datasets than less accurate models, as can be expected. In a model with low accuracy, the classification accuracy decreases only slightly when imputed data are used.

6 Conclusion

We focused on the reconstruction of entirely missing features and its impact on the classification accuracy of an already learned model. We dealt with traditional imputation methods: linear regression, k-NN, and MICE, as well as with modern methods: MLP and XGBT. We also proposed two measures, linear and information imputability, for ordering missing features when several of them are imputed sequentially. In practice, however, information imputability is hard to estimate and does not provide satisfactory results.

Comprehensive experiments are presented on four real and six artificial datasets. The influence of imputation is studied on six commonly used binary classification models: random forest, logistic regression, k-NN, naive Bayes, MLP, and XGBT. The amount of missing data varies between \(10\%\) and \(50\%\).

As our results indicate, MICE and linear regression are generally good imputers regardless of the amount of missingness or the classification model used. This can be seen as a form of generality, valuable when the classification model to be used is unknown.

As was also shown, the modern imputation methods MLP and XGBT did not perform as well as expected. They rarely place among the top methods, and their performance is often among the lowest. This is surprising, since in many current machine learning tasks these methods are among the best.

The experimental results of this work shed more light on the applicability of state-of-the-art imputation methods and their ability to reconstruct entire missing features. The study is also valuable for its scope of datasets, methods, and portions of missing data (up to \(50\%\)).