1 Introduction

When working with real-world datasets one of the standard problems that needs solving as part of the data preprocessing phase is dealing with missing data. The missingness can be represented by either individual missing data randomly located in instances or by the absence of entire features.

To the best of our knowledge, little attention has been paid to the second scenario, where entire features are missing. In particular, there are no clear answers to questions such as how to handle this situation, how standard imputation methods will perform, or whether the challenge needs to be approached in a different way.

The aim of our work is to study these issues by experimentally comparing several state-of-the-art imputation methods in real-world scenarios where one needs to impute (i.e., reconstruct) entire features. This work follows up on our previous work presented in paper [12], where we focused on the comparison of traditional (k-NN, linear regression, MICE) and modern (multi-layer perceptron, extreme gradient boosted trees) imputation methods.

In the current paper, we research more universal imputers represented by autoencoders and generative neural network models. These models have a common advantage in that one does not need to know in advance which features are missing. In contrast, regular imputation methods need to be trained separately for each combination of missing features. A typical example where a universal imputer is needed is the prediction of a classification model from sensor data, where a sensor breakdown leads to missing data in one or more features. Usually, the prediction model itself is not able to handle this situation without a significant decrease in its performance. Furthermore, one typically does not know in advance which sensor is going to break. The best approach would be to retrain the model using data without the missing features. However, in a production setting, model retraining is impossible as the existing model needs to respond to corrupted data immediately.

We consider a situation where the prediction model is trained on a complete preprocessed dataset with numeric features, and we study its accuracy changes on new unseen data with imputed missing features. The amount of missing data (i.e., features) varies between \(10\%\) and \(50\%\). Experiments are performed on ten real and two artificial datasets. The impact of imputation is measured as the classification accuracy change of the best performing of six commonly used classification models: logistic regression, multi-layer perceptron, k-NN, naive Bayes, extreme gradient boosted trees [7], and random forest. Besides accuracy, we also use root mean squared error (RMSE), which was also used in [6, 17, 35], as a measure of the quality of the imputation.

We compare the denoising autoencoder (DAE) [33], Generative Adversarial Imputation Network (GAIN) [35], and Variational Autoencoder with Arbitrary Conditioning (VAEAC) [17] with k-NN and MICE [4], which are considered to be successful traditional imputation methods. Moreover, we introduce the Wasserstein Generative Adversarial Imputation Network (WGAIN), a Wasserstein-based modification of GAIN, see [2]. WGAIN is a generative imputation model and generally outperforms the other presented models on the tested datasets. The Earth-Mover distance and the corresponding critic of the Wasserstein approach do not suffer from vanishing gradients in the way that a vanilla GAN would. This enables the model to capture the desired distribution better.

The paper is organized as follows. In Sect. 2, we briefly review related work in this field. In Sect. 3 the WGAIN model is introduced. Section 4 is devoted to the description of experiments performed, including the evaluation of their results. Finally, the paper is concluded in Sect. 5.

2 Related Work

There are many traditional imputation methods, such as [11, 24, 32]. Some of the most common and successful are k-nearest neighbors imputation (k-NN) [18] and multivariate imputation by chained equations (MICE) [29, 32].

Approaches based on deep learning have been under active development for the last few years. They use many variants of neural networks starting from multi-layer perceptron, e.g., in [3, 30]. A more advanced approach is based on the autoencoder as a specific kind of neural network aiming to reconstruct inputs on its outputs. Here, one of the most commonly used models is the denoising autoencoder (DAE) [33], e.g., [5, 8, 10, 15, 34]. Typically, they are used in a discriminative way (see [15] for difference), meaning they impute a single value, which is deterministic once the network is trained.

On the other hand, the most recent research focuses on generative models, which enable one to sample from the distribution conditioned on the observed features and thus get information about the uncertainty in imputed values. There are two groups of deep learning generative models. First, there are models based on the variational autoencoder (VAE) [19] and its conditional variants, see [25, 26, 31, 36]. In this group, some of the most successful imputation models are VAEAC [17] and HI-VAE [27].

The second group contains models based on the Generative Adversarial Network (GAN) [16]. Notably, one can encounter them in image reconstruction tasks (i.e., image inpainting), see [20, 22, 28]. One of the most prominent methods based on GAN is GAIN [35], which uses the generator-discriminator mechanism to learn the desired distribution. The generator observes some components of a real data vector, imputes the missing components conditioned on what is observed, and outputs a completed vector. The discriminator then takes a completed vector and attempts to determine which components were observed and which were imputed. GAIN forms the base for our modification of the imputation method based on Wasserstein GAN [2], which is introduced in the next section. Only recently, GAIN was outperformed by the previously mentioned VAEAC and HI-VAE. However, for numeric variables, HI-VAE achieves a comparable error to the rest of the methods [27]. Therefore, we have chosen only VAEAC for the experimental comparison.

3 Wasserstein Generative Adversarial Imputation Network

In this section, we introduce the WGAIN model, a variant of GAIN that adopts the critic-based approach of Wasserstein GAN in place of the discriminator.

Let us denote by \(\mathcal {X}= \mathbb {R}^d\) the d-dimensional numeric data domain and let \(\boldsymbol{X} = (X_1,\dotsc , X_d)\) be a random vector with values in \(\mathcal {X}\) whose distribution is denoted by \(\mathrm {P}(\boldsymbol{X})\). Let the mask be a random binary vector \(\boldsymbol{M}\), i.e., a random vector with values in \(\{0,1\}^d\). The mask indicates the unobserved values of \(\boldsymbol{X}\): the value 0 of its jth component means that the jth feature \(X_j\) is missing, and the value 1 means that \(X_j\) is not missing. The distribution of \(\boldsymbol{M}\) corresponds to the distribution of missingness in the data. Let us further denote by \(\tilde{\boldsymbol{X}}\) the vector \(\boldsymbol{X}\) having zeros in place of missing values given by

$$ \tilde{\boldsymbol{X}} = \boldsymbol{X} \odot \boldsymbol{M}, $$

where \(\odot \) denotes element-wise multiplication. Our aim is to impute missing values in \(\tilde{\boldsymbol{X}}\) based on information from non-missing features of \(\tilde{\boldsymbol{X}}\) and \(\boldsymbol{M}\). This is done in a generative way, meaning that we want to learn the conditional distribution \(\mathrm {P}(\boldsymbol{X} | \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{x}}, \boldsymbol{M} = \boldsymbol{m})\) of \(\boldsymbol{X}\) given \(\tilde{\boldsymbol{X}} = \tilde{\boldsymbol{x}}\) and \(\boldsymbol{M} = \boldsymbol{m}\). To do this, let \(\boldsymbol{Z}\) be a random vector with independent, identically distributed components having normal distribution \(\text {N}(0,\sigma ^2)\) with variance \(\sigma ^2\) and define

$$ \tilde{\boldsymbol{X}}_{\boldsymbol{Z}} = \boldsymbol{Z} \odot (1 - \boldsymbol{M}) + \boldsymbol{X} \odot \boldsymbol{M}, $$

i.e. \(\tilde{\boldsymbol{X}}_{\boldsymbol{Z}}\) is \(\tilde{\boldsymbol{X}}\) with missing components replaced by normal random variables.
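The two masking formulas above can be illustrated with a minimal sketch, assuming the data and the mask are stored as NumPy arrays of the same shape. The function name `mask_and_fill` and the example values are hypothetical and serve only as an illustration.

```python
import numpy as np

def mask_and_fill(x, m, sigma=0.01, rng=None):
    """Return (x_tilde, x_tilde_z) for data x and binary mask m (1 = observed)."""
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = x * m                              # zeros in place of missing values
    z = rng.normal(0.0, sigma, size=x.shape)     # i.i.d. N(0, sigma^2) noise
    x_tilde_z = z * (1 - m) + x * m              # noise only where m == 0
    return x_tilde, x_tilde_z

# Hypothetical toy data: two records with three features each.
x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
m = np.array([[1, 0, 1], [0, 1, 1]])
x_tilde, x_tilde_z = mask_and_fill(x, m, rng=np.random.default_rng(0))
```

Observed entries are passed through unchanged, while missing entries carry zeros in `x_tilde` and small Gaussian noise in `x_tilde_z`.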

Fig. 1. WGAIN structure and mini-batch data flow.

The WGAIN model consists of two parts, the generator g and the critic f, both represented by deep neural networks. The generator g is constructed as a mapping \(g: \mathcal {X}\times \{0,1\}^d \rightarrow \mathcal {X}\) so that

$$ \hat{\boldsymbol{X}}_{\boldsymbol{Z}} = g(\tilde{\boldsymbol{x}}_{\boldsymbol{Z}}, \boldsymbol{m}) \odot (1 - \boldsymbol{m}) + \tilde{\boldsymbol{x}} \odot \boldsymbol{m} $$

is a random vector whose conditional distribution \(\mathrm {P}(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}| \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{x}}, \boldsymbol{M} = \boldsymbol{m})\), determined by the distribution \(\mathrm {P}(\boldsymbol{Z})\) of \(\boldsymbol{Z},\) should be close to the conditional distribution \(\mathrm {P}(\boldsymbol{X} | \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{x}}, \boldsymbol{M} = \boldsymbol{m})\). Note that \(g(\tilde{\boldsymbol{x}}_{\boldsymbol{Z}}, \boldsymbol{m})\) is a random vector corresponding to \(\tilde{\boldsymbol{x}}\) with all missing components imputed.
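The way the generator output is combined with the observed values can be sketched as follows. The function `toy_generator` is a hypothetical stand-in for the neural network g (it simply predicts the mean of the observed entries of each record) and is not the architecture used in this paper.

```python
import numpy as np

def toy_generator(x_tilde_z, m):
    """Hypothetical stand-in for g: predicts the row mean of the observed entries."""
    observed_sum = (x_tilde_z * m).sum(axis=1, keepdims=True)
    observed_cnt = np.maximum(m.sum(axis=1, keepdims=True), 1)
    return np.broadcast_to(observed_sum / observed_cnt, x_tilde_z.shape)

def complete(g, x_tilde, x_tilde_z, m):
    """x_hat = g(x_tilde_z, m) * (1 - m) + x_tilde * m."""
    return g(x_tilde_z, m) * (1 - m) + x_tilde * m

x_tilde = np.array([[1.0, 0.0, 3.0]])
m = np.array([[1, 0, 1]])
x_tilde_z = np.array([[1.0, 0.007, 3.0]])    # noise in the missing slot
x_hat = complete(toy_generator, x_tilde, x_tilde_z, m)
```

Only the missing component is taken from the generator; the observed components are copied from \(\tilde{\boldsymbol{x}}\), so the generator cannot alter known values.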

In order to train the generator, we employ the standard squared loss function

$$ L_{\text {MSE}}(\hat{\boldsymbol{x}}_{\boldsymbol{z}}, \boldsymbol{x}) = \Vert \hat{\boldsymbol{x}}_{\boldsymbol{z}} - \boldsymbol{x}\Vert ^2, $$

forcing the output \(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}\) to be close to the original data \(\boldsymbol{X}\). However, it turns out that this condition alone is not sufficient for learning the proper conditional distribution. To improve the performance of the generator, one may introduce a discriminator trying to find out which components of \(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}\) were imputed and use the discriminator for adversarial training. This approach was introduced in [35] and is the base of WGAIN.

In this paper, we present a similar way to improve the conditional distribution of the generator’s output. It is based on the Earth-Mover (EM) distance between two probability distributions \(\mathrm {P}(X)\) and \(\mathrm {P}(Y)\), defined by

$$ W\big (\mathrm {P}(X), \mathrm {P}(Y)\big ) = \inf _{\gamma \in \mathbf {\Pi }(\mathrm {P}(X), \mathrm {P}(Y))} {{\,\mathrm{E}\,}}_{(X, Y) \sim \gamma } \Vert X - Y \Vert , $$

where \(\mathbf {\Pi }(\mathrm {P}(X), \mathrm {P}(Y))\) denotes the set of all joint distributions of \((X, Y)\) whose marginals are \(\mathrm {P}(X)\) and \(\mathrm {P}(Y)\), respectively. The term \({{\,\mathrm{E}\,}}_{(X, Y) \sim \gamma } \Vert X - Y \Vert \) might be understood as a measure of how much probability mass has to be transported in order to transform the distribution \(\mathrm {P}(X)\) into the distribution \(\mathrm {P}(Y)\) when the joint distribution is \(\gamma \). The EM distance can thus be seen as the cost of the optimal transport plan, see [2] and references therein for more details. The EM distance is usually expressed using the Kantorovich-Rubinstein duality as

$$\begin{aligned} W\big (\mathrm {P}(X), \mathrm {P}(Y)\big ) = \sup _{\Vert f \Vert _L \le 1} {{\,\mathrm{E}\,}}_{X \sim \mathrm {P}(X)} f(X) - {{\,\mathrm{E}\,}}_{Y \sim \mathrm {P}(Y)} f(Y), \end{aligned} \tag{1}$$

where \(\Vert f \Vert _L \le 1\) means that f is Lipschitz continuous with Lipschitz constant 1. This constant might be changed to any constant K, since doing so just multiplies \(W\big (\mathrm {P}(X), \mathrm {P}(Y)\big )\) by the same constant.
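Both forms of the EM distance can be checked on a small one-dimensional example. The sketch below assumes two empirical samples of equal size with uniform weights, in which case an optimal transport plan matches the sorted samples; the helper names are hypothetical.

```python
def em_distance_1d(xs, ys):
    """EM distance between two equal-size empirical 1-D samples (uniform weights)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_value(f, xs, ys):
    """E f(X) - E f(Y); a lower bound on the EM distance when f is 1-Lipschitz."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)

xs = [1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0]
w = em_distance_1d(xs, ys)                    # every point is moved by 1
bound = dual_value(lambda t: t, xs, ys)       # f(t) = t is 1-Lipschitz
weaker = dual_value(lambda t: 0.5 * t, xs, ys)
```

For this shifted sample the function f(t) = t attains the supremum in the dual form, whereas a flatter 1-Lipschitz function such as f(t) = 0.5t gives a strictly smaller value.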

In Wasserstein GAN one approximates (1) by training the neural network \(f_{\boldsymbol{w}}\) parametrized with weights \(\boldsymbol{w}\) in some compact space \(\mathcal {W}\), thus enforcing the Lipschitz continuity. The function \(f_{\boldsymbol{w}}\) is called the critic and is trained to maximize the expectations difference in (1). For a single dimensional generator g trying to transform random variable Z so that it has the distribution \(\mathrm {P}(X)\) one maximizes

$$ \max _{\boldsymbol{w} \in \mathcal {W}} {{\,\mathrm{E}\,}}_{X \sim \mathrm {P}(X)} f_{\boldsymbol{w}}(X) - {{\,\mathrm{E}\,}}_{Z \sim \mathrm {P}(Z)} f_{\boldsymbol{w}}(g(Z)). $$

In our case we want to minimize the EM distance between \(\mathrm {P}(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}| \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{x}}, \boldsymbol{M} = \boldsymbol{m})\) and \(\mathrm {P}(\boldsymbol{X} | \tilde{\boldsymbol{X}} = \tilde{\boldsymbol{x}}, \boldsymbol{M} = \boldsymbol{m})\). Hence, we take the mask \(\boldsymbol{M}\) as the second argument of the critic as additional information to the first argument given by \(\boldsymbol{X}\) with correct features behind the mask \(\boldsymbol{M}\). The critic is therefore a mapping \(f_{\boldsymbol{w}}: \mathcal {X}\times \{0,1\}^d \rightarrow \mathbb {R}\) trained to maximize

$$ \max _{\boldsymbol{w} \in \mathcal {W}} {{\,\mathrm{E}\,}}_{\boldsymbol{X} \sim \mathrm {P}(\boldsymbol{X})} f_{\boldsymbol{w}}(\boldsymbol{X}, \boldsymbol{M}) - {{\,\mathrm{E}\,}}_{\boldsymbol{Z} \sim \mathrm {P}(\boldsymbol{Z})} f_{\boldsymbol{w}}(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}, \boldsymbol{M}), $$

which is usually estimated by sample means from mini-batches. The overall structure of WGAIN is depicted in Fig. 1.

3.1 Training

The critic \(f_{\boldsymbol{w}}\) is used in adversarial training of both the generator g and the critic itself. Here, the generator and the critic play an iterative two-player minimax game in which the critic tries to distinguish the imputed values from the real ones, and the goal of the generator is to trick the critic so that it cannot distinguish them. Moreover, the generator’s output is tied to the correct output by the squared loss function \(L_{\text {MSE}}\).

Putting it all together, we have two objective functions to minimize. The first corresponds to the training of the critic and is given by

$$ J(f_{\boldsymbol{w}}) = \lambda _{f_{\boldsymbol{w}}} \Big ({{\,\mathrm{E}\,}}_{\boldsymbol{Z} \sim \mathrm {P}(\boldsymbol{Z})} f_{\boldsymbol{w}}(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}, \boldsymbol{M}) - {{\,\mathrm{E}\,}}_{\boldsymbol{X} \sim \mathrm {P}(\boldsymbol{X})} f_{\boldsymbol{w}}(\boldsymbol{X}, \boldsymbol{M})\Big ), $$

where the weight \(\lambda _{f_{\boldsymbol{w}}}\) enables one to increase or decrease the influence of the corresponding gradient. The second is the objective for the generator,

$$ J(g) = - \lambda _g {{\,\mathrm{E}\,}}_{\boldsymbol{Z} \sim \mathrm {P}(\boldsymbol{Z})} f_{\boldsymbol{w}}(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}, \boldsymbol{M}) + \lambda _{\text {MSE}} {{\,\mathrm{E}\,}}_{\boldsymbol{X} \sim \mathrm {P}(\boldsymbol{X}), \boldsymbol{Z} \sim \mathrm {P}(\boldsymbol{Z})} L_{\text {MSE}}(\hat{\boldsymbol{X}}_{\boldsymbol{Z}}, \boldsymbol{X}), $$

where \(\lambda _g\) and \(\lambda _{\text {MSE}}\) are weights enabling one to strengthen or weaken the influence of the adversarial term and of the squared loss function, respectively. The optimization is done via alternating gradient descent, where the first step updates the critic \(f_{\boldsymbol{w}}\) and the second step updates the generator g. Hence, when perfectly trained, the critic gives negative values to cases with imputed features and positive values to cases with true features. The generator, in turn, is pushed to produce imputations for which the critic returns large positive values, as it does for real data.
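The mini-batch estimates of the two objectives can be sketched as follows, with the expectations replaced by sample means. The linear critic and the toy batch are hypothetical and only make the example concrete; the default weights are the values reported in Sect. 4.1.

```python
import numpy as np

def critic_objective(f, x, x_hat, m, lam_f=10.0):
    """J(f_w) = lam_f * (mean f(x_hat, m) - mean f(x, m)); minimized by the critic."""
    return lam_f * (f(x_hat, m).mean() - f(x, m).mean())

def generator_objective(f, x, x_hat, m, lam_g=2.0, lam_mse=1.0):
    """J(g) = -lam_g * mean f(x_hat, m) + lam_mse * mean ||x_hat - x||^2."""
    mse = ((x_hat - x) ** 2).sum(axis=1).mean()
    return -lam_g * f(x_hat, m).mean() + lam_mse * mse

# Hypothetical linear critic that ignores the mask, used for illustration only.
w = np.array([1.0, -0.5])
critic = lambda x, m: x @ w

x = np.array([[1.0, 2.0], [3.0, 4.0]])
m = np.array([[1, 0], [0, 1]])
x_hat = np.array([[1.0, 1.0], [2.0, 4.0]])    # imputed entries differ from x
j_f = critic_objective(critic, x, x_hat, m)
j_g = generator_objective(critic, x, x_hat, m)
```

A negative value of `j_f` means that this critic already scores the real batch higher than the imputed one.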

The pseudo-code of the WGAIN training is given in Algorithm 1.

Algorithm 1. Pseudo-code of the WGAIN training.
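The alternating updates can be illustrated by the following simplified sketch. It is not the implementation used in the experiments: it assumes a linear generator and a linear critic with hand-derived gradients, plain gradient descent instead of RMSProp, and element-wise clipping of the critic weights to \([-w_{\max }, w_{\max }]\). All function and variable names are hypothetical.

```python
import numpy as np

def train_wgain_sketch(x, steps=200, batch=32, lr=1e-3, w_max=1.0,
                       lam_f=10.0, lam_g=2.0, lam_mse=1.0,
                       max_missing=0.3, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    A = np.zeros((2 * d, d))    # linear generator: g([x_z, m]) = [x_z, m] @ A + b
    b = np.zeros(d)
    w = np.zeros(2 * d)         # linear critic: f_w(x, m) = [x, m] @ w
    for _ in range(steps):
        xb = x[rng.integers(0, n, size=batch)]
        rate = rng.uniform(0.0, max_missing, size=(batch, 1))
        m = (rng.random((batch, d)) >= rate).astype(float)   # 0 = missing
        z = rng.normal(0.0, sigma, size=(batch, d))
        inp = np.hstack([z * (1 - m) + xb * m, m])
        x_hat = (inp @ A + b) * (1 - m) + xb * m

        # Critic step: gradient of J(f_w) with respect to w, then weight clipping.
        grad_w = lam_f * (np.hstack([x_hat, m]).mean(0) - np.hstack([xb, m]).mean(0))
        w = np.clip(w - lr * grad_w, -w_max, w_max)

        # Generator step: gradient of J(g), flowing through imputed entries only.
        d_xhat = (-lam_g * w[:d] + 2.0 * lam_mse * (x_hat - xb)) / batch
        d_out = d_xhat * (1 - m)
        A -= lr * inp.T @ d_out
        b -= lr * d_out.sum(0)
    return A, b, w

# Hypothetical data: 256 records with 4 numeric features in [0, 1).
data = np.random.default_rng(1).random((256, 4))
A, b, w = train_wgain_sketch(data)
```

With deep networks the gradients would be obtained by automatic differentiation, but the order of operations (sample a batch, mask it, update the critic, clip its weights, update the generator) stays the same.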

4 Experiments

An experimental validation of WGAIN using ten real and two artificial publicly available datasets is presented below. These datasets contain numeric data only and are devoted to the classification task. Their overview, together with the corresponding best performing classification models, is given in Table 2.

During the experiments, all datasets were divided as follows: \(70\%\) of data was used to train all classification and imputation models and \(30\%\) was used as a test set to evaluate imputation performance. The imputation models were trained to impute in scenarios where randomly selected combinations of multiple features are missing. The amount of missingness varies from \(10\%\) to \(50\%\) of missing features. Finally, evaluation of the accuracy of the classification model combined with all imputation methods is performed on the test dataset.

4.1 Imputation Models and Their Parameters

Let us start with the presented WGAIN model. The generator and the critic architectures were the same for all datasets and are described in Table 1. During the training, the following settings were used:

  • The original data \(\boldsymbol{X}\) are sampled in mini-batches of size \(m = 128\).

  • The missingness is introduced using the mask \(\boldsymbol{M}\) with the following distribution: for each training point, the proportion of missingness is sampled from a uniform distribution between 0 and the maximum missing rate, which was chosen to be 0.3. Then the binary elements of \(\boldsymbol{M}\) are independently sampled with this proportion of missingness, i.e., each element is 0 with the previously sampled probability.

  • The components of random vector \(\boldsymbol{Z}\) are i.i.d. with normal distribution having 0 mean and standard deviation 0.01.

  • The weights of the objective functions \(J(f_{\boldsymbol{w}})\) and J(g) are \(\lambda _{f_{\boldsymbol{w}}} = 10\), \(\lambda _g = 2\), and \(\lambda _{\text {MSE}} = 1\).

  • Maximal norm used in clipping of the critic weights is \(w_{\max } = 1\).

  • We use RMSProp with learning rate \(\alpha = 0.0001\) as the optimizer for both networks.

  • The number of training epochs is 8000.
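The mask distribution described in the list above can be sketched as follows, assuming NumPy; the function name `sample_mask` is hypothetical.

```python
import numpy as np

def sample_mask(n_rows, n_features, max_missing_rate=0.3, rng=None):
    """Sample a binary mask (0 = missing) with a per-record missing rate."""
    rng = np.random.default_rng() if rng is None else rng
    # One missing rate per training point, uniform on [0, max_missing_rate).
    rate = rng.uniform(0.0, max_missing_rate, size=(n_rows, 1))
    # Each element is 0 (missing) independently with the sampled probability.
    return (rng.random((n_rows, n_features)) >= rate).astype(int)

mask = sample_mask(10000, 20, rng=np.random.default_rng(0))
missing_fraction = 1.0 - mask.mean()
```

Since the missing rate is uniform between 0 and 0.3, the overall fraction of missing entries is close to 0.15 on average, while individual records range from nearly complete to roughly 30% missing.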

Table 1. Architecture details of the WGAIN. Abbreviation: FC = fully connected layer.

The GAIN implementation follows the original paper [35] and is analogous to the described WGAIN with the following differences:

  • The generator architecture differs only in the sizes of layers, which are all equal to the input dimension.

  • The discriminator architecture is analogous to the generator architecture except for the sigmoid activation function on the last layer.

  • The binary elements of \(\boldsymbol{M}\) are independently sampled with the common portion of missingness, which is 0.2.

  • The hint rate used for the hint matrix is 0.9.

  • As an optimizer, we use Adam with learning rate of 0.0001.

  • The number of training epochs is 7000.

In the case of DAE, we follow the structure presented in [15]. For the hyper-parameters search, the hyperband [21] algorithm was used. The typical best setup is the following: ELU as an activation function, three layers in both the encoder and decoder parts, the size of the code is twice the input dimension, and no regularization is used.

DAE, GAIN, and WGAIN models were implemented in the TensorFlow library.

The implementation of VAEAC was based on the repository corresponding to the original paper [17]. All hyper-parameters were left at their default settings.

For the MICE method (mice), we used the IterativeImputer class from the scikit-learn library. In the default settings, the implementation uses Bayesian ridge regression as the internal imputation model and multiple imputations are pooled by the mean.

The k-NN imputation (knn) was implemented using the fancyimpute library. A missing value is imputed by the mean of the values of its neighbors, weighted proportionally to their inverse distances. In the case where multiple features are missing, we impute all missing values at once (per row). For the hyper-parameter k, the values 11, 13, 15, 17, 19, 21, 23, and 25 were tested. The best k was chosen based on the RMSE value.
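The inverse-distance weighting can be illustrated with a small from-scratch sketch. It does not reproduce the fancyimpute implementation; it assumes complete training rows, Euclidean distance on the observed features, and a small constant `eps` to avoid division by zero. The function name and the toy data are hypothetical.

```python
import numpy as np

def knn_impute_row(row, m, train, k=3, eps=1e-12):
    """Impute entries of `row` where m == 0 from the k nearest complete training rows."""
    obs = m.astype(bool)
    # Euclidean distance computed on the observed features only.
    dist = np.sqrt(((train[:, obs] - row[obs]) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + eps)     # inverse-distance weights
    weights /= weights.sum()
    imputed = row.copy()
    # All missing features of the row are imputed at once.
    imputed[~obs] = weights @ train[nearest][:, ~obs]
    return imputed

train = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
row = np.array([1.0, np.nan])                 # second feature is missing
m = np.array([1, 0])
filled = knn_impute_row(row, m, train, k=2)
```

In this toy example the two nearest rows are equally distant, so the missing value is their plain average.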

4.2 Evaluation

The impact of imputation is evaluated using the classification accuracy changes of the best performing classification model chosen from six commonly used ones: logistic regression (LR), multi-layer perceptron (MLP), k-nearest neighbors (k-NN), naive Bayes (NB), extreme gradient boosted trees (XGBT) (for details see [7]), and random forest (RF). The best hyperparameters for each model were found using a randomized search. The accuracy of the best performing model for each dataset is shown in Table 2. Furthermore, the root mean squared error (RMSE) between the original and the imputed data is also used for evaluation, as in [6, 17, 35].
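A minimal sketch of the RMSE computation is given below. It assumes that the error is evaluated only over the entries that were missing (mask value 0); the text above does not specify this detail, so it should be read as one possible convention. The function name and the toy values are hypothetical.

```python
import numpy as np

def imputation_rmse(x_true, x_imputed, m):
    """RMSE over the entries that were missing (m == 0)."""
    missing = (m == 0)
    return float(np.sqrt(((x_true - x_imputed)[missing] ** 2).mean()))

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_imputed = np.array([[1.0, 4.0], [0.0, 4.0]])
m = np.array([[1, 0], [0, 1]])
rmse = imputation_rmse(x_true, x_imputed, m)  # errors of -2 and 3 on missing entries
```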

Table 2. Details of datasets with the corresponding best performing classification model and its accuracy on the test set. The number of features (# f.) does not include the target label. The # r. stands for the number of records.

After all classification models were trained, and the most accurate one for each dataset was chosen, they were combined with imputation methods. Then, the accuracies of classification models on the imputed test dataset were measured.

Since it is not sound to compare accuracies for different datasets, we use a rank comparison. To do so, the algorithms are ranked for each dataset separately, the best performing algorithm getting the rank of 1, the second-best rank 2, etc. An example of accuracies and corresponding ranks for 10% of missingness is presented in Tables 4 and 5. Even in cases when WGAIN is not the best, its performance is always comparable to the best performers. The only exception is the EEG dataset, where k-NN imputation performs the best and the WGAIN is in second place with a difference of almost two percent.
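The ranking procedure can be sketched as follows. The accuracies are made-up numbers used only for illustration, and ties are assumed to share the average rank, a convention that is not stated explicitly above.

```python
def rank_descending(scores):
    """Rank scores so that the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1             # average of the 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def mean_ranks(table):
    """table[dataset][method] -> mean rank of each method over the datasets."""
    per_dataset = [rank_descending(row) for row in table]
    return [sum(col) / len(col) for col in zip(*per_dataset)]

# Hypothetical accuracies: rows are datasets, columns are three imputation methods.
accuracies = [[0.90, 0.85, 0.80],
              [0.70, 0.75, 0.75],
              [0.60, 0.55, 0.50]]
ranks_second = rank_descending(accuracies[1])
averages = mean_ranks(accuracies)
```

For the RMSE, where lower is better, the same functions can be applied to the negated values.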

The algorithms can be compared, taking the mean over the datasets. The results can be seen in Table 9. When the degree of missingness varies from \(10\%\) to \(30\%\) the WGAIN performs the best. When the degree of missingness is upwards of \(30\%\) the GAIN outperforms the WGAIN.

Table 3. Mean ranks of the RMSE for different degrees of missingness.
Table 4. Mean of the accuracies for 10% of missing features.
Table 5. Ranks of accuracies of the imputation methods for 10% of missing features.
Table 6. Mean of the RMSE for 10% of missing features.
Table 7. Ranks of RMSE of the imputation methods for 10% of missing features.

The results of the ranking evaluation can be statistically evaluated using the Friedman test [13, 14] and the corresponding post-hoc tests. For more details, see [9]. P-values of the Friedman \(\chi ^2_F\) and \(F_F\) tests are shown in Table 8. One can see that from \(20\%\) to \(40\%\) of missing data, the null hypothesis that all methods perform the same can be rejected at a \(10\%\) significance level. However, when the Bonferroni-Dunn post-hoc test is applied, the performance of WGAIN is significantly better than that of DAE only, and just for \(20\%\) and \(30\%\) of missing data.
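The two Friedman statistics can be computed from a table of ranks as sketched below, using the standard formulas for N datasets and k methods: \(\chi ^2_F = \frac{12N}{k(k+1)}\big (\sum _j R_j^2 - \frac{k(k+1)^2}{4}\big )\) and \(F_F = \frac{(N-1)\chi ^2_F}{N(k-1) - \chi ^2_F}\), where \(R_j\) is the mean rank of the jth method. The rank table is made up for illustration, and the p-values, which require the \(\chi ^2\) and F distribution functions, are not computed here.

```python
def friedman_statistics(ranks):
    """Return (chi2_F, F_F) for a table ranks[dataset][method]."""
    n = len(ranks)                 # number of datasets N
    k = len(ranks[0])              # number of compared methods
    mean_rank = [sum(col) / n for col in zip(*ranks)]
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_rank) - k * (k + 1) ** 2 / 4.0
    )
    # F_F is undefined when all datasets agree perfectly (zero denominator).
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat

# Hypothetical ranks of three methods on four datasets.
ranks = [[1, 2, 3],
         [1, 2, 3],
         [1, 3, 2],
         [2, 1, 3]]
chi2_f, f_f = friedman_statistics(ranks)
```

Under the null hypothesis, \(\chi ^2_F\) follows a \(\chi ^2\) distribution with \(k-1\) degrees of freedom and \(F_F\) follows an F distribution with \(k-1\) and \((k-1)(N-1)\) degrees of freedom.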

Table 8. P-values of Friedman \(\chi ^2_F\) and \(F_F\) test.
Table 9. Mean ranks of the accuracy changes for different degrees of missingness.

The same ranking process is repeated for RMSE with results in Table 3. An example of RMSE and corresponding ranks for 10% of missingness is presented in Tables 6 and 7. Interestingly, the WGAIN performance is one of the worst, whereas the GAIN performs the best. This is in contrast to the fact that the WGAIN imputes the best from the accuracy point of view. Hence, we can see that low RMSE, which is usually taken as a measure of imputation quality, may not lead to the desired performance on the target task. On the other hand, the RMSE differences are relatively small, as can be seen in Table 6.

5 Conclusion

We propose a Wasserstein Generative Adversarial Imputation Network as a new deep learning imputation model. It is inspired by the GAIN. However, the discriminator is replaced by the Wasserstein critic. It is known that the Wasserstein approach does not suffer from vanishing gradients in the way that a vanilla GAN does. This enables the model to capture the desired distribution better. One may assume such benefits in WGAIN as well. We experimentally showed that in the imputation performance measured by classification accuracy, the WGAIN outperforms the other methods when the degree of missingness is lower than or equal to \(30\%\). In other cases, it is competitive. In future work, we would like to focus on the use of WGAIN in image inpainting tasks.