Background

Missing data is a pervasive problem in big data research, clinical trials and epidemiological studies [1]. Data can be missing for a number of reasons, such as non-response to questionnaires, study participants lost to follow-up, omission of data entry, failure of equipment, or incomplete or lost records [2, 3]. Simply excluding cases with missing data from the analysis may lead to biased inference, reduced statistical power and reduced generalisability of results [4, 5]. According to the missingness assumptions, missing data can be classified into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [6,7,8]. In general, the majority of the missing data in medical research are assumed to be MAR [9]. In contrast to MCAR, where there are no systematic differences between the missing and observed values, with MAR data there are systematic differences between the missing and observed values, but these differences can be explained by other observed data [10,11,12].

Multiple imputation by chained equations (MICE) is the most commonly used statistical procedure for handling missing data [5], particularly for data that are MAR [13]. MICE is widely available in statistical software packages, including SPSS, STATA and R. However, MICE may lead to biased results because, by default, it uses predictive mean matching (pmm) and logistic regression (LR), which are limited in their ability to handle non-linear relationships and interactions between variables [14]. One way to overcome non-linearity is random forest, an ensemble machine learning algorithm built from classification or regression decision trees [15]. Stekhoven et al. developed a random forest-based method, ‘missForest’, to impute missing values in mixed-type datasets [16]. Subsequent studies have shown that missForest outperformed MICE in both simulated and real-world datasets [17]. However, a drawback of missForest is its long computation time, which limits its practicality in big data research.

Generative adversarial network (GAN), an unsupervised algorithm, is a popular machine learning method that has been widely applied in both data generation [18] and image processing [19]. Generative adversarial imputation nets (GAIN), which is based on GAN, was recently developed and found to outperform other methods in imputation accuracy for MCAR data in five open-source datasets [20]. However, the accuracy of GAIN for imputing MAR data and mixed-type variables, both of which are common in medical research, remains unclear.

The main aim of this study was to evaluate the accuracy of GAIN in imputing missing values in real-world clinical datasets with mixed-type variables. Further, this study also aimed to examine the computation efficiency of GAIN as well as compare its performance with those of MICE and missForest. It is anticipated that the results will inform other researchers on the choice of missing data imputation methods in big clinical data research.

Methods

Study setting and datasets

Two large real world clinical datasets from two longitudinal cohort studies on primary care patients with chronic diseases were used. The first dataset was that of a study on the prediction of complications and mortality among a cohort of 141,516 patients with diabetes [21]. A total of 14 (out of 21) independent baseline variables had missing data, of which 12 variables had a missingness rate of less than 20%. Overall, the proportion of missing data ranged from 0.50% (systolic blood pressure) to 48.99% (urine albumin to creatinine ratio [Urine ACR]). Urine ACR showed the highest proportion of missing data as it was not routinely collected in Hong Kong primary care prior to 2010. We selected 50,000 subjects without any missing values for these 21 variables (15 continuous predictors and six categorical predictors) at baseline and seven dependent outcome variables measuring various complications of diabetes and mortality.

The second dataset was that of a cohort study evaluating the effectiveness of a risk assessment and management programme for patients with hypertension [22]. We identified 10 independent variables, including five continuous variables and five categorical variables, for inclusion in the analyses. In the original dataset, the data missingness rate for these 10 variables ranged from 1.5 to 26%. A total of 10,000 subjects without any missing values for these 10 variables were randomly selected. The data were extracted together with the data for the two outcome variables (cardiovascular diseases [CVD] and mortality) in order to replicate the imputation analyses and strengthen the generalizability of the results from the first dataset.

For ease of reference, the first dataset is referred to as the ‘DM-data’ and the second as the ‘HT-data’. The characteristics of the two datasets are described in Supplementary Tables 1 & 2.

The Institutional Review Board of the University of Hong Kong/Hospital Authority Hong Kong West Cluster (reference number: UW 15–258) approved this study and the use of the data. Individual informed consent was not required. All methods were carried out in accordance with the relevant guidelines and regulations.

Missing data simulation

For both DM-data and HT-data, data ‘missing at random’ (MAR) were simulated at two missingness rates (20 and 50%) to create the datasets for imputation testing [17, 23]. Missingness was introduced into the independent variables by Bernoulli sampling, with the probability of missingness based on a linear combination of the fully observed dependent variables. At each missingness rate, ten different incomplete datasets were generated using different randomised parameters for the linear combination. We did not simulate missing values in the dependent variables, although they were incorporated in the imputation process as auxiliary variables [24].
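The mechanism described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the logistic link, the bisection step used to calibrate the intercept to the target rate, the toy outcome data and all names (`simulate_mar_flags`, `outcomes`, `weights`) are assumptions made for illustration.

```python
import math
import random


def simulate_mar_flags(outcomes, weights, target_rate, rng):
    """Return one 0/1 flag per subject (1 = value set to missing).

    outcomes: rows of fully observed dependent variables.
    weights:  randomised coefficients of the linear combination (assumed).
    """
    # Linear combination of the fully observed outcomes for each subject.
    scores = [sum(w * v for w, v in zip(weights, row)) for row in outcomes]

    def mean_prob(intercept):
        # Average missingness probability under a logistic link (assumed).
        return sum(1 / (1 + math.exp(-(intercept + s))) for s in scores) / len(scores)

    # Bisection on the intercept so that the average missingness
    # probability matches the target rate (assumed calibration step).
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if mean_prob(mid) < target_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2

    # One Bernoulli draw per subject with its own missingness probability.
    probs = [1 / (1 + math.exp(-(intercept + s))) for s in scores]
    return [1 if rng.random() < p else 0 for p in probs]


rng = random.Random(0)
# Toy data: two binary outcomes per subject (hypothetical).
outcomes = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(5000)]
weights = [rng.uniform(-1, 1) for _ in range(2)]
flags = simulate_mar_flags(outcomes, weights, target_rate=0.2, rng=rng)
observed_rate = sum(flags) / len(flags)
```

Because the missingness probability depends only on fully observed outcomes, the resulting pattern is MAR rather than MCAR; drawing a new `weights` vector for each run mimics generating several incomplete datasets at the same rate.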

Imputation procedures with GAIN

Several modifications were applied to the basic GAIN architecture of Yoon et al. [20] to optimize the model. First, the random noise was replaced by the mean value of each variable so as to reach the optimal solution faster. Second, batch normalization with a gradient descent optimizer was used to allow a larger learning rate. Third, the losses for continuous and categorical variables were combined with separate weights (α and β) to deal with datasets containing mixed types of variables. A greedy search strategy was adopted to find the best combination of hyper-parameters, because a large number of hyper-parameters have to be tuned in the GAIN training process, including k, \( p_{hint} \), α, β, the number of iterations, the number of hidden layers, the number of neurons in each layer, the activation functions, the learning rate and the optimizer. The code is available on GitHub (https://github.com/dongdongdongdwn/GAIN-Dovey) and the optimal hyper-parameters are presented in Supplementary Table 3. The imputation procedure with GAIN is outlined in Algorithm 1.

Algorithm 1 Brief imputation procedure with GAIN
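Two of the modifications described in this section, mean substitution in place of random noise and a weighted loss for mixed-type variables, can be sketched in plain Python. This is an illustrative reading of the text, not the code from the authors' repository: the function names, the use of mean squared error and binary cross-entropy, and the convention that a mask value of 1 marks an observed cell are all assumptions.

```python
import math


def fill_with_means(rows, mask):
    """Replace missing cells (mask == 0) with the column mean of observed cells."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r, m in zip(rows, mask) if m[j] == 1]
        means.append(sum(observed) / len(observed))
    return [[r[j] if m[j] == 1 else means[j] for j in range(n_cols)]
            for r, m in zip(rows, mask)]


def mixed_reconstruction_loss(x, x_hat, mask, is_categorical, alpha, beta):
    """alpha * MSE over observed continuous cells plus
    beta * binary cross-entropy over observed categorical cells (assumed form)."""
    eps = 1e-8
    sq_err, n_cont, cross_ent, n_cat = 0.0, 0, 0.0, 0
    for row, row_hat, m in zip(x, x_hat, mask):
        for j, (v, v_hat) in enumerate(zip(row, row_hat)):
            if m[j] == 0:
                continue  # only observed cells contribute to this loss
            if is_categorical[j]:
                p = min(max(v_hat, eps), 1 - eps)  # clip to avoid log(0)
                cross_ent -= v * math.log(p) + (1 - v) * math.log(1 - p)
                n_cat += 1
            else:
                sq_err += (v - v_hat) ** 2
                n_cont += 1
    mse = sq_err / n_cont if n_cont else 0.0
    bce = cross_ent / n_cat if n_cat else 0.0
    return alpha * mse + beta * bce


# Toy example: column 0 is continuous, column 1 is a binary category.
rows = [[1.0, 1], [3.0, 0], [0.0, 1]]   # the 0.0 in the last row is a placeholder
mask = [[1, 1], [1, 1], [0, 1]]         # 1 = observed, 0 = missing
filled = fill_with_means(rows, mask)

x_hat = [[1.0, 0.5], [3.0, 0.5], [9.9, 0.5]]  # hypothetical generator output
loss = mixed_reconstruction_loss(rows, x_hat, mask, [False, True], alpha=1.0, beta=2.0)
```

Keeping α and β as separate weights lets the two error scales be balanced during tuning, which is why they appear among the hyper-parameters searched.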

MICE and missForest

Imputation by MICE and missForest was carried out using standard procedures [16, 24] with the R packages mice (v3.6.0) and missForest. For MICE, the default imputation models were used: predictive mean matching (pmm) for continuous variables and logistic regression (LR) for categorical variables. The number of iterations was set to 10. For missForest, the number of trees was set to 20, and the number of variables randomly sampled at each split was set to \( \sqrt{d} \), where d is the dimensionality of the data. The maximum number of iterations of missForest was set to 10. The iteration numbers of MICE and missForest were chosen on the basis of preliminary experiments to ensure that each method could achieve its best performance (Supplementary Fig. 1).

Outcome measures and data analysis

Accuracy was measured by imputation error, defined as the difference between the imputed values and real values. It was assessed by normalized root mean square error (NRMSE) for continuous variables and proportion of falsely classified (PFC) subjects for categorical variables. NRMSE and PFC were defined as follows:

\( \mathrm{NRMSE}=\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left(\hat{x_i}-{x}_i\right)}^2}}{\frac{1}{N}\sum_{i=1}^{N}{x}_i} \)

\( \mathrm{PFC}=1-\frac{N_{correct}}{N} \)

where \( \hat{x_i} \) is the imputed value and \( x_i \) is the original value of a continuous variable, N is the number of imputed values, and \( N_{correct} \) is the number of correctly classified values of a categorical variable.
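As a worked illustration of the two definitions, the following Python sketch computes NRMSE and PFC on small made-up vectors. The function names and the toy values are hypothetical and are not taken from the study data.

```python
import math


def nrmse(imputed, original):
    """RMSE of the imputed values divided by the mean of the original values."""
    n = len(original)
    rmse = math.sqrt(sum((xh - x) ** 2 for xh, x in zip(imputed, original)) / n)
    return rmse / (sum(original) / n)


def pfc(imputed, original):
    """Proportion of falsely classified values: 1 - N_correct / N."""
    n_correct = sum(1 for xh, x in zip(imputed, original) if xh == x)
    return 1 - n_correct / len(original)


# Toy values for the cells that were set to missing (hypothetical).
sbp_true = [120.0, 130.0, 140.0, 150.0]
sbp_imputed = [122.0, 128.0, 143.0, 150.0]
smoker_true = ["no", "no", "yes", "no"]
smoker_imputed = ["no", "no", "no", "no"]

sbp_nrmse = nrmse(sbp_imputed, sbp_true)
smoker_pfc = pfc(smoker_imputed, smoker_true)  # one of four values is wrong: 0.25
```

Dividing by the mean makes NRMSE unit-free, so errors can be compared across continuous variables measured on different scales; the toy categorical example also shows that always imputing the majority class can give a low PFC while never recovering the minority class.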

For each simulated incomplete dataset, the imputation was repeated 100 times using each method. The mean NRMSE for each continuous variable was calculated by averaging the NRMSE obtained from the 100 imputations. Likewise, the mean PFC for each categorical variable was calculated by averaging the PFC obtained from the 100 imputations. NRMSE and PFC were treated as continuous variables in the comparative analysis, and their distributions were tested with the Shapiro-Wilk normality test. Depending on the result, the differences in mean NRMSE or PFC among methods were tested by one-way ANOVA or a non-parametric test.

Density plots and bar plots were used to illustrate the imputation differences among methods for representative continuous variables and categorical variables, respectively. For the DM-data, systolic blood pressure (SBP), fasting glucose, hypertension history and smoking status were selected to represent normally distributed continuous variables, skewed continuous variables, balanced categorical variables and imbalanced categorical variables, respectively. Likewise, age, total cholesterol to high-density lipoprotein (TC/HDL) ratio, sex and lipid lowering drugs usage were selected as the representative variables for the HT-data.

For DM-data with 5000 to 50,000 subjects, the computation time of each method to complete an imputation process on a personal computer (PC) and high performance computing (HPC) device was recorded and plotted for comparison. The relevant machine configuration of the PC and HPC can be found in Supplementary Table 3.
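A timing comparison of this kind can be sketched with a small Python harness. The harness is generic and is not the authors' benchmarking code: `mean_impute` is a stand-in imputer used only to exercise the timer, and the intermediate sample size of 10,000 and the toy data are assumptions.

```python
import time


def time_imputation(impute, data, sample_sizes, repeats=1):
    """Return {sample_size: best wall-clock seconds} for one imputation run."""
    timings = {}
    for n in sample_sizes:
        subset = data[:n]
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            impute(subset)
            best = min(best, time.perf_counter() - start)
        timings[n] = best
    return timings


def mean_impute(rows):
    """Stand-in imputer: fill None cells with the column mean of observed cells."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


# Toy dataset with every fifth value missing in the second column.
data = [[float(i), None if i % 5 == 0 else float(i % 7)] for i in range(50000)]
timings = time_imputation(mean_impute, data, sample_sizes=[5000, 10000, 50000])
```

Plotting `timings` against sample size for each method, once per machine, would give curves of the kind compared in the study; any real comparison would replace `mean_impute` with calls to the actual imputation routines.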

Missing data simulation, MICE, missForest and the comparative analyses were performed in R 3.5.1. GAIN was developed with Python 3.5. The level of significance for all statistical tests was set at 0.05.

Results

Experiments on DM-data

Table 1 presents the imputation errors (NRMSE and PFC for continuous and categorical variables, respectively) of different imputation methods at missingness rates of 20 and 50%. Overall, GAIN and missForest were superior to MICE for both continuous and categorical variables, irrespective of the missingness rates (p < 0.001). When the missingness rate was 20%, GAIN was superior to missForest with lower imputation errors (p < 0.05) for highly skewed (skewness > 4) continuous variables (e.g., creatinine, fasting glucose, urine ACR) and highly imbalanced categorical variables (proportion of the minority class close to or lower than 10%, e.g., lipid lowering drug usage, DM treatment). MissForest showed better accuracy for some normally distributed continuous variables (e.g., age, SBP, DBP) and some relatively balanced categorical variables (e.g., sex, hypertension history) (p < 0.05). GAIN and missForest showed similar accuracy for the remaining variables (p > 0.05). However, GAIN was superior to missForest for the majority of variables when the missingness rate increased to 50% (p < 0.05). At this missingness rate, no statistically significant differences were observed between GAIN and missForest for the less skewed continuous variables (e.g., age, SBP, LDL-C) and relatively balanced categorical variables (e.g., sex, hypertension history).

Table 1 Imputation errors of different methods in DM-data

Experiments on HT-data

The imputation errors in the HT-data of different methods are presented in Table 2. The findings were similar to those found in the DM-data. Overall, GAIN and missForest outperformed MICE for both missingness rates (20 and 50%) irrespective of the type of variables. When the missingness rate was 20%, GAIN was superior to missForest for more skewed continuous variables (e.g., SBP, TC/HDL-C ratio, hospital admission times) and more imbalanced categorical variables (e.g., smoking, hypertensive drugs, lipid lowering drugs). If the missingness rate increased to 50%, GAIN was more accurate than missForest for the majority of the variables (p < 0.05).

Table 2 Imputation errors of different methods in HT-data

To illustrate the differences in imputation errors among methods, density plots and bar plots were used to visualize the representative variables at the 50% missingness rate. Density plots, showing the distribution of the absolute difference between imputed values and real values of continuous variables, are presented in Fig. 1. The absolute differences between real values and values generated by GAIN were closer to 0 and more concentrated, indicating good accuracy. MICE tended to have a broader distribution of errors and a higher density of greater errors. The differences in the patterns among different methods were more noticeable on data that were skewed (e.g. fasting glucose, TC/HDL ratio).

Fig. 1 Density plots displaying the distribution of the absolute difference between imputed values and true values on continuous variables by different methods (missingness rate = 50%). (Note: a and b are representative continuous variables in DM-data, c and d are representative continuous variables in HT-data)

The bar plots illustrate the distribution of imputed values and the correct proportion in each category (Fig. 2). The imputed values of MICE and GAIN showed the same distribution as the original data, while missForest generated a higher proportion of the majority group but a lower proportion of the minority group. Meanwhile, for both balanced (i.e. sex, hypertension history) and imbalanced categorical variables (i.e. smoking, lipid lowering drugs usage), GAIN imputation resulted in a more accurate allocation to the minority group when compared to the other two methods.

Fig. 2 Bar plots displaying the distribution of imputed allocation of categorical variables by different methods (missingness rate = 50%). (Note: a and b are representative categorical variables in DM-data, c and d are representative categorical variables in HT-data; shaded areas indicate the proportion correctly imputed in each category by each method)

Computation time

The computation time of one imputation process on the DM-data by each method, using PC and HPC for different sample sizes, is presented in Fig. 3. MICE was the fastest for small sample sizes (up to 30,000 subjects) and GAIN was the fastest for the larger sample (50,000 subjects). MissForest showed much longer computation times for all sample sizes compared to the other two methods. The computation time of missForest increased exponentially with increasing sample size.

Fig. 3 Computation time of one imputation process by each method on DM-data. a Computation time on PC; b Computation time on HPC

Discussion

Missing data are inevitable in medical research, and it is important that appropriate methods are used to handle them in order to make full use of the data and obtain unbiased inference. This study introduced a novel imputation method, GAIN, and demonstrated that its imputation accuracy and efficiency exceeded those of two commonly used methods (MICE and missForest). The major strength of this study was the use of two large real-world clinical datasets with mixed-type variables. To the best of our knowledge, this was the first study to evaluate the application of GAIN for the imputation of missing clinical data with mixed-type variables.

Overall, GAIN showed imputation accuracy similar to that of missForest when the missingness rate was relatively low (20%) but performed better than missForest when the missingness rate was higher (50%). GAIN also had better accuracy for imputing skewed continuous variables and imbalanced categorical variables. Furthermore, the imputation time of GAIN increased only slightly with increasing sample size, making it the most efficient method for performing big data analytics on a sample size of more than 30,000.

Imputation performance and data characteristics

These findings matched those observed in an earlier study where GAIN outperformed other imputation methods on a cancer dataset in which all variables were continuous [20]. It is important to recognise that the imputation of mixed-type variables is challenging but essential for clinical research [10]. Our results provide preliminary evidence that GAIN is a suitable method for the imputation of missing clinical data with mixed-type variables, particularly those with highly skewed and imbalanced data.

The results of this study also showed that, although MICE is commonly used, there is still room for improvement [14, 15]. As can be seen from the density plots, the default setting of MICE (pmm) replicated some observed extreme values in order to reproduce the distribution of the observed data. However, these extreme values might be far from the real values and lead to inaccuracy. In contrast, missForest and GAIN, through machine learning, were more “moderate” and produced credible values closer to the mean level of the observed data, yielding more accurate imputation results.

Imputation performance and missingness rate

It is recognized that data with a higher proportion of missing values are likely to further increase inference bias. There is no consensus on the maximum missing data rate that would allow for substitution by imputation, since it is determined by various factors, including the missingness assumption, the use of auxiliary variables, data quality and the imputation method [25]. In medical research and clinical trials, the rule of thumb for an acceptable missingness rate is 20% or less [26, 27], but much higher rates are commonly observed in real practice. For example, as shown in the two large real-world clinical datasets in this study, the data missingness rates of some variables were nearly 50%. In order to explore how the imputation accuracy would be affected by the data missingness rate, we evaluated the three methods on simulated data with missingness rates of 20 and 50%. It was found that GAIN was more resistant to the effects of a higher missingness rate. This is because the imputation power of GAIN depends not only on the observed values but also on the feedback from the discriminator. GAIN therefore has the potential to accept a higher threshold of data missingness rate and maximize the use of research data.

Computation time

In addition to the measures on accuracy, this study also recorded the computation time as a performance indicator. Computation time cannot be neglected, especially with the large datasets in many cohort studies. GAIN stands out in its efficiency by virtue of its unique mechanism in which the number of parameters is relatively independent of the sample size.

Multiple imputation (MI) is recommended to account for the uncertainty that single imputation ignores. However, it increases the computation time. In general, if MI is adopted, the number of imputations (m) is at least 5, with some researchers using 10 or more [8]. MissForest would take approximately 8 days (200 h) of PC computation time to impute the missing data for a sample size of 50,000 with 10 imputations. The computation time would also lengthen exponentially as the sample size increases. The utilization of HPC and parallel processing may save some time but may not be feasible in many settings.
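The 200 h figure can be checked with simple arithmetic. The sketch below assumes that the m imputations are run sequentially and that each takes the same time; the single-imputation time of 20 h is back-calculated from the 200 h total rather than read from Fig. 3, and the function name is hypothetical.

```python
def multiple_imputation_hours(hours_per_imputation, m):
    """Total sequential computation time for m imputations."""
    return hours_per_imputation * m


total_hours = 200.0   # missForest, 50,000 subjects, m = 10 (from the text)
m = 10
hours_per_imputation = total_hours / m   # implied time for a single imputation
total_days = total_hours / 24            # roughly 8 days

# With the minimum commonly recommended number of imputations (m = 5):
hours_for_five = multiple_imputation_hours(hours_per_imputation, 5)
```

Because the total scales linearly with m, any method whose single-imputation time is already measured in hours quickly becomes impractical once MI is required.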

Further implication for practice

There is no single best procedure for handling missing data in medical research. The choice of method depends on the missingness assumption as well as on the auxiliary variables that could explain why the data are missing [28]. For example, complete case analysis might be preferable to MI in some situations [20]. Based on our findings, we suggest taking into consideration the missingness rate, the distribution of the variables and the expected computation time when choosing an imputation method. In addition, the use of more than one imputation method and sensitivity analysis could improve the reliability of the results.

Limitations

This study had a number of limitations. First and foremost, this study focused only on imputation accuracy and not on the validity of post-imputation statistical inference under the different imputation methods. The goal of missing data imputation is to obtain statistically valid inferences from incomplete data rather than to re-create the true data. Van Buuren has pointed out that imputation is not prediction, and the method that best recovers the true data might be nonsensical or contain severe flaws [8]. Further studies should be conducted to evaluate these imputation methods with respect to post-imputation statistical inferences. Second, a missingness rate of more than 50% was not simulated in this study, as some researchers have suggested that a missingness rate of more than 50% is not acceptable for clinical studies [25]. Third, the variables included in this study were cross-sectional data, hence the results may not be generalizable to missing data problems in longitudinal studies with repeated observations.

Conclusion

Overall, when compared to MICE and missForest, GAIN showed better accuracy in the imputation of missing data in large real-world clinical datasets, particularly for imbalanced and skewed data, and when the missingness rate was high (50%). GAIN also showed outstanding computation speed in handling large samples (greater than 30,000 subjects) and holds potential as an accurate and efficient method for missing data imputation in future big data clinical research.