Background

Logistic regression is a widely used method in epidemiological and medical studies for developing prognostic models. Although researchers try to avoid it, missing data occur in many study designs. The most recommended method for handling incomplete data is Multiple Imputation (MI) [1]; this method is nowadays integrated in almost all standard statistical software packages and is therefore within reach of many researchers. MI generates multiple imputed datasets, after which a complete-data analysis can be applied in each imputed dataset. Finally, the parameter estimates can be combined using Rubin’s Rules (RR) [2]. If a dataset contains missing values, it is recommended to apply MI before excluding variables from the pooled model in logistic regression with backward selection (BWS) [3,4,5,6]. RR can be used to calculate pooled coefficients and standard errors (SEs). This is straightforward for continuous and dichotomous variables, where a single Wald statistic is used to calculate a P-value to determine significance [7], but complex for categorical variables with more than two levels. For variable selection with categorical variables, several methods have been developed [1, 7]:

  • 1) Method D1 (multivariate Wald test): this test pools the within- and between-imputation covariance matrices across imputed datasets. The total parameter covariance matrix of the multivariate Wald test is then corrected to account for the missing data [1, 8].

  • 2) Method D2 (pooling chi-square statistics): this test pools the chi-square values from the multiple-parameter Wald or likelihood ratio tests across imputed datasets [9].

  • 3) Method D3 (combining likelihood ratio statistics): this test is based on pooled likelihood ratio statistics and requires fitting multiple models across imputed datasets for each categorical variable in the data; it can therefore be very time-consuming [10, 11].
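For continuous and dichotomous variables, the RR pooling step can be illustrated with a minimal sketch (shown here in Python for brevity; the inputs are hypothetical, not from the study): the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the between-imputation variance, inflated by 1/m, to the within-imputation variance.

```python
from statistics import mean

def pool_rubins_rules(estimates, variances):
    """Pool m per-imputation coefficients and their squared SEs with Rubin's Rules."""
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    u_bar = mean(variances)                      # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                  # total variance
    return q_bar, t ** 0.5                       # pooled coefficient and pooled SE

# Hypothetical coefficients and squared SEs from m = 5 imputed datasets
coef, se = pool_rubins_rules([0.50, 0.55, 0.45, 0.52, 0.48],
                             [0.04, 0.05, 0.04, 0.05, 0.04])
```

The pooled coefficient divided by the pooled SE then yields the single Wald statistic mentioned above.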
A disadvantage is that none of these methods is integrated in standard statistical software packages for variable selection when developing prognostic models. Moreover, these methods do not always yield optimal P-values, so selection of the correct variables is not guaranteed [10]. Researchers therefore sometimes opt for single imputation, which often leads to an incorrect selection of variables [12].

Van de Wiel et al. introduced the median of the P-values (MPR) as a method to compare prognostic models [13]. Eekhout et al. compared the MPR with the pooling methods D1, D2 and D3 in MI-datasets in a simulation study. They found that the MPR was an attractive rule for statistical inference on categorical variables with more than two levels, because it had power equal to the D1-, D2- and D3-methods but was much easier to apply in any software package [1].

The MPR may therefore be a potentially attractive method for variable selection in models that include categorical variables. Until now, these methods have never been compared for deriving a prognostic model with logistic regression after MI.

Therefore, the aim of this study is to evaluate four different pooling methods for variable selection in Multiply Imputed datasets for categorical, dichotomous, and continuous variables in logistic regression with a BWS-procedure. The selection frequency of the variables, the P-values of the pooled selection results and the stability of the models in Multiply Imputed datasets will be compared with the results from the BWS-procedure in the complete dataset (without missing data). All analyses will be repeated in a real-world dataset.

Method

To evaluate the results of the four different pooling methods in Multiply Imputed datasets after a BWS-procedure, we conducted a simulation study and repeated the procedures in a real-world dataset (NHANES).

Simulation datasets

  • 1) To generate simulated datasets, we used as input parameters the mean values and the standard deviations (SD) of the variables of an empirical dataset of low back pain patients [14]. See Table 1. A total of 9 variables were drawn from a multivariate normal distribution using the mean and SD as input parameters.

  • 2) A set of 9 variables, including categorical, dichotomous, and continuous variables (normally distributed) was generated.

  • 3) In the simulated datasets, categorical and dichotomous variables were initially generated as continuous variables to determine their level of correlation and were subsequently categorized (into 4 categories) or dichotomized using cutoff values from the empirical dataset. For the categorical variables ‘Cat1’ and ‘Cat2’ the cutoff values (as cumulative proportions) were [0, 0.6, 0.8, 0.9, 1] and [0, 0.3, 0.6, 0.8, 1]. For both dichotomous variables the median was used as the cutoff value.

  • 4) The outcome measure was obtained by first calculating the linear predictor score, multiplying the coefficients from the empirical dataset (Table 1) by the predictor values. A logistic regression model was then used to convert these scores into probabilities, and a draw from a uniform distribution converted each probability into a binary outcome: when the uniform draw was lower than the probability, the outcome was 1, and 0 otherwise.

  • 5) Four different datasets were simulated. Two sets with 200 observations, one with a correlation degree of 0.2 and the other with a degree of 0.6 and similarly two sets with 500 observations.

  • 6) One of the regression coefficients in the model had an effect size of zero to mimic the behavior of a noise variable during variable selection (‘Noise’).
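The outcome-generation step (step 4 above) can be sketched as follows. This is an illustrative Python snippet with hypothetical linear predictor values, not the simulation code used in the study:

```python
import math
import random

rng = random.Random(42)  # fixed seed for a reproducible illustration

def simulate_outcome(linear_predictor):
    """Convert a linear predictor score into a binary outcome via the logistic model."""
    p = 1.0 / (1.0 + math.exp(-linear_predictor))  # inverse logit: probability of outcome 1
    return 1 if rng.random() < p else 0            # Bernoulli draw using a uniform variate

# Hypothetical linear predictor scores for three subjects
outcomes = [simulate_outcome(lp) for lp in [-2.0, 0.0, 2.0]]
```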

Table 1 Means and variances used for the simulated dataset

Generating Missing data

Multiple Imputation is indicated under the missing at random (MAR) mechanism, which means that the probability of missingness can be explained by observed data [11]. In each of the four simulated datasets, 20 percent missing data were created in both categorical variables and in two continuous variables (‘Noise’ and ‘Cont4’). These missing data were created according to the MAR mechanism: the probability of missing data in the two categorical and two continuous variables was related to other variables in the dataset. For this, data were made missing in each variable as a function of another variable (e.g., X1 =  X2 * 0.4 and X1 =   X2 * 0.167), to create a realistic data situation. We used the functions ‘defMiss’ (to define the missing data matrix with the formula option) and ‘genMiss’ in the R package “simstudy”. The formula values were chosen such that 20% of the data in each variable was missing and around 50% of the cases in each simulated sample were incomplete.
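The MAR mechanism described above can be mimicked with a small sketch (illustrative Python, not the ‘simstudy’ code used in the study; the coefficient and driver values are hypothetical): the probability that a value goes missing is a function of another, fully observed variable.

```python
import random

rng = random.Random(1)  # fixed seed for a reproducible illustration

def make_mar_missing(values, driver, coef):
    """Set values to None with probability coef * driver, i.e. missingness in one
    variable depends on another, fully observed variable (a MAR mechanism)."""
    out = []
    for v, d in zip(values, driver):
        p_miss = max(0.0, min(1.0, coef * d))  # clamp the probability to [0, 1]
        out.append(None if rng.random() < p_miss else v)
    return out

# Hypothetical example: driver values of 0.5 with coef 0.4 give ~20% missingness
complete = [float(i) for i in range(100)]
with_missing = make_mar_missing(complete, [0.5] * 100, 0.4)
```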

Imputation method

MI was performed in each simulated dataset generating 5 imputed datasets using Multivariate Imputation by Chained Equations (MICE) including the outcome in the imputation model [15, 16].

Pooling methods

Four different pooling methods were used:

  • 1) The pooled sampling variance method (D1), which combines the pooled parameter estimates and the pooled sampling variances of the imputed datasets to construct a test that resembles a multivariate Wald test [8, 17].

  • 2) Multiple parameter Wald test (D2) which pools the chi-square values from the multiple parameter Wald or likelihood ratio tests [9].

  • 3) Meng and Rubin pooling method (D3) which pools likelihood ratio tests [10].

  • 4) The median-P-rule (MPR) which uses the median P-value of the significance tests conducted in each imputed dataset. Hence, it depends on P-values only and not on the parameter estimates [13].
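The MPR is trivial to implement in any software package; a minimal sketch (Python, with hypothetical P-values) makes this concrete:

```python
from statistics import median

def median_p_rule(p_values):
    """Pool significance tests across imputed datasets by taking the median P-value."""
    return median(p_values)

# Hypothetical P-values for one categorical variable from m = 5 imputed datasets
pooled_p = median_p_rule([0.02, 0.04, 0.03, 0.10, 0.01])
```

Unlike D1, D2 and D3, no parameter estimates or covariance matrices are needed; the per-imputation P-values suffice.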

Statistics and analyses

Logistic regression analyses were performed in all original complete simulated datasets and in all multiply imputed datasets. The coefficients, SEs, P-values, and all developed prognostic models were compared. The BWS-procedures were performed using different P-out selection values (between 1.0 and 0.05) to develop both parsimonious prediction models containing the strongest prognostic variables and larger prediction models containing strong and less strong prognostic variables.

The entire procedure was repeated 500 times and the results were compared with those of the BWS-procedure in the complete dataset (no missing data). All statistical procedures were performed in R.

Comparing pooling methods

When comparing the pooling methods for variable selection, the focus was on three points:

The selection frequency of the variables

The selection frequency of the variables for each method was compared with that in the complete datasets, i.e., without missing values, which served as the reference model. The frequency was obtained by dividing the number of times a variable was selected in a model by the total number of simulated models within that run. Subsequently, we evaluated which pooling method showed selection frequencies most similar to those in the complete dataset.
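In code, the selection frequency reduces to a simple count over the models fitted within one run (an illustrative Python sketch; the variable names are hypothetical):

```python
def selection_frequency(selected_models, variable):
    """Fraction of fitted models (one per simulated sample) that retained `variable`."""
    return sum(variable in model for model in selected_models) / len(selected_models)

# Hypothetical selected models from four simulation runs
runs = [{"Cat1", "Cont4"}, {"Cat1"}, {"Cont4", "Dich1"}, {"Cat1", "Dich1"}]
freq = selection_frequency(runs, "Cat1")  # retained in 3 of the 4 models
```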

The P-values of the selected variables

To compare the P-values, all P-values were first naturally log-transformed to be able to make even the smallest differences in P-values graphically visible. The median of these P-values of the four pooling methods was compared with the median of the P-values in the complete dataset.

The stability of the selected prognostic models

Model stability was evaluated by calculating model selection frequencies, quantifying how many times a particular set of prognostic variables was selected [4, 18]. The first 10 unique prognostic models were evaluated for all pooling methods and for the models selected in the complete dataset. For example: in one simulation run, 500 initial models were fitted by applying BWS, and from these 500 models, 10 unique models were selected in the complete dataset. With method D1, 375 models were identical to those 10 unique models in the complete dataset, i.e., 375/500 = 75 percent of the same unique models. With method D2, 350 models (70 percent) matched; with D3, 340 (68 percent); and with the MPR, 450 (90 percent). In this example, the MPR was the most stable pooling method. This stability analysis was executed for all BWS-procedures under all conditions.
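The comparison in the example above can be sketched as follows (illustrative Python; models are represented as tuples of variable names, and all names and counts are hypothetical):

```python
from collections import Counter

def agreement_with_reference(reference_models, candidate_models, top_k=10):
    """Share of candidate models that match one of the top_k unique reference models."""
    top = {m for m, _ in Counter(reference_models).most_common(top_k)}
    return sum(m in top for m in candidate_models) / len(candidate_models)

# Hypothetical: models selected in the complete data vs. models from one pooling method
reference = [("Cat1", "Cont4")] * 8 + [("Cat1",)] * 2
candidate = [("Cat1", "Cont4")] * 7 + [("Dich1",)] * 3
stability = agreement_with_reference(reference, candidate)  # 7 of 10 models match
```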

Analyses in the NHANES-dataset

The NHANES-dataset was used as a real-world data set to evaluate the performance of the same four pooling methods (D1, D2, D3 and MPR). Twelve variables were analyzed of which 6 continuous, 2 dichotomous and 4 categorical variables, with a dichotomous variable as outcome measure. MI was performed, generating 5 imputed datasets (m = 5), using Multivariate Imputation by Chained Equations (MICE) and including the outcome in the imputation model [15, 16]. A BWS-procedure was conducted with P-out < 0.05 in all pooling methods. The selected variables and developed models were compared.

Software

To generate complete simulation data we used the R package “simstudy”, that was also used to generate the missing data (functions ‘defMiss’ and ‘genMiss’). Backward selection in the complete data was conducted with the package “rms” (function fastbw), imputations were done using the “mice” package and pooling with the “psfmi” package.

Results

The selection frequency of the variables

Figure 1 shows that, in general, the MPR method performed better than the other selection methods as the P-out values became stricter. With a P-out ≤ 0.3, the MPR selected the same variables as in the complete dataset in 41.6 percent of the simulated samples; the D1, D2 and D3 methods did so in only 7.1, 3.5 and 7.1 percent of the cases, respectively. Using a stricter P-out (0.1), the MPR method selected the same variables as in the complete dataset in 46.8 percent of the cases, and with a P-out of 0.05 in 53.7 percent of the cases. Using less strict P-out values (1.0 and 0.5), all pooling methods performed similarly, in 60.5 percent and 51 percent of the cases, respectively.

Fig. 1
figure 1

The selection frequency of the variables using different P-out-values compared to the complete dataset

Table 2 shows, for all pooling methods, the selection frequencies (percentages) of variables after BWS in Multiply Imputed datasets compared to the complete dataset. The categorical variables selected with the MPR-method were closest to those in the complete dataset. For the continuous and dichotomous variables, none of the four pooling methods showed a better selection frequency compared to the complete dataset. The MPR was therefore the best pooling method to select categorical prognostic variables. The noise variable had a very low selection frequency for all methods: it was selected in less than 12 percent of the cases using a P-out ≤ 0.1 and in 6 percent of the models using a P-out ≤ 0.05.

Table 2 Selection frequencies (percentages) of variables after backward selection in Multiply Imputed datasets using four different pooling methods and in the complete dataset

The P-values of the selected variables

To compare the P-values of the selected variables with those in the complete datasets, the median of the log-transformed P-values was used. Table 3 shows that for the continuous and dichotomous variables, the pooling methods showed inconsistent results: sometimes the P-values were closer to those obtained in the complete dataset (e.g., the variable ‘Dich1’ in the dataset with N = 200, correlation degree 0.2 and P-out ≤ 0.1) and sometimes not (e.g., the variable ‘Dich1’ in the dataset with N = 200, correlation degree 0.6 and P-out ≤ 0.05). Overall, the P-values for categorical variables obtained with the MPR-method were consistently closer to those obtained in the complete dataset, regardless of the sample size, the degree of correlation or the chosen P-out.

Table 3 P-values of the pooled variables after log-transformation and calculation of the median

Figure 2 shows the percentages of agreement between the P-values of the variables selected by the different pooling methods and those in the complete dataset. The MPR-method agreed most with the complete dataset for categorical variables, with scores of 100 percent agreement. Across the different P-out levels, the differences between the pooling methods for categorical variables were consistently in favor of the MPR method. For the dichotomous variables, the MPR never performed worse than the other pooling methods. For the continuous variables, the D1-method also performed reasonably well and outperformed the MPR at stricter P-out values.

Fig. 2
figure 2

Percentages agreement between the P-values of the selected variables by the different pooling methods and the complete dataset.

The stability of the selected prognostic models

Table 4 shows how the selected models, after applying the different pooling methods, were related to the selected models in the complete dataset. The MPR-method provided models that were more similar to the models in the complete dataset than the other pooling methods. Especially in smaller datasets (n = 200), the MPR-method selected more frequently the same prognostic models. The analyses with a p-out ≤ 1.0 are not shown because these are not of added value for the development of prediction models. See Table 4 and Fig. 3.

Table 4 Comparing selected prognostic models to the developed models in the complete dataset
Fig. 3
figure 3

Model selection frequencies of the first ten unique prognostic models from the four pooling methods, quantifying how likely these models were selected compared to the models from the complete dataset

Selected variables in the prognostic models in the NHANES-dataset

Table 5 shows that in all imputed NHANES-datasets, the four pooling methods showed a strong similarity in the variables selected. Mostly the same variables were selected, with comparable P-values. This was in agreement with the findings of the simulation study.

Table 5 Variables selected in the NHANES-dataset by the four pooling methods

Discussion

Main findings

When defining a prognostic model after MI with different types of variables, including categorical variables, it is important to use an overall test to decide whether a categorical variable is relevant for the model. In the present study, Multiply Imputed simulated datasets were used and four selection methods (D1, D2, D3 and MPR) for categorical, dichotomous, and continuous variables in a logistic regression model with a BWS-procedure were evaluated. The selection frequency of the variables, the P-values of the selected variables and the stability of the selected models were compared with those in the complete dataset. The performance of the MPR was tested under many different conditions and variations, and the MPR proved to be an easy-to-apply method that was consistently better than the other pooling methods for categorical variables in an MI-context, in terms of selection frequency as well as P-values and model stability. For continuous and dichotomous variables, no consistent differences were found between the four pooling methods.

Comparison with the literature

Eekhout et al. concluded that to obtain correct and powerful pooled P-values for significance testing of categorical variables with the MPR, compared to the D1, D2 and D3 methods, the outcome should be omitted from the imputation model [1]. To obtain a powerful significance test for continuous and dichotomous variables with RR after MI, the MI procedure has to include the outcome variable, as was indicated by Moons et al. [19]. We repeated our simulation study in datasets with a sample size of n = 500 and n = 2000 and an extra categorical variable with five categories under two different conditions: one included the outcome variable in the imputation model, and one excluded the outcome. We observed no differences in the median-P-values of the selected predictor variables or the stability of the selected models. Only the selection frequency of the predictor variables was slightly higher when the outcome was included in the imputation model, but this was identical for all pooling methods. The larger the datasets, the smaller the differences between the four pooling methods. We therefore conclude that for overall significance testing of categorical variables the outcome variable can be included in the imputation model.

Heinze et al. and Wallisch et al. stated that variable selection can compromise the stability of a final model, an often-ignored problem of data-driven variable selection [4, 18]. Royston and Sauerbrei also stated that model stability has to be demonstrated, because many different factors influence the stability of the selected models [20, 21]. In our simulation study, we examined the stability of the selected models in Multiply Imputed datasets by repeating each procedure 500 times. An interesting result is that the MPR pooling method resulted in more stable variable selection than the other pooling methods. This result was also reflected in the analyses in the NHANES real-world dataset. Austin et al. and Wood et al. stated that variable selection in Multiply Imputed datasets has to be done from the pooled model using RR, which is easily done for continuous and dichotomous variables but less straightforward for categorical variables [22, 23]. We distinguished the selection of all types of variables and showed that the MPR-method performed equally well as RR for the continuous and dichotomous variables and better than the D1-, D2- and D3-methods for the categorical variables. The usability of the pooling methods depends on their availability in statistical software. Most software packages do not provide these methods in combination with variable selection, so they are out of reach for applied researchers. The strength of the MPR is that it can easily be applied in any software package and is not time-consuming.

Strengths and limitations

Our aim was to compare four different selection methods. A strength is that we applied two different ways of pooling and selecting variables: 1. Rubin’s Rules (RR) were applied to pool the continuous and dichotomous variables and the pooling methods D1, D2, D3 and MPR for categorical variables. 2. All variables were pooled by the D1, D2, D3 and MPR method.

No differences were found between these two ways of pooling and selecting variables; in both, the MPR outperformed all other methods. Another strength is that we used various P-out values to evaluate the behavior of the pooling methods when the selected models contained variables with a stronger or weaker relationship with the outcome, as found in normal practice. We found that in most scenarios the MPR method resulted in the most stable models.

A further strength is that, in addition to the studies performed by White and Austin et al., we evaluated many different simulated conditions based on empirical data. We evaluated the selection frequency of the variables, the P-values of the selected variables and the stability of the selected models [22, 23]. We also added a noise variable to evaluate whether all methods coped well with it. In most of these conditions, the MPR-method was not worse than the other methods. A limitation could be that the simulation study used a smaller number of covariates than is common in practical datasets. However, the NHANES-dataset contained a mixture of weaker and stronger variables, as in real-world datasets, and the results in the NHANES-dataset confirmed those of the simulation study.

Another limitation could be that we used only two different correlation levels in our simulation sets (0.2 and 0.6). However, to set up our simulation study we initially used the paper of Wood, White and Royston [23] on variable selection methods in multiply imputed datasets, which came closest to the aim of our study. They reported a correlation of 0.62 and defined that as a high correlation value. We therefore used a high correlation of 0.6 in our study and compared it with a lower correlation of 0.2. We think that by using these correlation values we were well able to test the methods in datasets that are commonly seen in medical studies, containing variables with comparably lower and higher correlations. Another limitation may be that we used a fast backward selection procedure to select variables in the complete datasets [24]. It is known that this may not be the most efficient selection method [24, 25]. An alternative may be to use more advanced methods such as the least absolute shrinkage and selection operator (LASSO) [25]. However, the LASSO was developed for situations where the number of predictors is much higher than the number of persons, which is not the case in many medical and epidemiological datasets. Another problem with LASSO estimation is its dependence on the scale of the covariates. A solution is to apply internal standardization to unit variance in LASSO software before variable selection, after which the regression coefficients are back-transformed to the original scale. It is, however, not yet clear whether a “one size fits all” standardization of variables is the best choice for all modeling purposes. Therefore, using the fast backward selection procedure was the best option to compare the pooled selection methods with a similar selection procedure in the complete datasets [4].
Another limitation could be that we considered all continuous variables as normally distributed, while in practice non-linear relationships also occur; further research into the selection of these types of variables in multiply imputed datasets is therefore necessary.

Conclusion

Evaluating four pooling methods (D1, D2, D3 and MPR) for variable selection in multiply imputed datasets for categorical, dichotomous, and continuous variables in logistic regression analyses with a BWS-procedure, the MPR-pooling method performed consistently better than the other methods for selecting categorical variables in smaller datasets (N ≤ 500 participants). The variable selection frequencies, their P-values, and the selection frequencies and stability of the prognostic models were more similar to the analyses in the complete datasets when using the MPR-method. For continuous and dichotomous variables, none of the four pooling methods clearly outperformed the others. In large datasets there were almost no differences between the four pooling methods. These results were confirmed by the analyses in a real-world dataset (NHANES). Considering that the MPR is the simplest pooling method for epidemiologists and applied researchers to use, we cautiously recommend using the MPR-method to pool categorical variables with more than two levels after MI in combination with BWS-procedures. Because the MPR never performed worse than the other methods for continuous and dichotomous variables, we also advise using the MPR for these types of variables.