1 Introduction

Squared terms are often included in real-life data models to accommodate some form of nonlinearity. When the analysis model contains the partially observed covariates and corresponding squared terms, some challenges arise:

  1. 1.

    The analysis and imputation models should accommodate squared terms, i.e., the squares themselves should be considered in the imputation procedure of corresponding linear terms.

  2. 2.

    The relation between the square term and its lower-order polynomial should be preserved.

  3. 3.

    The analysis model parameter estimates should be unbiased.

To obtain unbiased estimates, one could impute the squared term as if it were another variable. We will refer to this as Transform, then Impute (TTI; Von Hippel 2009). However, this approach distorts the relationship between the original variable and its square. A straightforward process to preserve the quadratic relation during imputation is calculating the squared term only after imputation (Impute, then Transform, ITT). However, ITT biases estimates of regression coefficients, as its contribution during imputation is ignored (Von Hippel 2009; Vink and van Buuren 2013). Moreover, both these partial fixes only work when the missingness is completely random.

To solve these issues for a more general class of missingness mechanisms, Vink and van Buuren (2013) propose to impute the combination of the original variable and its square and decompose it into distinct roots. This polynomial combination approach (PC) is built around predictive mean matching, a nonparametric imputation hot-deck technique that does not assume a specific distribution for the data (Rubin 1987; Little 1988). Seaman et al. (2012) demonstrated that predictive mean matching gives biased estimation when the analysis is a linear regression with a quadratic term and the missingness mechanism is missing at random. However, PC yields unbiased estimates for MCAR and MAR missingness mechanisms by applying a reasonable donor selection procedure on the polynomial combination instead.

More recently, Bartlett et al. (2015) proposed a substantive model compatible approach (SMC-FCS), which generalizes the imputation of nonlinear covariates beyond the squared term model. The SMC-FCS technique is efficient but needs the correct data analysis model during imputation to obtain draws of values that conform to this model. It yields unbiased estimates if 1) the substantive model is correctly specified and 2) the normality assumption of missing variables with quadratic effects is tenable because of a restriction of the software (the package smcfcs (Bartlett et al. 2021) in R (R Core Team 2021)). When the missing variable with quadratic effects does not follow the normal distribution, one could apply an appropriate transformation to make the normality assumption plausible (e.g., log-normal distribution). More interestingly, Bartlett et al. (2015) suggest that the SMC-FCS estimates can yield unbiased inference, meaning that estimates are both unbiased and properly covered cf. Neyman (1934). Such an investigation into coverage of multiply imputed parameters was not part of the study by Vink and van Buuren (2013).

We now have two techniques that seem promising in imputing squared terms: the polynomial combination method and SMC-FCS. Both approaches have appealing properties, preserve the relationship between the square and its base, and yield unbiased estimates. However, both techniques differ fundamentally in their approach. SMC-FCS is a strictly model-based technique that requires the correct specification of the complete-data model and the substantive model. On the other hand, PC is a hot-deck technique with a data-driven estimation procedure that only requires the specification of the polynomial combination. We highlighted the most used properties and promising methods in Table 1 (Von Hippel 2009; Vink and van Buuren 2013; Bartlett et al. 2015).

Table 1 Summary of properties of four squared term imputation methods

The interpretation of Table 1, taking the SMC-FCS approach as an example, should be as follows:

  1. 1.

    SMC-FCS yields unbiased regression estimates \(\beta \) provided that the missing mechanism is MAR;

  2. 2.

    SMC-FCS does preserve the quadratic relationship between the original variable and its square;

  3. 3.

    SMC-FCS produces correct coverage rates of corresponding regression estimates \(\beta \);

  4. 4.

    SMC-FCS is somewhat robust against the violation of normality assumption of the covariates with quadratic effects;

  5. 5.

    SMC-FCS is a parametric imputation approach, which means SMC-FCS requires an explicit specified imputation model.

The “ - ” sign in a cell indicates that the method cannot produce unbiased estimates. Whether the PC method has a correct coverage rate is not thoroughly studied and will be investigated in the following section. There are four blank cells left because the TTI and ITT methods cannot produce unbiased regression estimates or preserve the quadratic relationships; it is redundant to investigate the violation of normality and model specification for them.

In this manuscript, we evaluate the performance of imputing squared terms with SMC-FCS and the PC method to investigate these techniques’ strengths and limitations in different scenarios. In the next section, we briefly discuss the SMC-FCS and PC methodology and propose a minor adjustment to the PC method.

2 Polynomial combination

In this section we detail both the original polynomial combination (OPC) approach proposed by Vink and van Buuren (2013) and a modification that is robust against perfect prediction issues. We refer to the modified polynomial combination approach as MPC.

2.1 Original polynomial combination

Suppose the model of scientific interest is

$$\begin{aligned} Y = \alpha + X\beta _{1} + X^2\beta _{2} +\epsilon \end{aligned}$$
(1)

with \(\epsilon \sim N(0, \sigma ^2)\). We assume that Y is complete and that \(X = (X_{obs}, X_{mis})\) is partially missing.

The original polynomial combination method first performs predictive mean matching (PMM; Little 1988) on the combined variable \(Z = X\beta _{1} + X^2\beta _{2}\), and then decomposes Z into components X and \(X^2\). Under the model in Eq. (1), two roots of variable X are:

$$\begin{aligned} X_{-}= & {} -\frac{1}{2\beta _{2}}(\sqrt{4\beta _{2}Z + \beta _{1}^2} + \beta _{1})\nonumber \\ X_{+}= & {} \frac{1}{2\beta _{2}}(\sqrt{4\beta _{2}Z + \beta _{1}^2} - \beta _{1}) \end{aligned}$$
(2)

where the discriminant \(4\beta _{2}Z + \beta _{1}^2\) should be larger than 0. For any imputed Z, we select either \(X = X_{-}\) or \(X = X_{+}\) and square it to derive the square term \(X^2\).

The choice between the roots \(X_{-}\) and \(X_{+}\)is made by random sampling, conditional on Y, Z, and their interaction YZ. The binary random variable V is defined as 0 if \(X < X_{min}\) and 1 if \(X > X_{min}\), where the minimum of the parabola \(X_{min} = -\beta _{1}/2\beta _{2}\). We model the probability \(P(V = 1)\) by logistic regression as

$$\begin{aligned} \text {logit} P(V = 1) = Y\beta _{\text {Y}} + Z\beta _{\text {Z}} + YZ\beta _{\text {YZ}} \end{aligned}$$
(3)

on the observed data, where \(\text {logit} P(V = 1) = \text {log}(P(V = 1) / P(V = 0))\) is the logistic function. Under the assumption of ignorability, we apply the same model to calculate the predicted probability \(P(V = 1)\) for \(Z_{mis}\), where \(Z_{mis}\) denotes the polynomial combination of \(X_{mis}\) and \(X_{mis}^{2}\). Finally, a random draw from the binomial distribution is made (\(V = 0\) or 1), and the corresponding (negative or positive) root is selected as the imputation.

2.2 Modification of polynomial combination

Since we estimate binary variables V in the OPC imputation procedure, it is necessary to avoid bias due to perfect prediction. When imputers apply the original polynomial combination method, perfect prediction occurs when all the observed binary variables \(V_{obs}\) are equal to one (or zero). In this case, the likelihood tends to a limit as one or some regression coefficients tend to infinity, which leads to seriously implausible imputations of the binary variable V (White et al. 2010).

Suppose all observed X are located on the parabolic function’s right arm, then the perfect prediction arises. If no corrections are performed, the coefficients of the logistic function \(\text {logit}P(V = 1) = Y\beta _{Y} + Z\beta _{Z} + YZ\beta _{YZ}\) will have extremely wide and flat posterior distributions, which tends to derive extremely positive or negative estimates of coefficients. Provided all observed X are located on the right arm of the parabolic function, some missing values of X would be addressed incorrectly on the left arm, as shown in Fig. 1a.

A computationally convenient approach to avoid perfect prediction is data augmentation (van Buuren 2018, section 3.6.2). We augment the data with a few extra observations and add a small weight to these observations (White et al. 2010, section 5.2). To improve the polynomial combination method, we calculate any unobserved dichotomous outcomes (whether to take the positive or negative distinct real root for \(X_{mis}\)) \(V_{mis}\) by logistic regression of V given Y, Z, and YZ (i.e., Eq. 3) with the augmented data instead of the observed data. More specifically, based on the observed V, Y and Z, the augmented data adds eight subjects shown in Table 2, with the weight 3/8, to the observed data. When the population estimation of the probability \(P(V = 1)\) equals one (or zero), we expect the modified polynomial combination method would provide more plausible imputations, as shown in Fig. 1b.

Table 2 Augmented data

2.3 SMC-FCS

The substantive model compatible fully conditional specification (SMC-FCS) is a parametric imputation method proposed by Bartlett et al. (2015). In general, the missing predictor is imputed based on other predictors. A rejection sampling (e.g., Metropolis-Hastings algorithm) is used where the acceptance ratio is generated based on the likelihood of the substantive model. Suppose \(\phi \) is a vector containing the coefficients of the model f(Y|X) and \(\theta _{i}, i = 1, \dots , p\) is a vector containing the coefficients of the model \(f(X_{i}|X_{-i})\), where \(X_{-i}\) are all the other covariates excluding \(X_{i}\). The parametric density function of the partially observed variable \(X_{i}\) is proportional to \(f(Y|X, \phi )f(X_{i}|X_{-i}, \theta _{i})\), rooted in the Bayesian rule:

$$\begin{aligned} f(X_{i}|X_{-i}, Y)= & {} \frac{f(X_{i}, X_{-i}, Y)}{f(Y, X_{-i})}\nonumber \\\propto & {} f(Y|X_{i}, X_{-i})f(X_{i}|X_{-i}). \end{aligned}$$
(4)

Since the density generally does not follow a standard parametric family, the rejection sampling is necessary to draw coefficients from the posterior distributions of \(\phi \) and \(\theta _{i}\). With the assumption of independent priors \(f(\phi )\) and \(f(\theta _{i})\), the posterior distributions of \(\phi \) and \(\theta _{i}\) would be:

$$\begin{aligned} \phi&\sim f(Y|X_{i}, X_{-i}, \phi )f(\phi )\nonumber \\ \theta _{i}&\sim f(X_{i}|X_{-i}, \theta _{i})f(\theta _{i}). \end{aligned}$$
(5)

The statistical property of this approach is that if the substantive model f(Y|X) is correctly specified, the imputation model will be congenial to the analysis model (Meng 1994). The lack of congeniality can sometimes produce implausible imputations that result in biased inferences in the downstream analysis (Robins and Wang 2000).

Fig. 1
figure 1

Imputations (triangles) generated by OPC and MPC. We see that in a some imputations fall outside of the range of the observed (circle) and unobserved values (square), due to the OPC algorithm assigning the donor values to the incorrect distinct real root. In b the MPC approach assigns the imputations to the distinct real root that corresponds to the observed and unobserved data

3 Evaluation

We evaluated the average biases across all simulations, the coverage of nominal 95% confidence intervals, and the average width of corresponding confidence intervals of the regression weights \(\beta _{1}\) and \(\beta _{2}\).

3.1 Simulation setup

The outcome Y was simulated according to the scientific model:

$$\begin{aligned} Y = \alpha + X\beta _{1} + X^2\beta _{2} +\epsilon \end{aligned}$$
(6)

with \(\alpha = 0\), \(\beta _{1} = 1\), \(\beta _{2} = 1\) and \(\epsilon \sim N(0, \sigma _{\epsilon }^2)\). The value of \(\sigma _{\epsilon }\) varied according to different distributions of X so that the coefficient of determination \(R^2\) was always equal to 0.75.

The predictor X was generated from a normal, a skew-normal, or a normal mixture distribution. The mean of X was either 0 or 1, and the variance was 1 for all three distributions. The abscissa at the parabolic minimum was \(X = -1/2\). Hence, when the location of X was 0, there was a strong U-shaped association between Y and X. If X had location 2, the relationship between Y and X would be somewhat linear. For the skew-normal distribution, we set the slant parameter to be 6 when the mean of X equaled 0 and \(-\)3 when the mean of X equaled 2. For the normal mixture distribution, X was drawn from \(N(-0.875, 0.234)\) and N(0.875, 0.234) with equal probability to have mean 0 and N(1.125, 0.234) and N(2.875, 0.234) with equal probability to have mean 2.

We generated a sample of size \(n = 100\) and repeated 1000 simulations for each missingness scenario. For each simulation scenario, 30 percent missingness was induced jointly in X and \(X^2\) for five missingness mechanisms: MCAR, MARleft, MARmid, MARtail, and MARright. Specifically, MCAR denotes that the probability of X being missing is the same for all cases. While with a left-tailed (MARleft), centered (MARmid), both tailed (MARtail) or right-tailed (MARright) missingness mechanism, a higher probability of X being missing is assigned to the units with low, centered, extreme and high values of Y respectively. Let R be the response indicator for X, where R equals 0 if X is missing and 1 otherwise. For MARleft, the missingness probability is defined as \(\text {logit} P(R = 0) = -X + \bar{x} + \gamma _{l}\), where \(\gamma _{l}\) was chosen to make the probability of missing X equal to 0.3. Similarly, the missingness probability is defined as \(\text {logit} P(R = 0) = -|X - \bar{x}| + \gamma _{m}\), \(\text {logit} P(R = 0) = |X - \bar{x}| + \gamma _{t}\) and \(\text {logit} P(R = 0) = X - \bar{x} + \gamma _{r}\) for MARmid, MARtail and MARright, where \(\gamma _{m}\), \(\gamma _{r}\) and \(\gamma _{t}\) were chosen to make corresponding probabilities of missing X equal to 0.3 ((van Buuren 2018), section 3.2.4). All missingness was generated with the ampute function (Schouten et al. 2018) from the package MICE (van Buuren and Groothuis-Oudshoorn 2011) in R. The mice.impute.quadratic function in the package MICE was modified by including data augmentation.

3.2 Simulation results

We compared five approaches: TTI, ITT, OPC, MPC and SMC-FCS and focused on some remarkable findings of OPC, MPC and SMC-FCS. The results of the TTI and ITT simulations reiterated the corresponding conclusions in Table 1. In general, TTI did not preserve the quadratic relation, even though it gave unbiased and confidence-valid estimates in some cases (e.g., with MCAR and standard normal distribution X). Furthermore, ITT had considerable bias under nearly all combinations of missingness mechanisms and distributions of X.

Table 3 shows the average biases, the coverage of the nominal 95% confidence intervals, and the average width of confidence intervals for \(\beta _1\) and \(\beta _2\) when E(X) equals 0. The outcome Y follows a U-shape. With MCAR, MARleft and MARmid and when X is distributed as normal, skewed normal or a mixture of two normals, OPC and MPC gave unbiased estimates and correct CI coverage. The CI coverage of SMC-FCS was close to 95%. However, with X skew-normal distributed MCAR and MARmid, SMC-FCS was slightly biased. With MARtail and MARright, SMC-FCS outperformed OPC and MPC when X followed a normal distribution or a normal mixture distribution. OPC and MPC had slight bias and somewhat reduced CI coverage (approximately 85%) with X distributed according to a normal, a skewed normal or a mixture of two normals. SMC-FCS was unbiased and had CI coverage close to 95% with normal and mixture normal. However, with skewed normal X, SMC-FCS was somewhat biased and the CI had slightly lower than nominal coverage.

Table 3 Simulation results for five missingness mechanisms when imputing a squared term regression where the mean of X equals 0. Shown are absolute bias of the estimate, coverage of the 95% confidence interval for the estimate and the average confidence interval width
Table 4 Simulation results for five missingness mechanisms when imputing a squared term regression where the mean of X equals 2. Shown are absolute bias of the estimate, coverage of the 95% confidence interval for the estimate and the average confidence interval width

Table 4 demonstrates the mean biases of \(\beta _{1}\) and \(\beta _{2}\), the empirical coverage and the mean width of the corresponding 95% CIs where X is location 2 and scale 1. Almost all observed values of Y are on the right arm of the quadratic function. With normal X, SMC-FCS consistently yielded confidence-valid estimates because of the congeniality of the analysis model and imputation model. However, with MAR (MARleft, MARmid, MARtail and MARright), SMC-FCS gave a slightly biased estimate for \(\beta _{1}\). OPC and MPC gave unbiased results and the CI had approximately 95% coverage under MCAR and MARmid. With MARleft, OPC and MPC were slightly biased for \(\beta _{1}\), but the 95% CI for \(\beta _{1}\) and \(\beta _{2}\) had the correct coverage. With MARtail and MARright, OPC and MPC were unbiased. The CI of MPC (around 90%) had higher nominal coverage than OPC (around 80%). With X distributed according to a skewed normal, OPC and MPC yielded unbiased estimates under MCAR, MARmid, MARtail and MARright but slightly biased estimates under MARleft. The CI coverage of OPC and MPC was close to 95% under MCAR, MARleft and MARmid. Like the case with normal X, the CI of MPC (around 90%) had better coverage than OPC (around 85%). With X distributed as a mixture of two normals, SMC-FCS gave biased results under all missingness mechanisms and its CI had somewhat reduced coverage under MARright. OPC and MPC were unbiased under MCAR, MARmid and MARtail but biased under MARleft and MARright. MPC had CI coverage close to 95% under all missingness mechanisms. The CI from OPC had correct coverage under MCAR, MARleft and MARmid, but approximately 85% coverage under MARtail and MARright.

Fig. 2
figure 2

The plot of means and confidence intervals of \(\beta _{1}\) and \(\beta _{2}\) from the first 100 simulations. The model of interest is \(Y = X + X^2 + \epsilon \), where \(X \sim \frac{1}{2}N(1.125, 0.234) + \frac{1}{2}N(2.875, 0.234)\). The missingness mechanism is MARright and the imputation approach is SMC-FCS. The imputations seem to primarily overestimate the true parameter estimate of \(\beta _{1}\) in (a) and underestimate the true parameter estimate of \(\beta _{2}\) in (b)

We investigated if the biases of SMC-FCS were caused by Monte Carlo error. Figures 2a, b demonstrate that the bias is due to simulation error. The estimates for \(\beta _{1}\) and \(\beta _{2}\) show primarily overestimation and underestimation, respectively. This implies that, when applying SMC-FCS, the explicit specification of the distribution of the incomplete variable with the quadratic effect may need careful consideration.

4 Conclusion

We evaluate the performance of four imputation approaches for incomplete data problems where the model of scientific interest contains squared terms. We improve the performance of the polynomial combination method by incorporating a data augmentation step, thus realizing more plausible imputations when the missingness covariate relates almost exclusively to one arm of the quadratic curve.

In our simulation studies, although ITT preserves the quadratic relations, it gives biased estimates under almost all combinations of experimental factors. Oppositely, TTI provides unbiased estimates in some cases but fails to keep the quadratic relations in the imputed data. With normally distributed predictors and right-tailed missingness mechanisms, the performance of SMC-FCS is superior to that of MPC, with coverages closer to the nominal level. However, when the normality assumption is violated, the polynomial combination yields less biased estimates. Overall, both the SMC-FCS and polynomial combination methods produce plausible imputations of squared terms and outperform TTI and ITT. Differences between the approaches only become apparent under intense MARtail and MARright scenarios in simulation. However, these two mechanisms are more extreme than we are likely to see in practice since there is a strong relationship between the outcome Y and the probability of the variable X being unobserved in the tail. All in all, when differences in performance are found, such differences are small, and it may be challenging to interpret them as meaningful. This means that, in practice, the choice for an imputation approach could largely be a choice of preference.

If there is a solid, well-known scientific model, we highly recommend using SMC-FCS to sharpen results. The substantive model would then be correctly specified, ensuring that the distribution from which imputations are generated is compatible. SMC-FCS is a reliable model-based method to impute predictors with quadratic effects. It is theoretically well-grounded, and procedures are available for substantive models based on standard regression, discrete outcomes and proportional hazards van Buuren (2018). However, with an increasing number of variables in the dataset, it becomes increasingly challenging to infer the correct substantive model based on the incomplete data a priori. The strategy of applying SMC-FCS in practice is performing model selection once imputed datasets are generated to ensure the accuracy of substantive model specification, which is not a trivial process (Bartlett et al. 2015). Usually, the substantive model is specified according to prior studies or assumptions.

In contrast, we advise using the polynomial combination approach when the scientific model is less specific or when modeling efforts are challenging. It is proven to be a valid data-driven imputation method that is flexible in applying because we only need to specify the quadratic term. This makes it straightforward to implement in any imputation effort. The polynomial combination method is based on predictive mean matching, and the performance of imputation procedures involving PMM are proven to work well in a wide range of research problems (Vink et al. 2015; Rubin 1986; Little 1988). Therefore, we expect that the polynomial combination approach could be of great practical importance in incomplete data analyses with squared terms.