Background

Longitudinal data, collected from the same subjects over time, are frequently used in observational studies and clinical trials. Traditional models for longitudinal data analysis include generalized linear mixed-effects models (GLMMs), marginal models such as generalized estimating equations (GEE), and transitional models. In longitudinal studies in particular, researchers frequently face a significant challenge: missing data. During follow-up, some subjects may withdraw or become lost to follow-up at planned visits, and subjects who participate only during part of the study may complete only a subset of the information [1, 2].

Most conventional statistical models handle only complete cases, and records with missing data are omitted before model fitting (listwise deletion, the default in most statistical software). Excluding these observations has disadvantages, including loss of information, loss of precision, reduction in statistical power, and potentially biased estimates [3]. Therefore, various approaches have been introduced to impute missing values; they can be classified as either single imputation (SI) or multiple imputation (MI).

MI methods implemented in software are based on two approaches: joint modeling (JM) and fully conditional specification (FCS). JM approaches model the joint (multivariate) distribution of the incomplete variables; often the multivariate normal (MVN) distribution is used, in which case the approach is referred to as multivariate normal imputation (MVNI) [4]. In FCS approaches, missing observations of each incomplete variable are imputed given all the other variables, cycling iteratively through a sequence of univariate imputation models [5].

Several JM and FCS methods, such as JM-MVN (joint multivariate normal imputation) and standard FCS, have been proposed to handle missing values in cross-sectional studies [6,7,8,9,10,11]. These approaches are also appropriate for balanced longitudinal data, where measurements are obtained at fixed time intervals; in this case, the JM-MVN and FCS-standard methods treat the time-dependent variables as distinct variables (wide format) for imputation [3, 5].

However, longitudinal data are sometimes collected at unequal time intervals and include many longitudinal predictor variables. Standard JM-MVN and FCS methods cannot be used in this case, because the large number of time-dependent predictor variables can lead to problems such as overfitting and multicollinearity among the distinct variables.

Ignoring the longitudinal or multilevel structure of the data when imputing missing values may lead to biased estimates of regression coefficients and their standard errors [12, 13]. Recently, several studies have extended MI methods for imputing missing values in multilevel data [14, 15] and longitudinal data [4, 16, 17]. These extensions are available in several software programs, including R [9, 18,19,20,21,22,23,24,25,26], Mplus [27], STATA [28, 29], Blimp [30], REALCOM-IMPUTE [31], SAS [32], and Stat-JR [33]. In addition, in 2013 and 2016, Genolini et al. introduced several SI approaches for imputing monotone/dropout and non-monotone/intermittent missing data in longitudinal studies [34, 35].

A few studies have compared MI approaches for imputing missing values in longitudinal studies [16, 17, 36]. These studies used parametric models, such as the linear mixed-effects model (LMM), as the analysis model of interest. However, it is uncertain how well the different MI approaches perform when the statistical model of interest is a non-parametric longitudinal model. In addition, the literature contains no comparison of SI and MI approaches where the target analysis is a non-parametric longitudinal model. Hence, the present study is the first to consider non-parametric estimation methods for longitudinal data analysis following missing data imputation with SI and MI approaches.

In this study, the non-parametric longitudinal method of interest is the longitudinal regression tree algorithm proposed by Sela and Simonoff, named the random effects expectation-maximization (REEM) tree algorithm [37].

The primary purpose of this study is to evaluate the performance of MI and SI approaches for imputing missing values in longitudinal data. The longitudinal data were obtained from the Tehran cardiometabolic genetic study (TCGS) to assess the association between diastolic/systolic blood pressure (DBP/SBP) and predictor variables such as age, gender, and body mass index (BMI).

Methods

Tehran cardiometabolic genetic study (TCGS)

Subjects of the study were drawn from TCGS, an ongoing cohort study based on the framework of the Tehran Lipid and Glucose Study (TLGS). TLGS, the first prospective cohort study in West Asia, was conducted in Tehran, the capital of Iran, and was designed to assess the epidemiology of non-communicable diseases among participants from district 13 of Tehran over 24 years of follow-up. The first (baseline) phase of TLGS began in February 1999; individuals were selected via a multistage stratified cluster random sampling method and followed up every three years. The primary purpose of TLGS is to determine the prevalence of cardiovascular diseases (CVD) and the risk factors for the development of these diseases. The design of the TLGS study has been reported elsewhere [38,39,40,41,42].

In the present study, data from TCGS participants were used to evaluate the association between DBP/SBP and predictor variables such as age, sex, and BMI [43, 44]. The predictor variables sex and age had no missing values (Table S1 in the supplementary file provides sample data for the first 20 individuals of TCGS), whereas BMI and the outcome variables (DBP/SBP) had missing values in all six waves of the TCGS study. SI and MI approaches were used to impute the missing values of the incomplete variables, and the parametric LMM and the non-parametric longitudinal regression tree model were used for longitudinal data analysis after imputation. The study structure for selecting individuals and the statistical analysis plan are shown in Fig. 1. The following sections describe each step of the data analysis.

Fig. 1 The study structure for the selection of individuals and statistical analysis

Missing data mechanisms

Three missing data mechanisms are defined for generating missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [3]. The mechanisms can be distinguished by defining the outcome \(Y = (Y^{\mathrm{observed}}, Y^{\mathrm{missing}})\) and the missingness indicator \(R\) (1: observed, 0: missing). Each missing data mechanism suggests a general imputation approach.

If the probability of missingness is related to neither the observed nor the unobserved data, the data are MCAR; under this mechanism, the distribution of missingness satisfies \(P(R \mid Y) = P(R)\). Although this assumption is usually unrealistic, listwise deletion is unbiased when it holds. MI approaches were developed under the MAR assumption, which states that the probability of missingness depends on the observed data but not on the unobserved data. In this case, \(P(R \mid Y) = P(R \mid Y^{\mathrm{observed}})\), and MI methods can generate unbiased and efficient results [45]. Traditional longitudinal models such as GEE, and likelihood-based models such as GLMMs, also yield valid estimates when the missingness mechanism is MCAR or MAR, respectively [46].

The MNAR mechanism occurs when the probability of missingness is related to both the observed and the unobserved data, that is, \(P(R \mid Y) = P(R \mid Y^{\mathrm{observed}}, Y^{\mathrm{missing}})\). Selection and pattern-mixture models have been introduced to handle MNAR data [47]. MCAR is a very strong assumption, and several tests are available to assess it against the alternative that the data are not MCAR [48, 49]. The MAR assumption is the most common in practice and can be tested against MCAR; however, MNAR can never be ruled out on the basis of the observed data. The missingness mechanisms of multiple variables may also be a mixture of any or all of the mechanisms described here.

Missing data patterns

In longitudinal studies, missing values arise from non-monotone/intermittent and monotone/dropout processes [50]. A non-monotone pattern occurs when study information is unavailable for a subject at one time point but the subject returns at a subsequent time point. A monotone pattern is the opposite: once a subject misses a particular follow-up, the subject never returns. In practice, these two patterns can occur together for different measures.

SI approaches to impute missing values in longitudinal data

In SI approaches, a single value is estimated for each missing data point. In 2013 and 2016, Genolini et al. introduced several SI approaches to impute monotone/dropout and non-monotone/intermittent missing data in longitudinal studies [34, 35].

To introduce the notation for this section, consider a data set of n clusters (subjects), with a time-dependent variable recorded at t time points for each cluster. The trajectory of cluster i is the sequence \(y_{i.} = (y_{i1}, y_{i2}, \dots, y_{it})\), and the cross-sectional measurement at a particular time point k is the vector \(y_{.k} = (y_{1k}, y_{2k}, \dots, y_{nk})\). Let \(y_{ik}\) denote a missing value for cluster i at time point k. The value \(y_{ik}\) is non-monotone missing if there exist time points \(a < k < b\) at which \(y_{ia}\) and \(y_{ib}\) are observed; it is monotone missing if \(y_{ih}\) is missing for all time points \(h > k\).
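As a concrete illustration of these definitions, the following is a minimal sketch (our own helper, not code from the cited papers) that classifies each missing entry of a single trajectory:

```r
# Classify each missing entry of a trajectory y as monotone or non-monotone,
# following the definitions above.
classify_missing <- function(y) {
  t <- length(y)
  status <- ifelse(is.na(y), "missing", "observed")
  for (k in which(is.na(y))) {
    obs_before <- any(!is.na(y[seq_len(k - 1)]))
    obs_after  <- k < t && any(!is.na(y[(k + 1):t]))
    if (obs_before && obs_after) {
      status[k] <- "non-monotone"
    } else if (!obs_after) {
      status[k] <- "monotone"
    }
    # a gap before the first observed value matches neither definition
    # and keeps the generic label "missing"
  }
  status
}

classify_missing(c(21.3, NA, 22.1, NA, NA))
#> "observed" "non-monotone" "observed" "monotone" "monotone"
```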

SI methods fall into three imputation classes: cross-sectional (cross-mean, cross-median, and cross-hot deck), longitudinal (traj-mean, traj-median, traj-hot deck, last observation carried forward (LOCF), next observation carried backward (NOCB), interpolation LOCF, interpolation global, interpolation local, and interpolation bisector), and cross-sectional-longitudinal (copy-mean LOCF, copy-mean global, copy-mean local, and copy-mean bisector). Cross-sectional methods use the data observed at a specific time point (across clusters) to replace the missing values at that time point, whereas longitudinal methods use the observed data of the same cluster to impute its missing values. Cross-sectional-longitudinal methods use both the longitudinal information (\(y_{i.}\)) and the cross-sectional information (\(y_{.k}\)).

Cross methods

The cross methods include cross-mean, cross-median, and cross-hot deck. In the cross-mean method, \(y_{ik}\) is estimated by the mean of all values observed at the time point of interest; the cross-median method uses the median instead. In the cross-hot deck method, \(y_{ik}\) is estimated by a value randomly selected from the values observed at the time point of interest.

Traj methods

The traj methods include traj-mean, traj-median, and traj-hot deck. In the traj-mean method, \(y_{ik}\) is estimated by the mean of all values observed in the trajectory of interest \(y_{i.}\); the traj-median method uses the median instead. In the traj-hot deck method, \(y_{ik}\) is estimated by a value randomly selected from the observed values of the trajectory of interest.
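The following is a minimal sketch (our own helper, not the longitudinalData package implementation) of cross-mean and traj-mean imputation for a matrix with one row per cluster and one column per time point; the median and hot-deck variants replace mean() by median() or by sampling one observed value.

```r
impute_mean <- function(y, direction = c("cross", "traj")) {
  direction <- match.arg(direction)
  if (direction == "cross") {
    for (k in seq_len(ncol(y))) {          # column = one time point
      y[is.na(y[, k]), k] <- mean(y[, k], na.rm = TRUE)
    }
  } else {
    for (i in seq_len(nrow(y))) {          # row = one trajectory
      y[i, is.na(y[i, ])] <- mean(y[i, ], na.rm = TRUE)
    }
  }
  y
}
```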

LOCF (Last Observation Carried Forward) and NOCB (Next Observation Carried Backward)

In the LOCF and NOCB methods, \(y_{ik}\) is estimated by the last and the next observed value of the trajectory of interest, respectively.
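A minimal base-R sketch of both rules follows (the zoo package's na.locf function offers equivalent behaviour):

```r
locf <- function(y) {
  for (k in seq_along(y)[-1]) if (is.na(y[k])) y[k] <- y[k - 1]
  y
}
nocb <- function(y) rev(locf(rev(y)))

locf(c(NA, 2, NA, NA, 5))  # NA 2 2 2 5 (a leading gap has nothing to carry)
nocb(c(NA, 2, NA, NA, 5))  # 2 2 5 5 5
```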

Interpolation methods

Interpolation methods include interpolation-LOCF, interpolation-global, interpolation-local, and interpolation-bisector. In all four, a non-monotone missing value \(y_{ik}\) is replaced by drawing a line between the observed values immediately surrounding it: \(y_{ia}+(k-a)\frac{y_{ib}-y_{ia}}{b-a}\), where \(y_{ia}\) and \(y_{ib}\) are the values immediately surrounding \(y_{ik}\). The methods differ in how they handle monotone missing values. Interpolation-LOCF uses the LOCF or NOCB methods. Interpolation-global draws a line linking the first and last observed values. Interpolation-local imputes missing values at the beginning of the trajectory by extending the line linking the first and second observed values, and missing values at the end of the trajectory by extending the line linking the penultimate and last observed values. Finally, interpolation-bisector is an intermediate method: the imputed values are chosen on the bisector of the interpolation-global and interpolation-local lines.
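A minimal sketch (our own helper) of the shared interpolation step: each non-monotone gap is filled by the line through the nearest observed values on either side, while monotone tails are left missing for the variant-specific rules above.

```r
interp_gaps <- function(y) {
  obs <- which(!is.na(y))
  for (k in which(is.na(y))) {
    a <- max(obs[obs < k], -Inf)           # nearest observed time before k
    b <- min(obs[obs > k], Inf)            # nearest observed time after k
    if (is.finite(a) && is.finite(b)) {
      y[k] <- y[a] + (k - a) * (y[b] - y[a]) / (b - a)
    }
  }
  y
}

interp_gaps(c(1, NA, NA, 4, NA))  # 1 2 3 4 NA (the trailing gap is monotone)
```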

Copy-mean methods

The copy-mean methods include copy-mean LOCF, copy-mean global, copy-mean local, and copy-mean bisector. The copy-mean LOCF method imputes a missing observation in two steps: first, missing values are imputed with the LOCF method to provide an initial approximation; then the mean trajectory of the population is used to refine this approximation. Let

\(({\overline{y}}_{.1}, \dots, {\overline{y}}_{.t})\): the mean trajectory of the population;

\(y_{ik}\): the first missing value of the ith trajectory;

\(y_{ik}^{\mathrm{LOCF}}\): the value imputed for \(y_{ik}\) by the LOCF method (defined at every missing time point);

\(({\overline{y}}_{.1}^{\mathrm{LOCF}}, \dots, {\overline{y}}_{.t}^{\mathrm{LOCF}})\): the mean trajectory of the population after LOCF imputation;

\(\mathrm{AV}_{k}\): the average variation at the kth time point, equal to \({\overline{y}}_{.k}-{\overline{y}}_{.k}^{\mathrm{LOCF}}\).

The copy-mean LOCF imputation of the missing value \(y_{ik}\) is then obtained by adding \(\mathrm{AV}_{k}\) to the LOCF-imputed value: \(y_{ik}^{\mathrm{LOCF}}+\mathrm{AV}_{k}\).

The copy-mean local, copy-mean global, and copy-mean bisector methods follow the same strategy as copy-mean LOCF, except that they use the interpolation-local, interpolation-global, and interpolation-bisector methods, respectively, to provide the initial approximation of the missing values.
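A minimal sketch (our own helper) of copy-mean LOCF for a matrix with one row per cluster, reusing the locf() rule sketched earlier; trajectories whose gaps precede any observed value keep NA after step 1.

```r
locf <- function(y) {
  for (k in seq_along(y)[-1]) if (is.na(y[k])) y[k] <- y[k - 1]
  y
}

copy_mean_locf <- function(y) {
  y_locf <- t(apply(y, 1, locf))                                    # step 1: rough LOCF fill
  av <- colMeans(y, na.rm = TRUE) - colMeans(y_locf, na.rm = TRUE)  # AV_k
  miss <- is.na(y)
  y[miss] <- sweep(y_locf, 2, av, `+`)[miss]                        # step 2: refine by AV_k
  y
}
```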

MI approaches to impute missing values in longitudinal data

MI approaches were proposed by Rubin in 1987 [51] and are flexible, popular methods for imputing missing values based on three steps: imputation, analysis, and pooling [52, 53]. MI approaches use a Bayesian strategy in which posterior estimation is conducted with the Markov chain Monte Carlo (MCMC) method. They can be divided into the JM and FCS approaches described in the following sections.

In the imputation step, missing data are estimated several times by sampling from their posterior predictive distribution given the observed data and the parameter values of the imputation model. In the next step, the multiple complete data sets produced by the imputation step are analyzed using the statistical model of interest. Finally, the results of the analysis step, such as the estimated regression coefficients with their standard errors and criteria of predictive performance, are pooled using Rubin's rules to account for the uncertainty of the imputation [3, 51]. This process is shown graphically in Figure S1 (supplementary file).

Generating more than ten imputed datasets positively affects statistical power [54], and Enders (2010) argues that 20 imputed datasets are appropriate [55]. Accordingly, we generated 20 imputed datasets when applying the MI approaches to the TCGS study.
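As an illustration of the three MI steps, here is a minimal sketch with the mice package; the data frame tcgs and the simple lm() analysis model are placeholders, and the imputation and analysis models actually used in this study are described in the following sections.

```r
library(mice)

imp <- mice(tcgs, m = 20, seed = 123)         # imputation: 20 completed datasets
fit <- with(imp, lm(DBP ~ Age + Sex + BMI))   # analysis: fit the model to each
summary(pool(fit))                            # pooling: combine by Rubin's rules
```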

MI approaches have an attractive property: auxiliary variables can be included in the imputation step without being part of the analysis step. Auxiliary variables carry information about the missing data and improve the quality of the imputation [45]. Researchers can also include a larger number of these variables to reduce the negative effects of MNAR [26].

MI approaches rest on the MAR assumption, and including auxiliary variables in the imputation step raises the plausibility of this assumption. The imputation model should include all of the variables used in the analysis step; otherwise, the analysis may generate biased estimates. In real applications, however, imputation models with many variables can lead to problems such as multicollinearity and non-convergence [26].

In the present study, we used a number of MI approaches available in R software for missing data imputation; these approaches include joint multivariate LMM (JM-MLMM) [4], joint multivariate LMM with heteroskedastic covariance matrices across all clusters (JM-MLMM-het) [56], full joint (JM-FJ) [57], substantive model compatible joint modelling approach (JM-SMC) [58], substantive model compatible joint modelling approach with heteroskedastic covariance matrices across all clusters (JM-SMC-het) [59], FCS-LMM [26], FCS-LMM with heteroskedastic residual variance across all clusters (FCS-LMM-het) [26], FCS-LMM-latent normal (FCS-LMM-LN) [60], FCS-LMM-LN with heteroskedastic residual variance across all clusters (FCS-LMM-LN-het) [60], FCS-MLMM with latent normal variables (FCS-MLMM-LN) [22], hierarchical multiple imputation (hmi) [23], and joint analysis and imputation (JointAI) [61]. Each approach is explained in the following sections. In addition, we mention the number of iterations, burn-in period, and convergence criteria for each MI approach.

JM-MLMM method

Schafer and Yucel (2002) introduced the JM-MLMM method, which imputes missing values in longitudinal data with a joint multivariate linear mixed model rather than treating the time-dependent variables as distinct variables. In this method, qualitative variables are imputed as continuous or dummy variables. The JM-MLMM method assumes that the random effects follow a normal distribution with a constant covariance matrix across all subjects (clusters) [4].

JM-FJ method

In real applications, longitudinal data with missing values may contain a mixture of qualitative and quantitative variables, so the normality assumption for the incomplete variables may be unrealistic. Goldstein et al. (2009) proposed the JM-MLMM-LN method, which extends JM-MLMM with latent normal (LN) variables to impute a mixture of normal and non-normal variables [57]. Asparouhov and Muthen (2010) proposed the JM-FJ method, which builds on JM-MLMM-LN and treats all variables in the imputation process as outcome variables.

These two methods and the JM-MLMM method are implemented in the mitml package [19]. A potential scale reduction criterion near 1 (below 1.05) for all parameters, together with diagnostic plots, was used to assess convergence; if the criterion exceeded 1.05, the number of burn-in iterations was increased.
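A minimal sketch of JM-type imputation with mitml's panImpute follows; the data frame tcgs, its variable names, and the cluster identifier ID are placeholders, and the iteration settings match those reported for the JM approaches below.

```r
library(mitml)

fml <- BMI + DBP ~ Age + Sex + (1 | ID)      # incomplete variables on the left
imp <- panImpute(tcgs, formula = fml, n.burn = 5000, n.iter = 1000, m = 20)
summary(imp)                                 # reports potential scale reduction (Rhat)
implist <- mitmlComplete(imp, "all")         # the 20 completed datasets
```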

JM-SMC and JM-SMC-het methods

Goldstein et al. (2014) extended the JM-MLMM-LN method to the JM-SMC method by defining the joint imputation model as the product of the analysis model and the joint distribution of the variables [58]. The JM-SMC method can also accommodate covariance matrices that vary across subjects, in which case it is called JM-SMC-het. These methods use diagnostic plots to assess convergence.

In the present study, all JM approaches were run with 1000 iterations and a burn-in period of 5000 iterations to ensure the stability of the parameter distributions.

FCS-LMM and FCS-LMM-het methods

FCS-LMM, proposed by van Buuren et al. (2011), is the FCS counterpart of the JM-MLMM method. It fits a multilevel LMM to impute the missing values of each incomplete variable conditional on the other variables, cycling iteratively through univariate imputation models. Qualitative variables are imputed as continuous variables or as dummy variables. The FCS-LMM method assumes a normal distribution for every variable with missing values and a fixed residual variance across all subjects [26]. Van Buuren (2011) extended FCS-LMM to FCS-LMM-het, which allows heteroskedastic residual variance across clusters [14].
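A minimal sketch of FCS-LMM-type imputation with the multilevel methods in mice ("2l.pan" fits a homoscedastic linear mixed model; "2l.norm" is its heteroscedastic counterpart); the long-format data frame tcgs_long and its integer-coded cluster identifier ID are placeholders.

```r
library(mice)

pred <- make.predictorMatrix(tcgs_long)
pred[c("BMI", "DBP", "SBP"), "ID"] <- -2      # -2 flags the cluster identifier
meth <- make.method(tcgs_long)
meth[c("BMI", "DBP", "SBP")] <- "2l.pan"      # or "2l.norm" for FCS-LMM-het

imp <- mice(tcgs_long, method = meth, predictorMatrix = pred,
            m = 20, maxit = 20, seed = 123)
```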

FCS-LMM-LN and FCS-LMM-LN-het methods

Enders et al. (2017) proposed the FCS-LMM-LN method by extending the FCS-LMM method to latent normal variables [60]. This method imputes a missing value with a value randomly selected from the observed values whose predicted means, based on the LMM, are nearest to that of the missing observation. In the FCS-LMM-LN-het method, continuous variables with missing values are imputed using an LMM with heteroskedastic residual variance across clusters.

FCS-MLMM-LN method

Audigier and Resche-Rigon modified the JM-MLMM-LN approach to impute missing observations within an FCS framework in which only one variable is treated as missing at a time [22]. At each step, all of the variables in the imputation model are treated as outcomes (one variable is treated as incomplete and the rest are treated as complete). In this approach, incomplete binary and categorical variables are imputed using latent normal variables, as in the JM-MLMM-LN method.

All of the FCS approaches mentioned above were run with 20 iterations and a burn-in period of 5 iterations to ensure the stability of the parameter distributions (in these approaches, the estimates typically converge within 5 to 10 iterations). In addition, diagnostic graphs were used to examine convergence [26].

Non-parametric longitudinal analysis method

The longitudinal tree-based methods are non-parametric methods for analyzing longitudinal data. Medical studies have used these methods to determine disease risk factors and identify high- and low-risk subgroups of patients [62, 63] by extracting homogeneous subgroups of observations that can be appropriately used for subgroup analysis [64]. Since most studies evaluating longitudinal changes in the outcome variable are conducted in the context of a heterogeneous population, traditional parametric longitudinal models might not provide a good fit and could potentially result in biased estimates. In addition, the actual values of the model parameters may differ between homogeneous subgroups. Because the tree-based models can extract homogeneous subgroups of observations and estimate heterogeneous treatment effects, they may be better positioned to assist the clinician in decision-making [65, 66].

Unlike traditional parametric longitudinal models, these methods require no assumptions about the functional form of the data and are robust to outliers and multicollinearity. They can accommodate non-linear relationships and high-order interactions, monotone transformations of the predictor variables have no effect on the results, and their interpretation is straightforward because the results are displayed graphically [67,68,69,70].

The classification and regression tree (CART) algorithm is the best-known tree algorithm for cross-sectional data modeling [71]. Sela and Simonoff (2012) extended it to longitudinal data by combining the LMM with the CART algorithm; the resulting longitudinal regression tree algorithm is the random effects expectation-maximization (REEM) tree algorithm [37].

The LMM uses a parametric linear form for the fixed effects; this form cannot easily capture complex non-linear relationships or handle datasets with very large numbers of predictor variables. The REEM tree algorithm solves this problem by using a non-parametric method, the CART algorithm, to estimate the fixed effects. The estimation procedure of the REEM tree is as follows:

1. Set the initial values of \({\widehat{b}}_{i}\) to zero.

2. Repeat the following steps until \({\widehat{b}}_{i}\) converges (convergence is declared when the change in the likelihood or restricted likelihood is smaller than a predetermined tolerance, e.g. 0.001):

a) Fit a regression tree with the CART algorithm to obtain an approximation of \(f\), using the adjusted response \({y}_{it}-{Z}_{it}{\widehat{b}}_{i}\) and the predictor variables \({x}_{it}=({x}_{it1}, \dots, {x}_{itK})\), for \(i = 1, \dots, I\) and \(t = 1, \dots, {T}_{i}\). The tree yields a set of indicator variables \(I({x}_{it}\in {g}_{p})\), where \({g}_{p}\) ranges over the terminal nodes of the tree.

b) Fit the LMM \({y}_{it}={Z}_{it}{b}_{i}+\sum_{p}I\left({x}_{it}\in {g}_{p}\right){\mu }_{p}+{\varepsilon }_{it}\) and extract \({\widehat{b}}_{i}\) from the fitted model.

3. Replace the predicted response at each terminal node of the tree with the estimate \({\widehat{\mu }}_{p}\) from the LMM fitted in step 2b.
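This procedure is implemented in the REEMtree R package. A minimal sketch of its use follows (the long-format data frame tcgs_long and its column names are placeholders; ErrorTolerance corresponds to the tolerance mentioned in step 2):

```r
library(REEMtree)

fit <- REEMtree(DBP ~ Age + Sex + BMI, data = tcgs_long,
                random = ~1 | ID, ErrorTolerance = 0.001)
plot(tree(fit))    # the fitted CART part
text(tree(fit))
ranef(fit)         # estimated random effects b_i
```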

Simulation study

The performance of the various imputation approaches was compared using simulated data. We generated 1000 datasets, each including 1000 individuals, mimicking the TCGS data. In each simulated dataset, variables were generated as follows (all parameters of the data-generating models were estimated from the original data so that the simulated datasets resemble the real data example):

1. Sex was generated from a binomial distribution with probability 0.5.

2. Age at the first wave was generated from a truncated normal distribution with minimum = 1, maximum = 84, mean = 39.34, and standard deviation = 16.23. Age at the other waves was generated as:

$${\mathrm{Age}}_{ij}={\mathrm{Age}}_{i1}+\left(j-1\right)\times 3,\quad i=1, \dots, 3645,\ j=2, \dots, 6$$
3. The main predictor variable (BMI) at each wave was generated from age and sex together with an individual-level random effect and wave-specific noise:

$${\mathrm{BMI}}_{ij}=19.86+0.136\times {\mathrm{Age}}_{ij}+2.360\times {\mathrm{Sex}}_{i}+{\phi }_{0i}+{\phi }_{ij},\quad i=1, \dots, 3645,\ j=1, \dots, 6$$

where \({\phi }_{0i}\sim N(0, 4.50)\) is the random intercept and \({\phi }_{ij}\sim N(0, 1.57)\) is the residual error.

4. The outcome variable (DBP) at each wave was generated by the following linear process:

$${\mathrm{DBP}}_{ij}=55.03+0.098\times {\mathrm{Age}}_{ij}-3.434\times {\mathrm{Sex}}_{i}+0.707\times {\mathrm{BMI}}_{ij}+{\gamma }_{0i}+{\gamma }_{ij},\quad i=1, \dots, 3645,\ j=1, \dots, 6$$

where \({\gamma }_{0i}\sim N(0, 6.35)\) is the random intercept and \({\gamma }_{ij}\sim N(0, 6.88)\) is the residual error.

After generating the simulated data, missingness in some observations of BMI and DBP was generated under the MAR mechanism, using the following equations:

$$\mathrm{logit }\left(\mathrm{P}\left({\mathrm{BMI}}_{\mathrm{ij}}=\mathrm{missing}\right)\right)={\beta }_{0\mathrm{j}}+{\beta }_{1\mathrm{j}}{\mathrm{Age}}_{\mathrm{ij}}+{\beta }_{2\mathrm{j}}{\mathrm{DBP}}_{\mathrm{ij}}$$
$$\mathrm{logit }\left(\mathrm{P}\left({\mathrm{DBP}}_{\mathrm{ij}}=\mathrm{missing}\right)\right)={\varphi }_{0\mathrm{j}}+{\varphi }_{1\mathrm{j}}{\mathrm{Age}}_{\mathrm{ij}}+{\varphi }_{2\mathrm{j}}{\mathrm{BMI}}_{\mathrm{ij}}$$

The parameters \({\beta }_{0j}\), \({\beta }_{1j}\), \({\beta }_{2j}\), \({\varphi }_{0j}\), \({\varphi }_{1j}\), and \({\varphi }_{2j}\) were determined from the TCGS data to reproduce a similar proportion of missing data for each variable at each wave; the proportions in both the TCGS study and the simulation study are shown in Table 1. The parameters are as follows:

Table 1 The proportions of missing data in the TCGS study and the simulation study
$${\upbeta }_{0}=\left\{-2.646, -2.634, -3.047, -3.271, -2.872, -2.440\right\}$$
$${\varphi}_{0}=\left\{-1.701, -1.815, -2.091, -2.386, -2.104, -1.522\right\}$$
$${\upbeta }_{1}={\varphi}_{1}=0.002$$
$${\upbeta }_{2}={\varphi}_{2}=0.02$$
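A minimal sketch of generating one simulated dataset under the models above follows. The truncnorm package supplies the truncated normal; reading N(0, v) as variance v is our assumption, and the subject count follows the index range in the equations.

```r
library(truncnorm)
set.seed(1)

n <- 3645; waves <- 6
sex  <- rbinom(n, 1, 0.5)
age1 <- rtruncnorm(n, a = 1, b = 84, mean = 39.34, sd = 16.23)
age  <- outer(age1, (seq_len(waves) - 1) * 3, `+`)    # Age_ij = Age_i1 + (j-1)*3

phi0 <- rnorm(n, sd = sqrt(4.50))                     # random intercepts
bmi  <- 19.86 + 0.136 * age + 2.360 * sex + phi0 +
  matrix(rnorm(n * waves, sd = sqrt(1.57)), n, waves)

gam0 <- rnorm(n, sd = sqrt(6.35))
dbp  <- 55.03 + 0.098 * age - 3.434 * sex + 0.707 * bmi + gam0 +
  matrix(rnorm(n * waves, sd = sqrt(6.88)), n, waves)

# MAR missingness, wave by wave; both probabilities are computed from the
# complete values before either variable is masked
beta0   <- c(-2.646, -2.634, -3.047, -3.271, -2.872, -2.440)
varphi0 <- c(-1.701, -1.815, -2.091, -2.386, -2.104, -1.522)
for (j in seq_len(waves)) {
  p_bmi <- plogis(beta0[j]   + 0.002 * age[, j] + 0.02 * dbp[, j])
  p_dbp <- plogis(varphi0[j] + 0.002 * age[, j] + 0.02 * bmi[, j])
  na_bmi <- runif(n) < p_bmi
  na_dbp <- runif(n) < p_dbp
  bmi[na_bmi, j] <- NA
  dbp[na_dbp, j] <- NA
}
```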

Criteria for comparing the performance of missing data imputation methods

The performance of the imputation approaches under the MAR mechanism, and of the statistical methods applied to the real and simulated data sets, was compared using the standard errors of the regression coefficients (SE), the mean squared error (MSE), the root mean squared error (RMSE), the mean absolute deviation (MAD), the deviance, and the Akaike information criterion (AIC). Smaller values of these criteria indicate better performance.
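For clarity, the prediction-based criteria can be computed as follows (standard definitions assumed; y denotes the observed outcome and yhat the model prediction):

```r
mse  <- function(y, yhat) mean((y - yhat)^2)
rmse <- function(y, yhat) sqrt(mse(y, yhat))
mad_ <- function(y, yhat) mean(abs(y - yhat))  # underscore avoids masking stats::mad
```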

Software programs

R software was used for missing longitudinal data imputation and for data analysis; the R packages used are listed in Table 2. The R code for the SI and MI imputation approaches and for the data simulation is available at https://github.com/MinaJahangiri/R-codes-of-missing-imputation-methods.

Table 2 R packages for data analysis

Results

The study's variables with missing values are BMI (a predictor variable) and DBP/SBP (the outcome variables). Figure 2 shows the frequency of missing values for these variables at each phase of the TCGS study. Descriptive statistics of the TCGS data are shown in Table S2 (supplementary file). The percentage of missing values in each combination of variables, based on the long data format, is shown in Figure S2 (supplementary file). Seventy-six percent of observations have no missing values, and 21% have missing values for BMI, DBP, and SBP simultaneously. The missing data pattern of the TCGS data indicated that both monotone and non-monotone patterns were present. A formal MCAR test indicated that the MCAR assumption is not reasonable (P < 0.001; the assumption is rejected at the 0.05 significance level).

Fig. 2 Frequency of missing values for body mass index (BMI), diastolic blood pressure (DBP), and systolic blood pressure (SBP) at each phase of the TCGS study (BMI: 918, 969, 690, 612, 844, and 1199; DBP: 907, 849, 669, 526, 712, and 1084; SBP: 907, 840, 669, 526, 712, and 1082, respectively)

After the imputation of missing values, the results of data modeling with the LMM and the REEM tree algorithm for the two outcome variables (DBP and SBP) are shown in Tables 3, 4, 5 and 6. According to Tables 3 and 5, all SI approaches except cross-median and cross-hot deck performed similarly with respect to the estimated regression coefficients and their standard errors.

Table 3 Results of linear mixed effects model for diastolic blood pressure (DBP)
Table 4 Results of random effects expectation–maximization (REEM) tree algorithm for diastolic blood pressure (DBP)
Table 5 Results of linear mixed effects model for systolic blood pressure (SBP)
Table 6 Results of random effects expectation–maximization (REEM) tree algorithm for systolic blood pressure (SBP)

The results shown in Tables 3, 4, 5 and 6 indicate that parametric longitudinal models are not appropriate for analyzing the TCGS data. The relationship with BMI appears non-linear, and the diagnostic plots (standardized residuals versus BMI, the quantile-quantile (QQ) plot of residuals, and standardized residuals versus fitted values) show that the homoscedasticity and normality assumptions of the LMM are clearly violated. These assumptions appear to be violated regardless of the SI approach used; as an illustration, we show the diagnostic plots for the traj-mean method in Figures S3, S4, S5, S6, S7 and S8 (supplementary file). We compared the parametric and non-parametric models under the SI approaches, and the parametric longitudinal models yielded larger MSE, RMSE, and MAD than the longitudinal regression tree algorithm (Tables 3, 4, 5 and 6). Given these advantages of the non-parametric model, the comparison of SI and MI imputation approaches is reported only for the REEM tree algorithm.

Among the SI approaches used in conjunction with the REEM tree algorithm, the traj-mean method performed best (lowest MSE, RMSE, MAD, and deviance) and cross-hot deck performed worst (Tables 4 and 6). The tree structures of the REEM tree algorithm using traj-mean imputation for the two outcome variables, DBP and SBP, are shown in Figures S9 and S10 (supplementary file).

Density plots of the observed and imputed data for the incomplete variables BMI, DBP, and SBP, produced with the mice package, are shown in Figures S11, S12, S13 and S14 (supplementary file). Figure S15 (supplementary file) shows trace plots of the mean and standard deviation of the imputed values against the iteration number for each replication. These trace lines are intermingled without any particular trend, indicating that the estimation has converged. Due to space limitations, trace plots for the other FCS approaches are not shown, though they also indicated convergence. In addition, the Rhat statistics of the means and variances of all incomplete variables under the FCS-LMM-LN and FCS-LMM-LN-het methods are near one, so the convergence of these methods is also established (Table S3 in supplementary file).

Among the FCS approaches, FCS-LMM-LN performed best and FCS-LMM/FCS-LMM-het performed worst (Tables 4 and 6). All of the JM approaches performed similarly (Tables 4 and 6). The diagnostic plots of the JM approaches indicate convergence, and the potential scale reduction criterion was near 1 for all parameters of the JM-FJ and JM-MLMM methods. Due to space limitations, diagnostic plots are included only for the JointAI and JM-MLMM methods, in Figures S16, S17, S18, S19 and S20 (supplementary file), respectively.

The simulation results were consistent with the real data analyses. The longitudinal regression tree algorithm performed better than the LMM for analyzing the simulated data under the MAR mechanism. In addition, the SI traj-mean method performed better (lowest MSE, RMSE, and MAD) than the other imputation approaches (Tables 7 and 8). We did not assess bias because the longitudinal tree algorithm, unlike the LMM, does not produce regression coefficient estimates; instead, we based our evaluation of the methods on predictive performance.

Table 7 Simulation results of linear mixed effects model for diastolic blood pressure (DBP)
Table 8 Simulation results of random effects expectation–maximization (REEM) tree algorithm for diastolic blood pressure (DBP)

Discussion

Missing values are a significant problem in longitudinal studies, and managing this problem is essential. In the current study, we compared the performance of SI and MI approaches for imputing longitudinal missing data when the LMM and the REEM tree algorithm are used for data modeling. Previous studies have compared the performance of MI approaches when the statistical model of interest is a parametric longitudinal model; their performance when the model of interest is a non-parametric longitudinal model is less well understood.

The current study provides a comprehensive assessment of missing data imputation approaches for the TCGS dataset and for simulated data under the MAR mechanism. To this end, we compared the performance of 16 SI approaches and 12 MI approaches when fitting the REEM tree algorithm and the LMM to assess the association between DBP/SBP and predictor variables such as age, gender, and BMI. We also provide R code for data modeling after applying the SI and MI approaches, as well as for simulating longitudinal data with missing values.

The real and simulated data results suggest that the REEM tree algorithm can perform better than parametric longitudinal models. Tree algorithms have several advantages over parametric longitudinal models, and we encourage researchers to consider them in future longitudinal studies. These algorithms can accommodate large data sets, non-linear relationships, and interactions, and can extract homogeneous subgroups of data. They are robust to multicollinearity and outliers, and their interpretation is straightforward because the results are displayed graphically. They are also invariant to monotone transformations of independent variables and require no additional distributional assumptions [67, 68, 72,73,74,75].

Generally, the comparisons indicated little difference between the imputation methods. However, one SI approach (traj-mean) performed best among all imputation approaches when fitting the REEM tree algorithm, for both outcome variables (DBP and SBP).

In addition, we evaluated the computational time of the imputation approaches. JM approaches are much more resource-intensive than FCS approaches and may not be practicable in longitudinal studies with many clusters and predictor variables with high missing rates. In FCS approaches, the estimates can converge within 5 or 10 iterations, whereas JM approaches require many more iterations, so FCS approaches are computationally faster than JM approaches. Overall, MI approaches require more computing resources than SI approaches. Therefore, in real applications with many clusters (e.g., TCGS), SI approaches are likely to be more cost-effective in terms of computational time. The SI approaches are not based on maximum likelihood estimation and can be classified as non-parametric imputation approaches; they are appropriate for non-parametric trajectories (e.g., the BMI trajectories in the TCGS data). Based on its computational time and performance, the traj-mean SI method appears to be an excellent approach for handling missing observations in data similar to the TCGS data set.

In the current study, we only formally assessed the MCAR assumption. Sensitivity analyses (selection models and pattern-mixture models) can be performed to assess the appropriateness of the MAR assumption, but unfortunately these models are not available for longitudinal data with missing values in both longitudinal quantitative predictor and outcome variables [17, 76,77,78].

Past studies compared MI approaches for fitting parametric longitudinal models, such as the LMM with random intercepts and the LMM with random intercepts and slopes, and indicated that all MI approaches provide consistent regression coefficients [16, 17]. Some studies compared these imputation methods in the context of multilevel data and reached the same conclusion [15]. In addition, two studies comprehensively compared SI approaches for imputing monotone and non-monotone missing data in longitudinal studies; unlike the current study, they found the copy-mean method more effective than the other SI approaches [34, 35]. Similarly, Zhang (2016) indicated that the copy-mean method performed best for imputing missing values [79].

To the best of our knowledge, the present study is the first to compare SI and MI approaches for imputing missing longitudinal data with many time points, clusters, and missing values using both real and simulated data. However, the present study has two limitations: one relates to the violation of the parametric longitudinal model assumptions, and the other to the computational time of the simulation study. We used SI approaches and a non-parametric longitudinal model, the REEM tree algorithm, to mitigate these limitations. For future studies, non-parametric imputation methods using a multivariate skew-normal distribution for the random effects could be used to impute missing longitudinal data. In addition, for data collected at unequal time intervals, functional data analysis could be helpful.

Conclusion

The results of this study should be generalized to other data sets with caution. Imputation methods can perform differently on different data sets, and conditions such as the missing data mechanism or the rate of missingness may lead analysts to choose different imputation options. We therefore recommend that researchers apply all imputation methods (SI and MI) in the context of fitting their statistical models, and then select the imputation method that performs best according to the criteria highlighted in this paper.